
Proceedings of the 2015 International Conference on the World Wide Web

Fullname: Proceedings of the 24th International Conference on World Wide Web
Editors: Aldo Gangemi; Stefano Leonardi; Alessandro Panconesi
Location: Florence, Italy
Dates: 2015-May-18 to 2015-May-22
Standard No: ISBN: 978-1-4503-3469-3; ACM DL: Table of Contents; hcibib: WWW15-1
Links: Conference Website

Companion Proceedings of the 2015 International Conference on the World Wide Web

Fullname: Companion Proceedings of the 24th International Conference on World Wide Web
Editors: Aldo Gangemi; Stefano Leonardi; Alessandro Panconesi
Location: Florence, Italy
Dates: 2015-May-18 to 2015-May-22
Standard No: ISBN: 978-1-4503-3473-0; ACM DL: Table of Contents; hcibib: WWW15-2
Links: Conference Website
  1. WWW 2015-05-18 Volume 1
    1. Technical Papers
    2. Technical Papers 2
  2. WWW 2015-05-18 Volume 2
    1. Posters
    2. Demonstrations
    3. WebSci Track Papers & Posters
    4. Industrial Track
    5. PhD Symposium
    6. AW4CITY 2015
    7. BigScholar 2015
    8. DAEN 2015
    9. KET 2015
    10. LiLE 2015
    11. LIME 2015
    12. LocWeb 2015
    13. MSM 2015
    14. MWA 2015
    15. NewsWWW 2015
    16. OOEW 2015
    17. RDSM 2015
    18. SAVE-SD 2015
    19. SIMPLEX 2015
    20. SocialNLP 2015
    21. SOCM 2015
    22. SWDM 2015
    23. TargetAd 2015
    24. TempWeb 2015
    25. WDS4SC 2015
    26. WebET 2015
    27. WebQuality 2015
    28. WIC 2015
    29. WSREST 2015
    30. Tutorials
    31. Workshop Summaries

WWW 2015-05-18 Volume 1

Technical Papers

Optimizing Display Advertising in Online Social Networks BIBAFull-Text 1-11
  Zeinab Abbassi; Aditya Bhaskara; Vishal Misra
Advertising is a significant source of revenue for most online social networks. Conventional online advertising methods need to be customized for online social networks in order to address their distinct characteristics. Recent experimental studies have shown that providing social cues along with ads, e.g., information about friends who liked or clicked on the ad, leads to higher click rates. In other words, the probability of a user clicking an ad is a function of the set of friends that have clicked it. In this work, we propose formal probabilistic models to capture this phenomenon, and study the algorithmic problem that then arises. Our work is in the context of display advertising, where a contract is signed to show an ad to a pre-determined number of users. The problem we study is the following: given a certain number of impressions, what is the optimal display strategy, i.e., the optimal order and subset of users to show the ad to, so as to maximize the expected number of clicks? Unlike previous models of influence maximization, we show that this optimization problem is hard to approximate in general, and that it is related to finding dense subgraphs of a given size. In light of the hardness result, we propose several heuristic algorithms, including a two-stage algorithm inspired by influence-and-exploit strategies in viral marketing. We evaluate the performance of these heuristics on real data sets, and observe that our two-stage heuristic significantly outperforms the natural baselines.
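The influence-and-exploit idea behind such two-stage heuristics can be sketched generically. This is an illustration only, not the authors' algorithm: the graph, the `click_prob` model of the social-cue effect, and the high-degree seeding rule are all hypothetical assumptions.

```python
import random

def two_stage_display(graph, click_prob, budget, seed_frac=0.3, rng=None):
    """Two-stage influence-and-exploit heuristic (a sketch).

    graph: dict user -> list of friends
    click_prob(u, clicked_friends): hypothetical model of how friends'
        clicks raise u's click probability
    Stage 1 spends part of the impression budget on high-degree "seed"
    users to generate initial clicks; stage 2 greedily shows the ad to
    whichever user currently has the highest click probability.
    """
    rng = rng or random.Random(0)
    users = sorted(graph, key=lambda u: len(graph[u]), reverse=True)
    n_seed = int(budget * seed_frac)
    shown, clicked = [], set()

    def show(u):
        shown.append(u)
        if rng.random() < click_prob(u, clicked & set(graph[u])):
            clicked.add(u)

    for u in users[:n_seed]:                      # stage 1: influence
        show(u)
    remaining = list(users[n_seed:])
    while len(shown) < budget and remaining:      # stage 2: exploit
        best = max(remaining,
                   key=lambda u: click_prob(u, clicked & set(graph[u])))
        remaining.remove(best)
        show(best)
    return shown, clicked
```

The greedy stage deliberately re-evaluates probabilities after every impression, since each new click can change the best next user to target.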
Frankenplace: Interactive Thematic Mapping for Ad Hoc Exploratory Search BIBAFull-Text 12-22
  Benjamin Adams; Grant McKenzie; Mark Gahegan
Ad hoc keyword search engines built using modern information retrieval methods do a good job of handling fine-grained queries. However, they perform poorly at facilitating spatial and spatially-embedded thematic exploration of the results, despite the fact that many queries, e.g., "civil war," refer to different documents and topics in different places. This is not for lack of data: geographic information, such as place names, events, and coordinates, is common in unstructured document collections on the web. The associations between geographic and thematic contents in these documents can provide a rich groundwork to organize information for exploratory research. In this paper we describe the architecture of an interactive thematic map search engine, Frankenplace, designed to facilitate document exploration at the intersection of theme and place. The map interface enables a user to zoom the geographic context of their query in and out, and to quickly explore thousands of search results in a meaningful way. By combining topic models with geographically contextualized search results, users can discover related topics based on geographic context. Frankenplace utilizes a novel indexing method called geoboost for boosting terms associated with cells on a discrete global grid. The resulting index factors in the geographic scale of the place or feature mentioned in related text, the relative textual scope of the place reference, and the overall importance of the containing document in the document network. The system currently indexes over 5 million documents from the web, including the English Wikipedia and online travel blog entries. We demonstrate that Frankenplace can support four distinct types of exploratory search tasks while being adaptive to scale and location of interest.
Towards Reconciling SPARQL and Certain Answers BIBAFull-Text 23-33
  Shqiponja Ahmetaj; Wolfgang Fischl; Reinhard Pichler; Mantas Šimkus; Sebastian Skritek
SPARQL entailment regimes are strongly influenced by the big body of works on ontology-based query answering, notably in the area of Description Logics (DLs). However, the semantics of query answering under SPARQL entailment regimes is defined in a more naive and much less expressive way than the certain answer semantics usually adopted in DLs. The goal of this work is to introduce an intuitive certain answer semantics also for SPARQL and to show the feasibility of this approach. For OWL 2 QL entailment, we present algorithms for the evaluation of an interesting fragment of SPARQL (the so-called well-designed SPARQL). Moreover, we show that the complexity of the most fundamental query analysis tasks (such as query containment and equivalence testing) is not negatively affected by the presence of OWL 2 QL entailment under the proposed semantics.
Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org BIBAFull-Text 34-44
  Tim Althoff; Jure Leskovec
Online crowdfunding platforms like DonorsChoose.org and Kickstarter allow specific projects to get funded by targeted contributions from a large number of people. Critical for the success of crowdfunding communities is recruitment and continued engagement of donors. With donor attrition rates above 70%, a significant challenge for online crowdfunding platforms as well as traditional offline non-profit organizations is the problem of donor retention. We present a large-scale study of millions of donors and donations on DonorsChoose.org, a crowdfunding platform for education projects. Studying an online crowdfunding platform allows for an unprecedented detailed view of how people direct their donations. We explore various factors impacting donor retention which allows us to identify different groups of donors and quantify their propensity to return for subsequent donations. We find that donors are more likely to return if they had a positive interaction with the receiver of the donation. We also show that this includes appropriate and timely recognition of their support as well as detailed communication of their impact. Finally, we discuss how our findings could inform steps to improve donor retention in crowdfunding communities and non-profit organizations.
Budget-Constrained Item Cold-Start Handling in Collaborative Filtering Recommenders via Optimal Design BIBAFull-Text 45-54
  Oren Anava; Shahar Golan; Nadav Golbandi; Zohar Karnin; Ronny Lempel; Oleg Rokhlenko; Oren Somekh
It is well known that collaborative filtering (CF) based recommender systems provide better modeling of users and items associated with considerable rating history. The lack of historical ratings results in the user and item cold-start problems. The latter is the main focus of this work. Most of the current literature addresses this problem by integrating content-based recommendation techniques to model the new item. However, in many cases such content is not available, and the question that arises is whether this problem can be mitigated using CF techniques only. We formalize this problem as an optimization problem: given a new item, a pool of available users, and a budget constraint, select which users to assign the task of rating the new item so as to minimize the prediction error of our model. We show that the objective function is monotone-supermodular, and propose efficient optimal-design-based algorithms that attain an approximation to its optimum. Our findings are verified by an empirical study using the Netflix dataset, where the proposed algorithms outperform several baselines for the problem at hand.
Improved Theoretical and Practical Guarantees for Chromatic Correlation Clustering BIBAFull-Text 55-65
  Yael Anava; Noa Avigdor-Elgrabli; Iftah Gamzu
We study a natural generalization of the correlation clustering problem to graphs in which the pairwise relations between objects are categorical instead of binary. This problem was recently introduced by Bonchi et al. under the name of chromatic correlation clustering, and is motivated by many real-world applications in data mining and social networks, including community detection, link classification, and entity de-duplication. Our main contribution is a fast and easy-to-implement constant-approximation framework for the problem, which builds on a novel reduction of the problem to that of correlation clustering. This result significantly advances the current state of knowledge for the problem, improving on a previous result that only guaranteed an approximation linear in the input size. We complement the above result by developing a linear programming-based algorithm that achieves an improved approximation ratio of 4. Although this algorithm cannot be considered practical, it further extends our theoretical understanding of chromatic correlation clustering. We also present a fast heuristic algorithm that is motivated by real-life scenarios in which there is a ground-truth clustering that is obscured by noisy observations. We test our algorithms on both synthetic and real datasets, such as social network data. Our experiments reinforce the theoretical findings by demonstrating that our algorithms generally outperform previous approaches, both in terms of solution cost and reconstruction of an underlying ground-truth clustering.
Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily BIBAFull-Text 66-76
  Ashton Anderson; Daniel Huttenlocher; Jon Kleinberg; Jure Leskovec; Mitul Tiwari
Many of the world's most popular websites catalyze their growth through invitations from existing members. New members can then in turn issue invitations, and so on, creating cascades of member signups that can spread on a global scale. Although these diffusive invitation processes are critical to the popularity and growth of many websites, they have rarely been studied, and their properties remain elusive. For instance, it is not known how viral these cascade structures are, how cascades grow over time, or how diffusive growth affects the resulting distribution of member characteristics present on the site. In this paper, we study the diffusion of LinkedIn, an online professional network comprising over 332 million members, a large fraction of whom joined the site as part of a signup cascade. First we analyze the structural patterns of these signup cascades, and find them to be qualitatively different from previously studied information diffusion cascades. We also examine how signup cascades grow over time, and observe that diffusion via invitations on LinkedIn occurs over much longer timescales than are typically associated with other types of online diffusion. Finally, we connect the cascade structures with rich individual-level attribute data to investigate the interplay between the two. Using novel techniques to study the role of homophily in diffusion, we find striking differences between the local, edge-wise homophily and the global, cascade-level homophily we observe in our data, suggesting that signup cascades form surprisingly coherent groups of members.
Recommendation Subgraphs for Web Discovery BIBAFull-Text 77-87
  Arda Antikacioglu; R. Ravi; Srinath Sridhar
Recommendations are central to the utility of many popular e-commerce websites. Such sites typically contain a set of recommendations on every product page that enables visitors and crawlers to easily navigate the website. These recommendations are essentially universally present on all e-commerce websites. Choosing an appropriate set of recommendations at each page is a critical task performed by dedicated backend software systems. We formalize the concept of recommendations used for discovery as a natural graph optimization problem on a bipartite graph and propose three methods for solving the problem in increasing order of sophistication: a local random sampling algorithm, a greedy algorithm, and a more involved partitioning-based algorithm. We first theoretically analyze the performance of these three methods on random graph models, characterizing when each method yields a solution of sufficient quality and the parameter ranges in which more sophistication is needed. We complement this by providing an empirical analysis of these algorithms on simulated and real-world production data from a retail website. Our results confirm that it is not always necessary to implement complicated algorithms in the real world, and demonstrate that very good practical results can be obtained by using simple heuristics that are backed by the confidence of concrete theoretical guarantees.
Is Sniping A Problem For Online Auction Markets? BIBAFull-Text 88-96
  Matt Backus; Thomas Blake; Dimitriy V. Masterov; Steven Tadelis
A common complaint about online auctions for consumer goods is the presence of "snipers," who place bids in the final seconds of sequential ascending auctions with predetermined ending times. The literature conjectures that snipers are best-responding to the existence of "incremental" bidders that bid up to their valuation only as they are outbid. Snipers aim to catch these incremental bidders at a price below their reserve, with no time to respond. As a consequence, these incremental bidders may experience regret when they are outbid at the last moment at a price below their reservation value. We measure the effect of this experience on a new buyer's propensity to participate in future auctions. We show the effect to be causal using a carefully selected subset of auctions from eBay.com and an instrumental variables estimation strategy. Bidders respond to sniping quite strongly and are between 4 and 18 percent less likely to return to the platform.
Essential Web Pages Are Easy to Find BIBAFull-Text 97-107
  Ricardo Baeza-Yates; Paolo Boldi; Flavio Chierichetti
In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries -- i.e., we look at the problem as a version of MaxCover. Although in general MaxCover is approximable to within a factor of 1-1/e ≈ 0.632 of the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answer 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal -- as a first example, our preliminary experimental study confirms that our algorithm has extremely good performance on other (social-network-based) MaxCover instances.
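The greedy MaxCover strategy the abstract builds on can be sketched in a few lines. This is a generic illustration, not the paper's implementation; the page/query data and the weighting by query frequency are hypothetical.

```python
def greedy_max_cover(page_queries, query_weight, k):
    """Greedily pick k pages maximizing the total weight of distinct
    queries covered.

    page_queries: dict page -> set of queries the page answers
    query_weight: dict query -> frequency weight
    """
    covered = set()
    chosen = []
    remaining = dict(page_queries)
    for _ in range(k):
        best_page, best_gain = None, 0.0
        for page, qs in remaining.items():
            # marginal gain: weight of queries this page newly covers
            gain = sum(query_weight[q] for q in qs - covered)
            if gain > best_gain:
                best_page, best_gain = page, gain
        if best_page is None:        # no page adds any coverage
            break
        chosen.append(best_page)
        covered |= remaining.pop(best_page)
    return chosen, covered
```

Greedy gives the classic 1-1/e guarantee for weighted MaxCover; the paper's contribution is a cheap extra check certifying when this greedy solution is actually optimal or provably close to it.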
Design and Analysis of Benchmarking Experiments for Distributed Internet Services BIBAFull-Text 108-118
  Eytan Bakshy; Eitan Frachtenberg
The successful development and deployment of large-scale Internet services depends critically on performance. Even small regressions in processing time can translate directly into significant energy and user experience costs. Despite the widespread use of distributed server infrastructure (e.g., in cloud computing and Web services), there is little research on how to benchmark such systems to obtain valid and precise inferences with minimal data collection costs. Correctly A/B testing distributed Internet services can be surprisingly difficult because interdependencies between user requests (e.g., for search results, social media streams, photos) and host servers violate assumptions required by standard statistical tests. We develop statistical models of distributed Internet service performance based on data from Perflab, a production system used at Facebook which vets thousands of changes to the company's codebase each day. We show how these models can be used to understand the tradeoffs between different benchmarking routines, and what factors must be taken into account when performing statistical tests. Using simulations and empirical data from Perflab, we validate our theoretical results, and provide easy-to-implement guidelines for designing and analyzing such benchmarks.
ACCAMS: Additive Co-Clustering to Approximate Matrices Succinctly BIBAFull-Text 119-129
  Alex Beutel; Amr Ahmed; Alexander J. Smola
Matrix completion and approximation are popular tools to capture a user's preferences for recommendation and to approximate missing data. Instead of using low-rank factorization we take a drastically different approach, based on the simple insight that an additive model of co-clusterings allows one to approximate matrices efficiently. This allows us to build a concise model that, per bit of model learned, significantly beats all factorization approaches in matrix completion. Even more surprisingly, we find that summing over small co-clusterings is more effective in modeling matrices than classic co-clustering, which uses just one large partitioning of the matrix. Following Occam's razor principle, the fact that our model is more concise and yet just as accurate as more complex models suggests that it better captures the latent preferences and decision making processes present in the real world. We provide an iterative minimization algorithm, a collapsed Gibbs sampler, theoretical guarantees for matrix approximation, and excellent empirical evidence for the efficacy of our approach. We achieve state-of-the-art results for matrix completion on Netflix at a fraction of the model complexity.
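The additive co-clustering idea can be illustrated with a minimal residual-fitting sketch: co-cluster the current residual, replace each (row-cluster, column-cluster) block by its mean, add that "stencil" to the approximation, and repeat. This is an assumption-laden toy (plain k-means instead of the paper's collapsed Gibbs sampler), not the ACCAMS implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means used only to assign rows of X to k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def additive_coclustering(M, k, stages):
    """Approximate M as a sum of simple co-clusterings ("stencils")."""
    residual = M.astype(float).copy()
    approx = np.zeros_like(residual)
    for s in range(stages):
        r = kmeans(residual, k, seed=s)            # row clusters
        c = kmeans(residual.T, k, seed=s + 100)    # column clusters
        stencil = np.zeros_like(residual)
        for i in range(k):
            for j in range(k):
                block = residual[np.ix_(r == i, c == j)]
                if block.size:
                    stencil[np.ix_(r == i, c == j)] = block.mean()
        approx += stencil        # add this stage's block-mean model
        residual -= stencil      # next stage fits what is left over
    return approx
```

Each stage is a very coarse model (one block partition), but summing several of them captures structure that a single large co-clustering would need far more parameters to express, which is the abstract's central observation.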
Who, What, When, and Where: Multi-Dimensional Collaborative Recommendations Using Tensor Factorization on Sparse User-Generated Data BIBAFull-Text 130-140
  Preeti Bhargava; Thomas Phan; Jiayu Zhou; Juhan Lee
Given the abundance of online information available to mobile users, particularly tourists and weekend travelers, recommender systems that effectively filter this information and suggest interesting participatory opportunities will become increasingly important. Previous work has explored recommending interesting locations; however, users would also benefit from recommendations for activities in which to participate at those locations, along with suitable times and days. Thus, systems that provide collaborative recommendations involving multiple dimensions such as location, activities, and time would enhance the overall experience of users. The relationships among these dimensions can be modeled by higher-order matrices called tensors, which are then solved by tensor factorization. However, these tensors can be extremely sparse. In this paper, we present a system and an approach for performing multi-dimensional collaborative recommendations for Who (User), What (Activity), When (Time), and Where (Location), using tensor factorization on sparse user-generated data. We formulate an objective function which simultaneously factorizes coupled tensors and matrices constructed from heterogeneous data sources. We evaluate our system and approach on large-scale real-world data sets consisting of 588,000 Flickr photos collected from three major metro regions in the USA. We compare our approach with several state-of-the-art baselines and demonstrate that it outperforms all of them.
Secrets, Lies, and Account Recovery: Lessons from the Use of Personal Knowledge Questions at Google BIBAFull-Text 141-150
  Joseph Bonneau; Elie Bursztein; Ilan Caron; Rob Jackson; Mike Williamson
We examine the first large real-world data set on the security and memorability of personal knowledge questions, drawn from their deployment at Google. Our analysis confirms that secret questions generally offer a security level that is far lower than user-chosen passwords. It turns out to be even lower than proxies such as the real distribution of surnames in the population would indicate. Surprisingly, we found that a significant cause of this insecurity is that users often don't answer truthfully. A user survey we conducted revealed that a significant fraction of users (37%) who admitted to providing fake answers did so in an attempt to make them "harder to guess," although in aggregate this behavior had the opposite effect, as people "harden" their answers in the same, predictable way. On the usability side, we show that secret answers have surprisingly poor memorability, despite the assumption that their reliability motivates their continued deployment. From millions of account recovery attempts, we observed that a significant fraction of users (e.g., 40% of our English-speaking US users) were unable to recall their answers when needed. This is lower than the success rate of alternative recovery mechanisms such as SMS reset codes (over 80%). Comparing question strength and memorability reveals that the questions that are potentially the most secure (e.g., "What is your first phone number?") are also the ones with the worst memorability. We conclude that it appears next to impossible to find secret questions that are both secure and memorable. Secret questions continue to have some use when combined with other signals, but they should not be used alone, and best practice should favor more reliable alternatives.
Supporting Ethical Web Research: A New Research Ethics Review BIBAFull-Text 151-161
  Anne Bowser; Janice Y. Tsai
Research ethics is an important and timely topic. In academia, federally regulated Institutional Review Boards (IRBs) protect participants of human subjects research, and offer researchers a mechanism to assess the ethical implications of their work. Industry research labs are not subject to the same requirements, and may lack processes for research ethics review. We describe the creation of a new ethics framework and a research ethics submission system (RESS) within Microsoft Research (MSR). This RESS is customized to the needs of web researchers. We describe our iterative development process, including assessing the current state of web research; developing a framework of methods based on a survey of 358 research papers; building and evaluating our system with 14 users to identify the benefits and pitfalls of full deployment; evaluating how our system matches existing federal regulations; and suggesting next steps for supporting ethical web research.
Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing BIBAFull-Text 162-172
  Aniket Chakrabarti; Srinivasan Parthasarathy
All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects whose similarity is above a certain threshold for a given similarity measure of interest. When the number of points or the dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index-based approaches. BayesLSH is used for pruning the candidate space and computing approximate similarity, whereas BayesLSHLite can only prune the candidates, and similarity must be computed exactly on the original data. Thus, wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing too much quality. However, the loss in quality is higher in the BayesLSH variant, where the explicit data representation is not available; only a hash sketch is available, and similarity has to be estimated approximately. In this work we revisit the LSH problem from a frequentist setting and formulate sequential tests for composite hypotheses (similarity greater than or less than a threshold) that can be leveraged by such LSH algorithms for adaptively and aggressively pruning candidates. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea and two novel variants. We extend these variants to the case where approximate similarity needs to be computed, using a fixed-width sequential confidence interval generation technique. We compare these novel variants with the SPRT variant and the BayesLSH/BayesLSHLite variants and show that they can provide tighter qualitative guarantees over BayesLSH/BayesLSHLite -- a state-of-the-art approach -- while being up to 2.1x faster than a traditional SPRT and 8.8x faster than AllPairs.
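For intuition, Wald's classic SPRT on a stream of 0/1 hash-collision indicators looks as follows. This is a textbook simple-vs-simple sketch, not the paper's composite-hypothesis or fixed-width-interval variants; it assumes 0 < p0 < p1 < 1.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a Bernoulli rate.

    Consumes a stream of 0/1 hash-collision indicators and decides
    between H0: p = p0 (similarity below threshold) and H1: p = p1
    (above), stopping as soon as the evidence is strong enough.
    alpha/beta are the tolerated type I/II error rates.
    """
    lo = math.log(beta / (1 - alpha))      # accept H0 at or below this
    hi = math.log((1 - beta) / alpha)      # accept H1 at or above this
    llr = 0.0                              # running log-likelihood ratio
    for n, x in enumerate(observations, 1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr <= lo:
            return "H0", n
        if llr >= hi:
            return "H1", n
    return "undecided", len(observations)
```

The appeal for LSH candidate pruning is the early stopping: a clearly dissimilar pair crosses the H0 boundary after only a handful of hash comparisons instead of a fixed-length sketch scan.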
Opinion Spam Detection in Web Forum: A Real Case Study BIBAFull-Text 173-183
  Yu-Ren Chen; Hsin-Hsi Chen
Opinion spamming refers to the illegal marketing practice of delivering commercially advantageous opinions while posing as a regular user. In this paper, we conduct a real case study based on a set of internal records of opinion spam leaked from a shady marketing campaign. We explore the characteristics of opinion spam and spammers in a web forum to obtain some insights, including the subtlety of opinion spam, spam post ratio, spammer accounts, first posts and replies, submission time of posts, activeness of threads, and collusion among spammers. Then we present features that could be potentially helpful in detecting spam opinions in threads. The results of spam detection on first posts show: (1) spam first posts put more focus on certain topics, such as users' experiences with the promoted items; (2) spam first posts generally use more words and pictures to showcase the promoted items in an attempt to impress people; (3) spam first posts tend to be submitted during work hours; and (4) the threads that spam first posts initiate are kept more active, so that they are placed at striking positions. Spam detection on replies is more challenging. Besides a lower spam ratio and less content, replies often do not even mention the promoted items; their major intention is to keep the discussion in a thread alive to attract more attention to it. Submission time of replies, thread activeness, position of replies, and spamicity of the first post are more useful than content-based features in spam detection on replies.
Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking BIBAFull-Text 184-194
  Gong Cheng; Danyun Xu; Yuzhong Qu
Entity linking connects the Web of documents with knowledge bases. It is the task of linking an entity mention in text to its corresponding entity in a knowledge base. Whereas a large body of work has been devoted to automatically generating candidate entities, or ranking and choosing from them, manual efforts are still needed, e.g., for defining gold-standard links for evaluating automatic approaches, and for improving the quality of links in crowdsourcing approaches. However, structured descriptions of entities in knowledge bases are sometimes very long. To avoid overloading human users with too much information and to help them more efficiently choose an entity from candidates, we aim to substitute entire entity descriptions with compact, equally effective structured summaries that are automatically generated. To achieve this, our approach analyzes entity descriptions in the knowledge base and the context of the entity mention from multiple perspectives, including characterizing and differentiating power, information overlap, and relevance to context. Extrinsic evaluation (where human users carry out entity linking tasks) and intrinsic evaluation (where human users rate summaries) demonstrate that summaries generated by our approach help human users carry out entity linking tasks more efficiently (22-23% faster), without significantly affecting the quality of links obtained, and that our approach outperforms existing approaches to summarizing entity descriptions.
Semantic Tagging of Mathematical Expressions BIBAFull-Text 195-204
  Pao-Yu Chien; Pu-Jen Cheng
Semantic tagging of mathematical expressions (STME) gives semantic meanings to tokens in mathematical expressions. In this work, we propose a novel STME approach that relies neither on text accompanying the expressions nor on labelled training data. Instead, our method only requires a mathematical grammar set. We point out that, besides the grammar of mathematics, the special properties of variables and user habits in writing expressions help us understand the implicit intents of the user. We build a system that considers both restrictions from the grammar and variable properties, and then apply an unsupervised method to our probabilistic model to learn user habits. To evaluate our system, we automatically build large-scale training and test datasets from a public math forum. The results demonstrate the significant improvement of our method over the maximum-frequency baseline. We also compile statistics that reveal the properties of mathematical language.
Collaborative Ranking with a Push at the Top BIBAFull-Text 205-215
  Konstantina Christakopoulou; Arindam Banerjee
The goal of collaborative filtering is to get accurate recommendations at the top of the list for a set of users. From such a perspective, collaborative ranking based formulations with suitable ranking loss functions are natural. While recent literature has explored the idea based on objective functions such as NDCG or Average Precision, such objectives are difficult to optimize directly. In this paper, building on recent advances from the learning to rank literature, we introduce a novel family of collaborative ranking algorithms which focus on accuracy at the top of the list for each user while learning the ranking functions collaboratively. We consider three specific formulations, based on collaborative p-norm push, infinite push, and reverse-height push, and propose efficient optimization methods for learning these models. Experimental results illustrate the value of collaborative ranking, and show that the proposed methods are competitive, usually better than existing popular approaches to personalized recommendation.
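The "push" losses the abstract refers to can be illustrated directly. The sketch below uses a hinge surrogate for the p-norm push objective; the exact loss, regularization, and collaborative optimization in the paper may differ, and the score inputs are hypothetical.

```python
import numpy as np

def p_norm_push_loss(pos_scores, neg_scores, p=2.0):
    """P-norm push ranking loss for one user's scored items.

    For each negative (irrelevant) item, accumulate hinge penalties from
    every positive item it crowds against, then raise that per-negative
    total to the p-th power. Larger p concentrates the penalty on the
    worst-offending negatives, i.e. the ones intruding at the top of the
    list (p -> infinity approaches the infinite push).
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    margins = pos[:, None] - neg[None, :]      # all positive-negative gaps
    hinge = np.maximum(0.0, 1.0 - margins)     # l(t) = max(0, 1 - t)
    per_negative = hinge.sum(axis=0)           # "height" of each negative
    return float((per_negative ** p).sum())
```

A perfectly separated ranking (every positive scored at least 1 above every negative) incurs zero loss, while a single negative sitting near the top is penalized superlinearly, which is exactly the accuracy-at-the-top emphasis the paper targets.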
Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications BIBAFull-Text 216-226
  Christopher Michael De Vries; Lance De Vine; Shlomo Geva; Richi Nayak
The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes: ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages, respectively, and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine-grained clustering has not been previously demonstrated. Previous approaches clustered only a sample, which limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than existing algorithms. Fine-grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations address the problem of assessing cluster quality where categorical labeling is unavailable or infeasible.
Network-based Origin Confusion Attacks against HTTPS Virtual Hosting BIBAFull-Text 227-237
  Antoine Delignat-Lavaud; Karthikeyan Bhargavan
We investigate current deployment practices for virtual hosting, a widely used method for serving multiple HTTP and HTTPS origins from the same server, in popular content delivery networks, cloud-hosting infrastructures, and web servers. Our study uncovers a new class of HTTPS origin confusion attacks: when two virtual hosts use the same TLS certificate, or share a TLS session cache or ticket encryption key, a network attacker may cause a page from one of them to be loaded under the other's origin in a client browser. These attacks appear when HTTPS servers are configured to allow virtual host fallback from a client-requested, secure origin to some other unexpected, less-secure origin. We present evidence that such vulnerable virtual host configurations are widespread, even on the most popular and security-scrutinized websites, thus allowing a network adversary to hijack pages, or steal secure cookies and single sign-on tokens. To prevent our virtual host confusion attacks and recover the isolation guarantees that are commonly assumed in shared hosting environments, we propose fixes to web server software and advocate conservative configuration guidelines for the composition of HTTP with TLS.
The Dynamics of Micro-Task Crowdsourcing: The Case of Amazon MTurk BIBAFull-Text 238-247
  Djellel Eddine Difallah; Michele Catasta; Gianluca Demartini; Panagiotis G. Ipeirotis; Philippe Cudré-Mauroux
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks leads to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc. In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, and platform); (B) leverage the main findings of our five-year log analysis to propose features for a predictive model that determines the expected performance of any batch at a specific point in time, showing that the number of tasks left in a batch and how recent the batch is are two key features of the prediction; and (C) conduct an analysis of the demand (new tasks posted by requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
Hierarchical Neural Language Models for Joint Representation of Streaming Documents and their Content BIBAFull-Text 248-255
  Nemanja Djuric; Hao Wu; Vladan Radosavljevic; Mihajlo Grbovic; Narayan Bhamidipati
We consider the problem of learning distributed representations for documents in data streams. The documents are represented as low-dimensional vectors and are jointly learned with distributed vector representations of word tokens using a hierarchical framework with two embedded neural language models. In particular, we exploit the context of documents in streams and use one of the language models to model the document sequences, and the other to model word sequences within them. The models learn continuous vector representations for both word tokens and documents such that semantically similar documents and words are close in a common vector space. We discuss extensions to our model, which can be applied to personalized recommendation and social relationship mining by adding further user layers to the hierarchy, thus learning user-specific vectors to represent individual preferences. We validated the learned representations on a public movie rating data set from MovieLens, as well as on a large-scale Yahoo News data set comprising three months of user activity logs collected on Yahoo servers. The results indicate that the proposed model can learn useful representations of both documents and word tokens, outperforming the current state-of-the-art by a large margin.
Future User Engagement Prediction and Its Application to Improve the Sensitivity of Online Experiments BIBAFull-Text 256-266
  Alexey Drutsa; Gleb Gusev; Pavel Serdyukov
Modern Internet companies improve their services by means of data-driven decisions that are based on online controlled experiments (also known as A/B tests). Running more online controlled experiments and obtaining statistically significant results faster are emerging needs for these companies. The main way to achieve these goals is to improve the sensitivity of A/B experiments. We propose a novel approach to improve the sensitivity of user engagement metrics (which are widely used in A/B tests) by utilizing predictions of the future behavior of an individual user. The problem of predicting the exact value of a user engagement metric is also novel and is studied in our work. We demonstrate the effectiveness of our sensitivity improvement approach on several real online experiments run at Yandex. In particular, we show how it can be used to detect the treatment effect of an A/B test faster with the same level of statistical significance.
Enriching Structured Knowledge with Open Information BIBAFull-Text 267-277
  Arnab Dutta; Christian Meilicke; Heiner Stuckenschmidt
We propose an approach for semantifying web extracted facts. In particular, we map subject and object terms of these facts to instances, and relational phrases to object properties defined in a target knowledge base. By doing this we resolve the ambiguity inherent in the web extracted facts, while simultaneously enriching the target knowledge base with a significant number of new assertions. In this paper, we focus on the mapping of the relational phrases in the context of the overall workflow. Furthermore, in an open extraction setting identical semantic relationships can be represented by different surface forms, making it necessary to group these surface forms together. To solve this problem we propose the use of Markov clustering. In this work we present a complete, ontology-independent, generalized workflow which we evaluate on facts extracted by Nell and Reverb. Our target knowledge base is DBpedia. Our evaluation shows promising results in terms of producing highly precise facts. Moreover, the results indicate that the clustering of relational phrases pays off in terms of improved instance and property mapping.
A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems BIBAFull-Text 278-288
  Ali Mamdouh Elkahky; Yang Song; Xiaodong He
Recent online services rely heavily on automatic personalization to recommend relevant content to a large number of users. This requires systems to scale promptly to accommodate the stream of new users visiting the online services for the first time. In this work, we propose a content-based recommendation system to address both the recommendation quality and the system scalability. We propose to use a rich feature set to represent users, according to their web browsing history and search queries. We use a Deep Learning approach to map users and items to a latent space where the similarity between users and their preferred items is maximized. We extend the model to jointly learn from features of items from different domains and user features by introducing a multi-view Deep Learning model. We show how to make this rich-feature based user representation scalable by reducing the dimension of the inputs and the amount of training data. The rich user feature representation allows the model to learn relevant user behavior patterns and give useful recommendations for users who do not have any interaction with the service, given that they have adequate search and browsing history. The combination of different domains into a single model for learning helps improve the recommendation quality across all the domains, as well as having a more compact and a semantically richer user latent feature vector. We experiment with our approach on three real-world recommendation systems acquired from different sources of Microsoft products: Windows Apps recommendation, News recommendation, and Movie/TV recommendation. Results indicate that our approach is significantly better than the state-of-the-art algorithms (up to 49% enhancement on existing users and 115% enhancement on new users). In addition, experiments on a publicly available data set also indicate the superiority of our method in comparison with traditional generative topic models for modeling cross-domain recommender systems.
Scalability analysis shows that our multi-view DNN model can easily scale to encompass millions of users and billions of item entries. Experimental results also confirm that combining features from all domains produces much better performance than building separate models for each domain.
Cookies That Give You Away: The Surveillance Implications of Web Tracking BIBAFull-Text 289-299
  Steven Englehardt; Dillon Reisman; Christian Eubank; Peter Zimmerman; Jonathan Mayer; Arvind Narayanan; Edward W. Felten
We study the ability of a passive eavesdropper to leverage "third-party" HTTP tracking cookies for mass surveillance. If two web pages embed the same tracker which tags the browser with a unique cookie, then the adversary can link visits to those pages from the same user (i.e., browser instance) even if the user's IP address varies. Further, many popular websites leak a logged-in user's identity to an eavesdropper in unencrypted traffic. To evaluate the effectiveness of our attack, we introduce a methodology that combines web measurement and network measurement. Using OpenWPM, our web privacy measurement platform, we simulate users browsing the web and find that the adversary can reconstruct 62-73% of a typical user's browsing history. We then analyze the effect of the physical location of the wiretap as well as legal restrictions such as the NSA's "one-end foreign" rule. Using measurement units in various locations -- Asia, Europe, and the United States -- we show that foreign users are highly vulnerable to the NSA's dragnet surveillance due to the concentration of third-party trackers in the U.S. Finally, we find that some browser-based privacy tools mitigate the attack while others are largely ineffective.
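The core linking step the abstract describes, clustering page visits that share any third-party tracker cookie even as the client IP changes, can be sketched as a union-find over pages and cookie identifiers. This is a toy illustration with hypothetical names and input shape, not the paper's OpenWPM pipeline:

```python
# Toy sketch of cookie-based visit linking: given wiretap observations of
# (page, tracker_domain, cookie_id), cluster pages linkable to one browser.
from collections import defaultdict

def link_visits(observations):
    """observations: iterable of (page_url, tracker_domain, cookie_id).
    Returns clusters of pages attributable to the same browser instance."""
    parent = {}  # union-find forest over page and cookie keys

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # A page visit and the cookie observed on it belong to the same browser.
    for page, tracker, cookie in observations:
        union(("page", page), ("cookie", tracker, cookie))

    clusters = defaultdict(set)
    for key in parent:
        if key[0] == "page":
            clusters[find(key)].add(key[1])
    return [sorted(c) for c in clusters.values() if len(c) > 1]
```

Any cookie shared across observations merges the corresponding visits, which is why a single widely embedded tracker suffices to reconstruct much of a browsing history.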
Efficient Densest Subgraph Computation in Evolving Graphs BIBAFull-Text 300-310
  Alessandro Epasto; Silvio Lattanzi; Mauro Sozio
Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at random from the current graph. We show that at any point in time we are able to maintain a 2(1+ε)-approximation of a current densest subgraph, while requiring O(polylog(n+r)) amortized cost per update (with high probability), where r is the total number of update operations executed and n is the maximum number of nodes in the graph. In contrast, a naive algorithm that recomputes a dense subgraph every time the graph changes requires Omega(m) work per update, where m is the number of edges in the current graph. Our theoretical analysis is complemented with an extensive experimental evaluation on large real-world graphs showing that (approximate) densest subgraphs can be maintained efficiently within hundreds of microseconds per update.
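For contrast with the dynamic algorithm the abstract describes, the classical static baseline is Charikar's greedy peeling, which repeatedly deletes a minimum-degree node and returns the densest intermediate subgraph (a 2-approximation of the maximum average degree). This is an illustrative sketch of that static baseline, not the paper's method:

```python
# Charikar's greedy peeling for (static) densest subgraph, using a heap
# with lazy deletion of stale entries. Density = edges / nodes.
import heapq

def densest_subgraph(adj):
    """adj: dict node -> set of neighbours (undirected graph).
    Returns (best density, node set achieving it)."""
    adj = {u: set(vs) for u, vs in adj.items()}  # defensive copy
    edges = sum(len(vs) for vs in adj.values()) // 2
    nodes = set(adj)
    best_density, best_nodes = 0.0, set(nodes)
    heap = [(len(vs), u) for u, vs in adj.items()]
    heapq.heapify(heap)
    removed = set()
    while nodes:
        deg, u = heapq.heappop(heap)
        if u in removed or deg != len(adj[u]):
            continue  # stale heap entry; a fresher one exists
        density = edges / len(nodes)  # density before removing u
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        for v in adj[u]:
            adj[v].discard(u)
            heapq.heappush(heap, (len(adj[v]), v))
        edges -= deg
        nodes.discard(u)
        removed.add(u)
    return best_density, best_nodes
```

Rerunning this from scratch on every edge update is exactly the Omega(m)-per-update cost the paper's dynamic data structure avoids.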
A Practical Framework for Privacy-Preserving Data Analytics BIBAFull-Text 311-321
  Liyue Fan; Hongxia Jin
The availability of an increasing amount of user generated data is transformative to our society. We enjoy the benefits of analyzing big data for public interest, such as disease outbreak detection and traffic control, as well as for commercial interests, such as smart grid and product recommendation. However, the large collection of user generated data contains unique patterns and can be used to re-identify individuals, which has been exemplified by the AOL search log release incident. In this paper, we propose a practical framework for data analytics, while providing differential privacy guarantees to individual data contributors. Our framework generates differentially private aggregates which can be used to perform data mining and recommendation tasks. To alleviate the high perturbation errors introduced by the differential privacy mechanism, we present two methods with different sampling techniques to draw a subset of individual data for analysis. Empirical studies with real-world data sets show that our solutions enable accurate data analytics on a small fraction of the input data, reducing user privacy risk and data storage requirement without compromising the analysis results.
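The differentially private aggregates such a framework builds on can be illustrated with the standard Laplace mechanism; a minimal sketch for a counting query, whose sensitivity is 1 (the function name and interface are ours, and the paper's contribution is the sampling layered on top of such a mechanism):

```python
# Standard Laplace mechanism for an epsilon-differentially-private count.
# A counting query has sensitivity 1, so the noise scale is 1/epsilon.
import math
import random

def private_count(values, epsilon):
    """Return len(values) perturbed with Laplace(0, 1/epsilon) noise."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    # Inverse-CDF sampling of a Laplace variate with scale 1/epsilon.
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return len(values) + noise
```

Smaller epsilon means stronger privacy but larger perturbation error, which is precisely the trade-off the paper's sampling techniques aim to soften.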
Compressed Indexes for String Searching in Labeled Graphs BIBAFull-Text 322-332
  Paolo Ferragina; Francesco Piccinno; Rossano Venturini
Storing and searching large labeled graphs is becoming a key issue in the design of space/time-efficient online platforms indexing modern social networks or knowledge graphs. But, as far as we know, all these results are limited to designing compressed graph indexes that support basic access operations on the link structure of the input graph, such as: given a node u, return the adjacency list of u. This paper takes inspiration from Facebook's Unicorn platform and proposes compressed-indexing schemes for large graphs whose nodes are labeled with strings of variable length -- i.e., node attributes such as a user's (nick-)name -- that support sophisticated search operations involving both the linked structure of the graph and the string content of its nodes.
   An extensive experimental evaluation over real social networks shows the time and space efficiency of the proposed indexing schemes and their query processing algorithms.
Improving Paid Microtasks through Gamification and Adaptive Furtherance Incentives BIBAFull-Text 333-343
  Oluwaseyi Feyisetan; Elena Simperl; Max Van Kleek; Nigel Shadbolt
Crowdsourcing via paid microtasks has been successfully applied in a plethora of domains and tasks. Previous efforts for making such crowdsourcing more effective have considered aspects as diverse as task and workflow design, spam detection, quality control, and pricing models. Our work expands upon such efforts by examining the potential of adding gamification to microtask interfaces as a means of improving both worker engagement and effectiveness. We run a series of experiments in image labeling, one of the most common use cases for microtask crowdsourcing, and analyse worker behavior in terms of number of images completed, quality of annotations compared against a gold standard, and response to financial and game-specific rewards. Each experiment studies these parameters in two settings: one based on a state-of-the-art, non-gamified task on CrowdFlower and another one using an alternative interface incorporating several game elements. Our findings show that gamification leads to better accuracy and lower costs than conventional approaches that use only monetary incentives. In addition, it seems to make paid microtask work more rewarding and engaging, especially when sociality features are introduced. Following these initial insights, we define a predictive model for estimating the most appropriate incentives for individual workers, based on their previous contributions. This allows us to build a personalised game experience, with gains seen on the volume and quality of work completed.
Tagging Personal Photos with Transfer Deep Learning BIBAFull-Text 344-354
  Jianlong Fu; Tao Mei; Kuiyuan Yang; Hanqing Lu; Yong Rui
The advent of mobile devices and media cloud services has led to the unprecedented growing of personal photo collections. One of the fundamental problems in managing the increasing number of photos is automatic image tagging. Existing research has predominantly focused on tagging general Web images with a well-labelled image database, e.g., ImageNet. However, they can only achieve limited success on personal photos due to the domain gaps between personal photos and Web images. These gaps originate from the differences in semantic distribution and visual appearance. To deal with these challenges, in this paper, we present a novel transfer deep learning approach to tag personal photos. Specifically, to solve the semantic distribution gap, we have designed an ontology consisting of a hierarchical vocabulary tailored for personal photos. This ontology is mined from 10,000 active users in Flickr with 20 million photos and 2.7 million unique tags. To deal with the visual appearance gap, we discover the intermediate image representations and ontology priors by deep learning with bottom-up and top-down transfers across two domains, where Web images are the source domain and personal photos are the target. Moreover, we present two modes (single and batch-modes) in tagging and find that the batch-mode is highly effective to tag photo collections. We conducted personal photo tagging on 7,000 real personal photos and personal photo search on the MIT-Adobe FiveK photo dataset. The proposed tagging approach is able to achieve a performance gain of 12.8% and 4.5% in terms of NDCG@5, against the state-of-the-art hand-crafted feature-based and deep learning-based methods, respectively.
MobInsight: On Improving The Performance of Mobile Apps in Cellular Networks BIBAFull-Text 355-365
  Vijay Gabale; Dilip Krishnaswamy
It is well-known that the performance of Web browsing as well as mobile applications (or apps) suffers on today's cellular networks. In this work, we perform a systematic measurement study of more than 50 popular apps and 2 cellular networks, and discover that while cellular networks have predictable latency, it is the path between the exit points of cellular networks (e.g., GGSN) and cloud servers that degrades app performance. High latency and unpredictability over this path affect browsing and activity completion times of apps, worsening performance by several orders of magnitude. Furthermore, we find that as the number of apps on mobile devices increases, cellular networks in turn suffer due to the large number of active connections, primarily used for push notifications, which impose heavy signaling overhead on the network. Towards accelerating the performance of apps and improving their operational efficiency, we envision an easy-to-deploy, operator-managed platform, and study two architectural optimizations that sit at vantage points inside cellular networks: a virtual app server (vApp) and a network-assisted virtual push-notification server (vPNS). vApps improve apps' browsing experience, while vPNSs take the burden of carrying periodic messages off cellular networks. Using trace-driven simulations, we find that vApps can improve activity completion times by more than 3-fold, whereas vPNSs can reduce the signaling load by a factor of 6 in cellular networks and reduce energy consumption by a factor of 2 on mobile devices.
Rethinking Security of Web-Based System Applications BIBAFull-Text 366-376
  Martin Georgiev; Suman Jana; Vitaly Shmatikov
Many modern desktop and mobile platforms, including Ubuntu, Google Chrome, Windows, and Firefox OS, support so-called Web-based system applications that run outside the Web browser and enjoy direct access to native objects such as files, camera, and geolocation. We show that the access-control models of these platforms are (a) incompatible and (b) prone to unintended delegation of native-access rights: when applications request native access for their own code, they unintentionally enable it for untrusted third-party code, too. This enables malicious ads and other third-party content to steal users' OAuth authentication credentials, access the camera on their devices, etc.
   We then design, implement, and evaluate PowerGate, a new access-control mechanism for Web-based system applications. It solves two key problems plaguing all existing platforms: security and consistency. First, unlike the existing platforms, PowerGate correctly protects native objects from unauthorized access. Second, PowerGate provides uniform access-control semantics across all platforms and is 100% backward compatible. PowerGate enables application developers to write well-defined native-object access policies with explicit principals such as "application's own local code" and "third-party Web code," is easy to configure, and incurs negligible performance overhead.
Cardinal Contests BIBAFull-Text 377-387
  Arpita Ghosh; Patrick Hummel
Contests are widely used as a means for effort elicitation in settings ranging from government R&D contests to online crowdsourcing contests on platforms such as Kaggle, Innocentive, or TopCoder. Such rank-order mechanisms -- where agents' rewards depend only on the relative ranking of their submissions' qualities -- are natural mechanisms for incentivizing effort when it is easier to obtain ordinal, rather than cardinal, information about agents' outputs, or where absolute measures of quality are unverifiable. An increasing number of online contests, however, rank entries according to some numerical evaluation of their absolute quality -- for instance, the performance of an algorithm on a test dataset, or the performance of an intervention in a randomized trial. Can the contest designer incentivize higher effort by making the rewards in an ordinal rank-order mechanism contingent on such cardinal information? We model and analyze cardinal contests, where a principal running a rank-order tournament has access to an absolute measure of the qualities of agents' submissions in addition to their relative rankings, and ask how modifying the rank-order tournament to incorporate cardinal information can improve incentives for effort. Our main result is that a simple threshold mechanism -- a mechanism that awards the prize for a rank if and only if the absolute quality of the agent at that rank exceeds a certain threshold -- is optimal amongst all mixed cardinal-ordinal mechanisms where the fraction of the jth prize awarded to the jth-ranked agent is any arbitrary non-decreasing function of her submission's quality. Further, the optimal threshold mechanism uses exactly the same threshold for each rank. We study what contest parameters determine the extent of the benefit from incorporating such cardinal information into an ordinal rank-order contest, and investigate the extent of improvement in equilibrium effort via numerical simulations.
Accessible On-Line Floor Plans BIBAFull-Text 388-398
  Cagatay Goncu; Anuradha Madugalla; Simone Marinai; Kim Marriott
Better access to on-line information graphics is a pressing need for people who are blind or have severe vision impairment. We present a new model for accessible presentation of on-line information graphics and demonstrate its use for presenting floor plans. While floor plans are increasingly provided on-line, people who are blind are at best provided with only a high-level textual description. This makes it difficult for them to understand the spatial arrangement of the objects on the floor plan. Our new approach provides users with significantly better access to such plans. The users can automatically generate an accessible version of a floor plan from an on-line floor plan image quickly and independently by using a web service. This generates a simplified graphic showing the rooms, walls, doors and windows in the original floor plan as well as a textual overview. The accessible floor plan is presented on an iPad using audio feedback. As the users touch graphic elements on the screen, the element they are touching is described by speech and non-speech audio in order to help them navigate the graphic.
Network A/B Testing: From Sampling to Estimation BIBAFull-Text 399-409
  Huan Gui; Ya Xu; Anmol Bhasin; Jiawei Han
A/B testing, also known as bucket testing, split testing, or controlled experiment, is a standard way to evaluate user engagement or satisfaction from a new service, feature, or product. It is widely used in online websites, including social network sites such as Facebook, LinkedIn, and Twitter, to make data-driven decisions. The goal of A/B testing is to estimate the treatment effect of a new change, which becomes intricate when users are interacting, i.e., the treatment effect on a user may spill over to other users via underlying social connections. When conducting these online controlled experiments, it is common practice to make the Stable Unit Treatment Value Assumption (SUTVA) that each individual's response is affected by their own treatment only. Though this assumption simplifies the estimation of the treatment effect, it does not hold when network interference is present, and may even lead to wrong conclusions.
   In this paper, we study the problem of network A/B testing in real networks, which have substantially different characteristics from the simulated random networks studied in previous work. First, we examine the existence of a network effect in a recent online experiment conducted at LinkedIn; second, we propose an efficient and effective estimator of the Average Treatment Effect (ATE) that accounts for interference between users in real online experiments; finally, we apply our method in both simulations and a real-world online experiment. The simulation results show that our estimator achieves better performance with respect to both bias and variance reduction. The real-world online experiment not only demonstrates that large-scale network A/B tests are feasible but also further validates many of our observations from the simulation studies.
User Session Identification Based on Strong Regularities in Inter-activity Time BIBAFull-Text 410-418
  Aaron Halfaker; Oliver Keyes; Daniel Kluver; Jacob Thebault-Spieker; Tien Nguyen; Kenneth Shores; Anuradha Uduwage; Morten Warncke-Wang
Session identification is a common strategy used to develop metrics for web analytics and perform behavioral analyses of user-facing systems. Past work has argued that session identification strategies based on an inactivity threshold are inherently arbitrary, or has advocated that thresholds be set at about 30 minutes. In this work, we demonstrate a strong regularity in the temporal rhythms of user-initiated events across several different domains of online activity (incl. video gaming, search, page views and volunteer contributions). We describe a methodology for identifying clusters of user activity and argue that the regularity with which these activity clusters appear implies a good rule-of-thumb inactivity threshold of about 1 hour. We conclude with implications that these temporal rhythms may have for system design, based on our observations and theories of goal-directed human activity.
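The rule of thumb the abstract arrives at reduces to a one-pass split of a user's event stream wherever the inter-activity gap exceeds about an hour. A minimal sketch (the function name and default threshold are ours, used to illustrate the thresholding, not the authors' clustering methodology):

```python
# Split one user's sorted event timestamps into sessions at gaps longer
# than an inactivity threshold (default ~1 hour, per the rule of thumb).
def sessionize(timestamps, threshold=3600):
    """timestamps: sorted unix times of one user's events -> list of sessions."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)  # continue the current session
        else:
            sessions.append([t])    # gap too large: start a new session
    return sessions
```

Choosing 30 minutes versus 1 hour only changes where the stream is cut, which is why the empirical regularity in gap lengths matters for picking the threshold.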
Incentivizing High Quality Crowdwork BIBAFull-Text 419-429
  Chien-Ju Ho; Aleksandrs Slivkins; Siddharth Suri; Jennifer Wortman Vaughan
We study the causal effects of financial incentives on the quality of crowdwork. We focus on performance-based payments (PBPs), bonus payments awarded to workers for producing high quality work. We design and run randomized behavioral experiments on the popular crowdsourcing platform Amazon Mechanical Turk with the goal of understanding when, where, and why PBPs help, identifying properties of the payment, payment structure, and the task itself that make them most effective. We provide examples of tasks for which PBPs do improve quality. For such tasks, the effectiveness of PBPs is not too sensitive to the threshold for quality required to receive the bonus, while the magnitude of the bonus must be large enough to make the reward salient. We also present examples of tasks for which PBPs do not improve quality. Our results suggest that for PBPs to improve quality, the task must be effort-responsive: the task must allow workers to produce higher quality work by exerting more effort. We also give a simple method to determine if a task is effort-responsive a priori. Furthermore, our experiments suggest that all payments on Mechanical Turk are, to some degree, implicitly performance-based in that workers believe their work may be rejected if their performance is sufficiently poor. Finally, we propose a new model of worker behavior that extends the standard principal-agent model from economics to include a worker's subjective beliefs about his likelihood of being paid, and show that the predictions of this model are in line with our experimental findings. This model may be useful as a foundation for theoretical studies of incentives in crowdsourcing markets.
Skolemising Blank Nodes while Preserving Isomorphism BIBAFull-Text 430-440
  Aidan Hogan
In this paper, we propose and evaluate a scheme to produce canonical labels for blank nodes in RDF graphs. These labels can be used as the basis for a Skolemisation scheme that gets rid of the blank nodes in an RDF graph by mapping them to globally canonical IRIs. Assuming no hash collisions, the scheme guarantees that two Skolemised graphs will be equal if and only if the two input graphs are isomorphic. Although the proposed scheme is exponential in the worst case, we claim that such cases are unlikely to be encountered in practice. To support these claims, we present the results of applying our Skolemisation scheme over a diverse collection of 43.5 million real-world RDF graphs (BTC-2014); we also provide results for some nasty synthetic cases.
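The labelling idea can be sketched, in heavily simplified form, as colour refinement over blank nodes: iteratively re-hash each blank node from the hashes of the triples it occurs in, so isomorphic graphs end up with the same multiset of labels. This sketch omits what makes the real scheme correct and complete (hash-collision handling, automorphism breaking, and the exponential worst case the abstract mentions):

```python
# Simplified colour-refinement hashing of blank nodes in an RDF graph.
# Triples are (s, p, o) strings; blank nodes are prefixed with "_:".
import hashlib

def h(*parts):
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

def label_blank_nodes(triples, rounds=8):
    blanks = {t[i] for t in triples for i in (0, 2) if t[i].startswith("_:")}
    colour = {b: h("init") for b in blanks}
    for _ in range(rounds):
        nxt = {}
        for b in blanks:
            # Hash every triple mentioning b, substituting current colours
            # for blank nodes, then combine the sorted triple hashes.
            parts = sorted(
                h(colour.get(s, s), p, colour.get(o, o))
                for (s, p, o) in triples if b in (s, o)
            )
            nxt[b] = h(*parts)
        colour = nxt
    return colour
```

Because the final hashes depend only on graph structure and ground terms, they can serve as the basis of Skolem IRIs that agree across isomorphic inputs (modulo the caveats above).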
Scalable Methods for Adaptively Seeding a Social Network BIBAFull-Text 441-451
  Thibaut Horel; Yaron Singer
In recent years, social networking platforms have developed into extraordinary channels for spreading and consuming information. Along with the rise of such infrastructure, there is continuous progress on techniques for spreading information effectively through influential users. In many applications, one is restricted to select influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. An alternative approach one can consider is an adaptive method which selects users in a manner which targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. In particular, we develop algorithms for linear influence models with provable approximation guarantees that can be gracefully parallelized. To show the effectiveness of our methods, we collected data from various verticals that social network users follow. For each vertical, we collected data on the users who responded to a certain post as well as their neighbors, and applied our methods on this data. Our experiments show that adaptive seeding is scalable, and importantly, that it obtains dramatic improvements over standard approaches of information dissemination.
User Review Sites as a Resource for Large-Scale Sociolinguistic Studies BIBAFull-Text 452-461
  Dirk Hovy; Anders Johannsen; Anders Søgaard
Sociolinguistic studies investigate the relation between language and extra-linguistic variables. This requires both representative text data and the associated socio-economic meta-data of the subjects. Traditionally, sociolinguistic studies use small samples of hand-curated data and meta-data. This can lead to exaggerated or false conclusions. Using social media data offers a large-scale source of language data, but usually lacks reliable socio-economic meta-data. Our research aims to remedy both problems by exploring a large new data source, international review websites with user profiles. They provide more text data than manually collected studies, and more meta-data than most available social media text. We describe the data and present various pilot studies, illustrating the usefulness of this resource for sociolinguistic studies. Our approach can help generate new research hypotheses based on data-driven findings across several countries and languages.
When Does Improved Targeting Increase Revenue? BIBAFull-Text 462-472
  Patrick Hummel; R. Preston McAfee
In second price auctions with symmetric bidders, we find that improved targeting via enhanced information disclosure decreases revenue when there are two bidders and increases revenue if there are at least four bidders. With asymmetries, improved targeting increases revenue if the most frequent winner wins less than 30.4% of the time, but can decrease revenue otherwise. We derive analogous results for position auctions. Finally, we show that revenue can vary non-monotonically with the number of bidders who are able to take advantage of improved targeting.
Social Status and Badge Design BIBAFull-Text 473-483
  Nicole Immorlica; Greg Stoddard; Vasilis Syrgkanis
Many websites encourage user participation via the use of virtual rewards like badges. While badges typically have no explicit value, they act as symbols of social status within a community. In this paper, we study how to design virtual incentive mechanisms that maximize total contributions to a website when users are motivated by social status. We consider a game-theoretic model where users exert costly effort to make contributions and, in return, are awarded badges. The value of a badge is determined endogenously by the number of users who earn an equal or higher badge; as more users earn a particular badge, the value of that badge diminishes for all users. We show that among all possible mechanisms for assigning status-driven rewards, the optimal mechanism is a leaderboard with a cutoff: users who contribute less than a certain threshold receive nothing, while the rest are ranked by contribution. We next study the necessary features of approximately optimal mechanisms and find that approximate optimality is influenced by the convexity of status valuations. When status valuations are concave, any approximately optimal mechanism must contain a coarse status partition, i.e. a partition of users into status classes whose size will grow as the population grows. Conversely, when status valuations are convex, we prove that fine partitioning, that is, a partition of users into status classes whose size stays constant as the population grows, is necessary for approximate optimality.
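The optimal mechanism is simple to state in code: drop users below the cutoff and rank the rest by contribution. A minimal sketch, assuming contributions are given as a dict; the tie handling (tied users share a rank) is an illustrative choice, not taken from the paper.

```python
def leaderboard_with_cutoff(contributions, threshold):
    """Rank users at or above `threshold` by contribution (descending);
    users below the threshold receive nothing.  Ties share a rank."""
    ranked = sorted(
        ((u, c) for u, c in contributions.items() if c >= threshold),
        key=lambda uc: -uc[1])
    ranks, prev, rank = {}, None, 0
    for i, (u, c) in enumerate(ranked, 1):
        if c != prev:  # new contribution level starts a new rank
            rank, prev = i, c
        ranks[u] = rank
    return ranks
```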
Mapping Temporal Horizons: Analysis of Collective Future and Past related Attention in Twitter BIBAFull-Text 484-494
  Adam Jatowt; Émilien Antoine; Yukiko Kawai; Toyokazu Akiyama
Microblogging platforms such as Twitter have recently received much attention as great sources for live web sensing, real-time event detection and opinion analysis. Previous works usually assumed that tweets mainly describe "what's happening now". However, a large portion of tweets contains time expressions that refer to time frames within the past or the future. Such messages often reflect expectations or memories of social media users. In this work we investigate how microblogging users collectively refer to time. In particular, we analyze a half-year collection of Japanese tweets and a four-month collection of US tweets, and we quantify the collective temporal attention of users as well as other related temporal characteristics. This kind of knowledge is helpful in the context of growing interest in the detection and prediction of important events within social media. The exploratory analysis we perform is possible thanks to the development of a visual analytics framework for robust overview and easy detection of various regularities in the past- and future-oriented thinking of Twitter users. We believe that the visualizations we provide and the findings we outline can also be valuable for sociologists and computer scientists to test and refine their models about time in natural language.
Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts BIBAFull-Text 495-505
  Madhav Jha; C. Seshadhri; Ali Pinar
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs on truly massive graphs employs clusters and massive parallelization.
   We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
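A simpler relative of 3-path sampling is wedge sampling for triangle counting, which conveys the same sample-and-rescale idea: draw small paths uniformly, measure the fraction that complete the target pattern, and scale by the total path count. This is a hedged stand-in for the triangle case only, not the paper's 4-vertex algorithm.

```python
import random

def estimate_triangles(adj, samples=2000, seed=0):
    """Estimate the triangle count of an undirected graph given as an
    adjacency dict {vertex: set(neighbours)} by uniform wedge sampling."""
    rng = random.Random(seed)
    verts = [v for v in adj if len(adj[v]) >= 2]
    # A vertex v is the centre of deg(v)-choose-2 wedges (paths u-v-w).
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in verts]
    total_wedges = sum(weights)
    closed = 0
    for _ in range(samples):
        v = rng.choices(verts, weights=weights)[0]
        u, w = rng.sample(sorted(adj[v]), 2)
        if w in adj[u]:  # the wedge closes into a triangle
            closed += 1
    # Each triangle contains exactly 3 wedges, hence the division by 3.
    return closed / samples * total_wedges / 3
```

On a complete graph K4, every wedge is closed, so the estimator returns the exact count of 4 triangles; the paper's contribution is extending this style of estimator, with variance-reduction pruning and formal error bounds, to all 4-vertex patterns.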
Automatic Online Evaluation of Intelligent Assistants BIBAFull-Text 506-516
  Jiepu Jiang; Ahmed Hassan Awadallah; Rosie Jones; Umut Ozertem; Imed Zitouni; Ranjitha Gurunath Kulkarni; Omar Zia Khan
Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, it is challenging to evaluate them due to the varied and evolving number of tasks supported, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answers, it is expensive to evaluate each task individually. This paper is the first attempt to solve this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant as well as its components, i.e., speech recognition and intent classification. Using this approach, we can potentially evaluate and compare different tasks within and across intelligent assistants according to the predicted user satisfaction rates. Our approach is characterized by an automatic scheme of categorizing user-system interaction into task-independent dialog actions, e.g., the user is commanding, selecting, or confirming an action. We use the action sequence in a session to predict user satisfaction and the quality of speech recognition and intent classification. We also incorporate other features to further improve our approach, including features derived from previous work on web search satisfaction prediction, and those utilizing acoustic characteristics of voice requests. We evaluate our approach using data collected from a user study. Results show our approach can accurately identify satisfactory and unsatisfactory sessions.
Incorporating Social Context and Domain Knowledge for Entity Recognition BIBAFull-Text 517-526
  Jie Tang; Zhanpeng Fang; Jimeng Sun
Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedical domains. Large concept spaces and instance ambiguity are key issues that need to be addressed. Most of the documents are created in a social context by common authors via social interactions, such as replies and citations. Such social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances?
   In this paper, we propose the SOCINST model to formalize the problem into a probabilistic model. Given a set of short documents (e.g., tweets or paper abstracts) posted by users who may connect with each other, SOCINST can automatically construct a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance. The model is also able to incorporate social relationships between users to help build social context. We further incorporate domain knowledge into the model using a Dirichlet tree distribution.
   We evaluate the proposed model on three different genres of datasets: ICDM'12 Contest, Weibo, and I2B2. In ICDM'12 Contest, the proposed model clearly outperforms (+21.4%; p < 1e-5 with t-test) all the top contestants. In Weibo and I2B2, our results also show that the recognition accuracy of SOCINST is up to 5.3-26.6% better than that of several alternative methods.
Querying Web-Scale Information Networks Through Bounding Matching Scores BIBAFull-Text 527-537
  Jiahui Jin; Samamon Khemmarat; Lixin Gao; Junzhou Luo
Web-scale information networks containing billions of entities are common nowadays. Querying these networks can be modeled as a subgraph matching problem. Since information networks are incomplete and noisy in nature, it is important to discover answers that match exactly as well as answers that are similar to queries. Existing graph matching algorithms usually use graph indices to improve the efficiency of query processing. For web-scale information networks, it may not be feasible to build the graph indices due to the amount of work and the memory/storage required. In this paper, we propose an efficient algorithm for finding the best k answers for a given query without precomputing graph indices. The quality of an answer is measured by a matching score that is computed online. To speed up query processing, we propose a novel technique for bounding the matching scores during the computation. By using bounds, we can efficiently prune the answers that have low qualities without having to evaluate all possible answers. The bounding technique can be implemented in a distributed environment, allowing our approach to efficiently answer the queries on web-scale information networks. We demonstrate the effectiveness and the efficiency of our approach through a series of experiments on real-world information networks. The results show that our bounding technique can reduce the running time by up to two orders of magnitude compared to an approach that does not use bounds.
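The pruning idea, skip any candidate whose cheap upper bound cannot beat the current k-th best score, can be sketched generically. `score` and `upper_bound` are hypothetical callables standing in for the paper's online matching scores and bounds; the sketch only illustrates the control flow.

```python
import heapq

def top_k(candidates, k, score, upper_bound):
    """Return the k best (score, candidate) pairs and how many full
    score evaluations were needed.  A min-heap holds the current top k;
    upper_bound(c) must never underestimate score(c)."""
    heap = []
    evaluated = 0
    for c in candidates:
        if len(heap) == k and upper_bound(c) <= heap[0][0]:
            continue  # pruned: cannot enter the top k
        evaluated += 1
        s = score(c)
        if len(heap) < k:
            heapq.heappush(heap, (s, c))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, c))
    return sorted(heap, reverse=True), evaluated
```

If candidates arrive in roughly descending promise, almost everything after the first k is pruned without a full evaluation, which is where the two-orders-of-magnitude savings come from.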
LN-Annote: An Alternative Approach to Information Extraction from Emails using Locally-Customized Named-Entity Recognition BIBAFull-Text 538-548
  YoungHoon Jung; Karl Stratos; Luca P. Carloni
Personal mobile devices offer a growing variety of personalized services that enrich considerably the user experience. This is made possible by increased access to personal information, which to a large extent is extracted from user email messages and archives. There are, however, two main issues. First, currently these services can be offered only by large web-service companies that can also deploy email services. Second, keeping a large amount of structured personal information on the cloud raises privacy concerns. To address these problems, we propose LN-Annote, a new method to extract personal information from the email that is locally available on mobile devices (without remote access to the cloud). LN-Annote enables third-party service providers to build a question-answering system on top of the local personal information without having to own the user data. In addition, LN-Annote mitigates the privacy concerns by keeping the structured personal information directly on the personal device. Our method is based on a named-entity recognizer trained in two separate steps: first using a common dataset on the cloud and then using a personal dataset in the mobile device at hand. Our contributions also include the optimization of the implementation of LN-Annote: in particular, we implemented an OpenCL version of the custom-training algorithm to leverage the Graphic Processing Unit (GPU) available on the mobile device. We present an extensive set of experimental results: besides proving the feasibility of our approach, they demonstrate its efficiency in terms of the named-entity extraction performance as well as the execution speed and energy consumption on mobile devices.
Describing and Understanding Neighborhood Characteristics through Online Social Media BIBAFull-Text 549-559
  Mohamed Kafsi; Henriette Cramer; Bart Thomee; David A. Shamma
Geotagged data can be used to describe regions in the world and discover local themes. However, not all data produced within a region is necessarily specifically descriptive of that area. To surface the content that is characteristic for a region, we present the geographical hierarchy model (GHM), a probabilistic model based on the assumption that data observed in a region is a random mixture of content that pertains to different levels of a hierarchy. We apply the GHM to a dataset of 8 million Flickr photos in order to discriminate between content (i.e. tags) that specifically characterizes a region (e.g. neighborhood) and content that characterizes surrounding areas or more general themes. Knowledge of the discriminative and non-discriminative terms used throughout the hierarchy enables us to quantify the uniqueness of a given region and to compare similar but distant regions. Our evaluation demonstrates that our model improves upon traditional Naive Bayes classification by 47% and hierarchical TF-IDF by 27%. We further highlight the differences and commonalities with human reasoning about what is locally characteristic for a neighborhood, distilled from ten interviews and a survey that covered themes such as time, events, and prior regional knowledge.
Active Learning for Multi-relational Data Construction BIBAFull-Text 560-569
  Hiroshi Kajino; Akihiro Kishimoto; Adi Botea; Elizabeth Daly; Spyros Kotoulas
Knowledge on the Web relies heavily on multi-relational representations, such as RDF and Schema.org. Automatically extracting knowledge from documents and linking existing databases are common approaches to construct multi-relational data. Complementary to such approaches, there is still a strong demand for manually encoding human expert knowledge. For example, human annotation is necessary for constructing a common-sense knowledge base, which stores facts implicitly shared in a community, because such knowledge rarely appears in documents. As human annotation is both tedious and costly, an important research challenge is how to best use limited human resources, while maximizing the quality of the resulting dataset. In this paper, we formalize the problem of dataset construction as an active learning problem and present the Active Multi-relational Data Construction (AMDC) method. AMDC repeatedly interleaves multi-relational learning and expert input acquisition, allowing us to acquire helpful labels for data construction. Experiments on real datasets demonstrate that our solution increases the number of positive triples by a factor of 2.28 to 17.0, and that the predictive performance of the multi-relational model in AMDC achieves the highest performance, or performance comparable to the best, throughout the data construction process.
The Social World of Content Abusers in Community Question Answering BIBAFull-Text 570-580
  Imrul Kayes; Nicolas Kourtellis; Daniele Quercia; Adriana Iamnitchi; Francesco Bonchi
Community-based question answering platforms can be rich sources of information on a variety of specialized topics, from finance to cooking. The usefulness of such platforms depends heavily on user contributions (questions and answers), but also on respecting the community rules. As a crowd-sourced service, such platforms rely on their users for monitoring and flagging content that violates community rules. Common wisdom is to eliminate the users who receive many flags. Our analysis of a year of traces from a mature Q&A site shows that the number of flags does not tell the full story: on one hand, users with many flags may still contribute positively to the community. On the other hand, users who never get flagged are found to violate community rules and get their accounts suspended. This analysis, however, also shows that abusive users are betrayed by their network properties: we find strong evidence of homophilous behavior and use this finding to detect abusive users who go under the community radar. Based on our empirical observations, we build a classifier that is able to detect abusive users with an accuracy as high as 83%.
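The homophily finding can be turned into a simple network feature: the fraction of a user's neighbours who are already flagged. A toy sketch of that single feature; the paper's classifier combines much richer network properties, and the names here are illustrative.

```python
def homophily_score(user, flagged, friends):
    """Fraction of `user`'s neighbours in the `flagged` set.
    `friends` maps each user to the set of their neighbours;
    users with no neighbours score 0.0."""
    nbrs = friends.get(user, set())
    if not nbrs:
        return 0.0
    return len(nbrs & flagged) / len(nbrs)
```

A high score suggests an abusive user who flies under the community radar, even if the user personally has few or no flags.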
The Lifecycles of Apps in a Social Ecosystem BIBAFull-Text 581-591
  Isabel Kloumann; Lada Adamic; Jon Kleinberg; Shaomei Wu
Apps are emerging as an important form of on-line content, and they combine aspects of Web usage in interesting ways -- they exhibit a rich temporal structure of user adoption and long-term engagement, and they exist in a broader social ecosystem that helps drive these patterns of adoption and engagement. It has been difficult, however, to study apps in their natural setting since this requires a simultaneous analysis of a large set of popular apps and the underlying social network they inhabit. In this work we address this challenge through an analysis of the collection of apps on Facebook Login, developing a novel framework for analyzing both temporal and social properties. At the temporal level, we develop a retention model that represents a user's tendency to return to an app using a very small parameter set. At the social level, we organize the space of apps along two fundamental axes -- popularity and sociality -- and we show how a user's probability of adopting an app depends both on properties of the local network structure and on the match between the user's attributes, his or her friends' attributes, and the dominant attributes within the app's user population. We also develop models that show the importance of different feature sets with strong performance in predicting app success.
Getting More for Less: Optimized Crowdsourcing with Dynamic Tasks and Goals BIBAFull-Text 592-602
  Ari Kobren; Chun How Tan; Panagiotis Ipeirotis; Evgeniy Gabrilovich
In crowdsourcing systems, the interests of contributing participants and system stakeholders are often not fully aligned. Participants seek to learn, be entertained, and perform easy tasks, which offer them instant gratification; system stakeholders want users to complete more difficult tasks, which bring higher value to the crowdsourced application. We directly address this problem by presenting techniques that optimize the crowdsourcing process by jointly maximizing the user longevity in the system and the true value that the system derives from user participation.
   We first present models that predict the "survival probability" of a user at any given moment, that is, the probability that a user will proceed to the next task offered by the system. We then leverage this survival model to dynamically decide what task to assign and what motivating goals to present to the user. This allows us to jointly optimize for the short term (getting difficult tasks done) and for the long term (keeping users engaged for longer periods of time).
   We show that dynamically assigning tasks significantly increases the value of a crowdsourcing system. In an extensive empirical evaluation, we observed that our task allocation strategy increases the amount of information collected by up to 117.8%. We also explore the utility of motivating users with goals. We demonstrate that setting specific, static goals can be highly detrimental to the long-term user participation, as the completion of a goal (e.g., earning a badge) is also a common drop-off point for many users. We show that setting the goals dynamically, in conjunction with judicious allocation of tasks, increases the amount of information collected by the crowdsourcing system by up to 249%, compared to the existing baselines that use fixed objectives.
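The short-term/long-term trade-off above can be sketched as a one-step lookahead: assign the task that maximizes immediate value plus survival-weighted continuation value. A toy version with hypothetical task names and numbers, not the paper's full survival model.

```python
def choose_task(tasks, continuation_value):
    """Pick the task maximizing immediate value plus the survival
    probability times the value of keeping the user in the system.
    `tasks` maps name -> (immediate_value, survival_probability)."""
    return max(tasks,
               key=lambda t: tasks[t][0] + tasks[t][1] * continuation_value)
```

When little future value is at stake the allocator pushes the difficult, high-value task; when retaining the user matters more, it switches to the easy task with the higher survival probability, which is exactly the dynamic behaviour the paper optimizes for.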
Evolution of Conversations in the Age of Email Overload BIBAFull-Text 603-613
  Farshad Kooti; Luca Maria Aiello; Mihajlo Grbovic; Kristina Lerman; Amin Mantrach
Email is a ubiquitous communications tool in the workplace and plays an important role in social interactions. Previous studies of email were largely based on surveys and limited to relatively small populations of email users within organizations. In this paper, we report results of a large-scale study of more than 2 million users exchanging 16 billion emails over several months. We quantitatively characterize the replying behavior in conversations within pairs of users. In particular, we study the time it takes the user to reply to a received message and the length of the reply sent. We consider a variety of factors that affect the reply time and length, such as the stage of the conversation, user demographics, and use of portable devices. In addition, we study how increasing load affects emailing behavior. We find that as users receive more email messages in a day, they reply to a smaller fraction of them, using shorter replies. However, their responsiveness remains intact, and they may even reply to emails faster. Finally, we predict the time to reply, length of reply, and whether the reply ends a conversation. We demonstrate considerable improvement over the baseline in all three prediction tasks, showing the significant role that the factors we uncover play in determining replying behavior. We rank these factors based on their predictive power. Our findings have important implications for understanding human behavior and designing better email management applications for tasks like ranking unread emails.
Events and Controversies: Influences of a Shocking News Event on Information Seeking BIBAFull-Text 614-624
  Danai Koutra; Paul N. Bennett; Eric Horvitz
It has been suggested that online search and retrieval contributes to the intellectual isolation of users within their preexisting ideologies, where people's prior views are strengthened and alternative viewpoints are infrequently encountered. This so-called "filter bubble" phenomenon has been called out as especially detrimental when it comes to dialog among people on controversial, emotionally charged topics, such as the labeling of genetically modified food, the right to bear arms, the death penalty, and online privacy. We seek to identify and study information-seeking behavior and access to alternative versus reinforcing viewpoints following shocking, emotional, and large-scale news events. We choose for a case study to analyze search and browsing on gun control/rights, a strongly polarizing topic for both citizens and leaders of the United States. We study the period of time preceding and following a mass shooting to understand how its occurrence, follow-on discussions, and debate may have been linked to changes in the patterns of searching and browsing. We employ information-theoretic measures to quantify the diversity of Web domains of interest to users and understand the browsing patterns of users. We use these measures to characterize the influence of news events on these web search and browsing patterns.
Statistically Significant Detection of Linguistic Change BIBAFull-Text 625-635
  Vivek Kulkarni; Rami Al-Rfou; Bryan Perozzi; Steven Skiena
We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.
   We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
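The final two steps, a distance-based time series per word and change-point detection over it, can be sketched minimally: cosine distances between a word's per-period vectors, then the split point that maximizes the gap between segment means. This stand-in is far cruder than the statistically sound detectors the paper uses, and all names are illustrative.

```python
import math

def cosine_dist(u, v):
    """1 - cosine similarity between two (non-zero) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (nu * nv)

def change_point(series):
    """Index that best splits the series into two segments with
    maximally different means -- a bare-bones mean-shift detector."""
    best, best_gap = None, 0.0
    for i in range(1, len(series)):
        left = sum(series[:i]) / i
        right = sum(series[i:]) / (len(series) - i)
        if abs(right - left) > best_gap:
            best, best_gap = i, abs(right - left)
    return best
```

In practice one would build `series` from `cosine_dist` between a word's embedding in period t and a fixed reference period (after aligning the vector spaces), and replace the mean-shift step with a detector that reports statistical significance.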
Replacing the Irreplaceable: Fast Algorithms for Team Member Recommendation BIBAFull-Text 636-646
  Liangyue Li; Hanghang Tong; Nan Cao; Kate Ehrlich; Yu-Ru Lin; Norbou Buchler
In this paper, we study the problem of TEAM MEMBER REPLACEMENT -- given a team of people embedded in a social network working on the same task, find a good candidate to best replace a team member who becomes unavailable to perform the task for a certain reason (e.g., conflicts of interests or resource capacity). Prior studies in teamwork have suggested that a good team member replacement should bring synergy to the team in terms of having both skill matching and structure matching. However, existing techniques either do not cover both aspects or consider the two aspects independently. In this work, we propose a novel problem formulation using the concept of graph kernels that takes into account the interaction of both skill and structure matching requirements. To tackle the computational challenges, we propose a family of fast algorithms by (a) designing effective pruning strategies, and (b) exploring the smoothness between the existing and the new team structures. We conduct extensive experimental evaluations and user studies on real world datasets to demonstrate the effectiveness and efficiency. Our algorithms (a) perform significantly better than the alternative choices in terms of both precision and recall and (b) scale sub-linearly.
Robust Group Linkage BIBAFull-Text 647-657
  Pei Li; Xin Luna Dong; Songtao Guo; Andrea Maurino; Divesh Srivastava
We study the problem of group linkage: linking records that refer to multiple entities in the same group. Applications for group linkage include finding businesses in the same chain, finding social network users from the same organization, and so on. Group linkage faces new challenges compared to traditional entity resolution. First, although different members in the same group can share some similar global values of an attribute, they represent different entities so can also have distinct local values for the same or different attributes, requiring a high tolerance for value diversity. Second, we need to be able to distinguish local values from erroneous values.
   We present a robust two-stage algorithm: the first stage identifies pivots -- maximal sets of records that are very likely to belong to the same group, while being robust to possible erroneous values; the second stage collects strong evidence from the pivots and leverages it for merging more records into the same group, while being tolerant to differences in local values of an attribute. Experimental results show the high effectiveness and efficiency of our algorithm on various real-world data sets.
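The first stage can be approximated by taking connected components of records under a strict, high-precision similarity predicate. A union-find sketch with a hypothetical `similar` predicate; the paper's pivot definition is more careful, in particular about robustness to erroneous values.

```python
def find_pivots(records, similar, min_size=2):
    """Group records into pivots: connected components under the
    high-precision `similar(a, b)` predicate, keeping only components
    of at least `min_size` records."""
    parent = {r: r for r in records}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if similar(a, b):
                parent[find(a)] = find(b)
    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return [g for g in groups.values() if len(g) >= min_size]
```

Stage two would then collect shared evidence (e.g. common attribute values) from each pivot and use it to absorb further records whose local values differ, which this sketch omits.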
Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach BIBAFull-Text 658-668
  Yixuan Li; Kun He; David Bindel; John E. Hopcroft
Large graphs arise in a number of contexts and understanding their structure and extracting information from them is an important research area. Early algorithms on mining communities have focused on the global structure, and often run in time proportional to the size of the entire graph. Nowadays, as we often explore networks with billions of vertices and find communities of size hundreds, it is crucial to shift our attention from macroscopic structure to microscopic structure when dealing with large networks. A growing body of work has been adopting local expansion methods in order to identify the community from a few exemplary seed members.
   In this paper, we propose a novel approach for finding overlapping communities called LEMON (Local Expansion via Minimum One Norm). Different from PageRank-like diffusion methods, LEMON finds the community by seeking a sparse vector in the span of the local spectra such that the seeds are in its support. We show that LEMON can achieve the highest detection accuracy among state-of-the-art proposals. The running time depends on the size of the community rather than that of the entire graph. The algorithm is easy to implement, and is highly parallelizable.
   Moreover, given that networks are not all similar in nature, a comprehensive analysis on how the local expansion approach is suited for uncovering communities in different networks is still lacking. We thoroughly evaluate our approach using both synthetic and real-world datasets across different domains, and analyze the empirical variations when applying our method to inherently different networks in practice. In addition, we provide heuristics on how the quality and quantity of the seed set affect performance.
Scalable Parallel EM Algorithms for Latent Dirichlet Allocation in Multi-Core Systems BIBAFull-Text 669-679
  Xiaosheng Liu; Jia Zeng; Xi Yang; Jianfeng Yan; Qiang Yang
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic modeling tool for content analysis such as web mining. To handle web-scale content analysis on just a single PC, we propose multi-core parallel expectation-maximization (PEM) algorithms to infer and estimate LDA parameters in shared memory systems. By avoiding memory access conflicts, reducing locking time among multiple threads, and using residual-based dynamic scheduling, we show that PEM algorithms are more scalable and accurate than the current state-of-the-art parallel LDA algorithms on a commodity PC. This parallel LDA toolbox is made publicly available as open source software at mloss.org.
Grading the Graders: Motivating Peer Graders in a MOOC BIBAFull-Text 680-690
  Yanxin Lu; Joe Warren; Christopher Jermaine; Swarat Chaudhuri; Scott Rixner
In this paper, we detail our efforts at creating and running a controlled study designed to examine how students in a MOOC might be motivated to do a better job during peer grading. This study involves more than one thousand students of a popular MOOC. We ask two specific questions: (1) When a student knows that his or her own peer grading efforts are being examined by peers, does this knowledge alone tend to motivate the student to do a better job when grading assignments? And (2) when a student not only knows that his or her own peer grading efforts are being examined by peers, but he or she is also given a number of other peer grading efforts to evaluate (so the peer graders see how other peer graders evaluate assignments), do both of these together tend to motivate the student to do a better job when grading assignments? We find strong statistical evidence that "grading the graders" does in fact tend to increase the quality of peer grading.
Measurement and Analysis of Mobile Web Cache Performance BIBAFull-Text 691-701
  Yun Ma; Xuanzhe Liu; Shuhui Zhang; Ruirui Xiang; Yunxin Liu; Tao Xie
The Web browser is a killer app on mobile devices such as smartphones. However, the user experience of mobile Web browsing is often poor because of slow resource loading. Caching has been adopted as a key mechanism for improving the performance of Web resource loading. However, existing passive measurement studies cannot comprehensively characterize the performance of mobile Web caching. For example, most of these studies focus mainly on client-side implementations rather than server-side configurations, suffer from biased user behaviors, and fail to study "miscached" resources. To address these issues, in this paper, we present a proactive approach for a comprehensive measurement study of mobile Web cache performance. The key idea of our approach is to proactively crawl resources from hundreds of websites periodically at a fine-grained time interval. Thus, we are able to uncover the resource update history and cache configurations at the server side, and to analyze cache performance at various time granularities. Based on our collected data, we build a new cache analysis model and study both the upper bound on the percentage of resources that could potentially be cached and how effectively caching works in practice. We report detailed analysis results for different websites and various types of Web resources, and identify two major problems that lead to unsatisfactory cache performance: Redundant Transfer and Miscached Resource. We investigate three main root causes, namely Same Content, Heuristic Expiration, and Conservative Expiration Time, and discuss what mobile Web developers can do to mitigate these problems.

Technical Papers 2

SCULPT: A Schema Language for Tabular Data on the Web BIBAFull-Text 702-720
  Wim Martens; Frank Neven; Stijn Vansummeren
Inspired by the recent working effort towards a recommendation by the World Wide Web Consortium (W3C) for tabular data and metadata on the Web, we present in this paper a concept for a schema language for tabular web data called SCULPT. The language consists of rules constraining and defining the structure of regions in the table. These regions are defined through the novel formalism of region selection expressions. We present a formal model for SCULPT and obtain an evaluation algorithm with linear-time combined complexity. In addition, we consider weak and strong streaming evaluation for SCULPT and present a SCULPT fragment for each of these streaming variants. Finally, we discuss several extensions of SCULPT, including alternative semantics, types, and complex content, and explore region selection expressions as a basis for a transformation language.
The Web as a Jungle: Non-Linear Dynamical Systems for Co-evolving Online Activities BIBAFull-Text 721-731
  Yasuko Matsubara; Yasushi Sakurai; Christos Faloutsos
Given a large collection of co-evolving online activities, such as searches for the keywords "Xbox", "PlayStation" and "Wii", how can we find patterns and rules? Are these keywords related? If so, are they competing against each other? Can we forecast the volume of user activity for the coming month? We conjecture that online activities compete for user attention in the same way that species in an ecosystem compete for food. We present ECOWEB (Ecosystem on the Web), an intuitive model designed as a non-linear dynamical system for mining large-scale co-evolving online activities. Our second contribution is a novel, parameter-free, and scalable fitting algorithm, ECOWEB-FIT, that estimates the parameters of ECOWEB. Extensive experiments on real data show that ECOWEB is effective, in that it can capture long-range dynamics and meaningful patterns such as seasonalities, and practical, in that it can provide accurate long-range forecasts. ECOWEB consistently outperforms existing methods in terms of both accuracy and execution speed.
Spanning Edge Centrality: Large-scale Computation and Applications BIBAFull-Text 732-742
  Charalampos Mavroforakis; Richard Garcia-Lebron; Ioannis Koutis; Evimaria Terzi
The spanning centrality of an edge e in an undirected graph G is the fraction of the spanning trees of G that contain e. Despite its appealing definition and apparent value in certain applications in computational biology, spanning centrality has not so far received wider attention as a measure of edge centrality. We may partially attribute this to the perceived complexity of computing it, which appears to be prohibitive for very large networks. Contrary to this intuition, spanning centrality can in fact be approximated arbitrarily well by very efficient near-linear time algorithms due to Spielman and Srivastava, combined with progress in linear system solvers. In this article we bring theory into practice, with careful and optimized implementations that allow the fast computation of spanning centrality in very large graphs with millions of nodes. With this computational tool at our disposal, we demonstrate experimentally that spanning centrality is in fact a useful tool for the analysis of large networks. Specifically, we show that, relative to common centrality measures, spanning centrality is more effective in identifying edges whose removal causes a higher disruption in an information propagation procedure, while being very resilient to noise, in terms of both the edge scores and the resulting edge ranking.
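The definition above can be checked directly on small graphs: enumerate all spanning trees and count, per edge, the fraction that contain it. The brute-force sketch below is illustrative only (the paper's point is that near-linear approximations exist; this exponential baseline is only feasible for a handful of nodes):

```python
from itertools import combinations

def spanning_centrality(nodes, edges):
    """Exact spanning centrality by brute force: for each edge, the
    fraction of spanning trees of the graph that contain it."""
    n = len(nodes)

    def is_spanning_tree(tree):
        # n-1 edges with no cycle over all n nodes <=> spanning tree.
        parent = {v: v for v in nodes}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for u, v in tree:
            ru, rv = find(u), find(v)
            if ru == rv:
                return False  # edge closes a cycle
            parent[ru] = rv
        return True

    trees = [t for t in combinations(edges, n - 1) if is_spanning_tree(t)]
    return {e: sum(e in t for t in trees) / len(trees) for e in edges}
```

On a triangle, every edge lies in 2 of the 3 spanning trees, so each edge scores 2/3; a bridge always scores 1, since every spanning tree must contain it.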
No Escape From Reality: Security and Privacy of Augmented Reality Browsers BIBAFull-Text 743-753
  Richard McPherson; Suman Jana; Vitaly Shmatikov
Augmented reality (AR) browsers are an emerging category of mobile applications that add interactive virtual objects to the user's view of the physical world. This paper gives the first system-level evaluation of their security and privacy properties.
   We start by analyzing the functional requirements that AR browsers must support in order to present AR content. We then investigate the security architecture of Junaio, Layar, and Wikitude browsers, which are running today on over 30 million mobile devices, and identify new categories of security and privacy vulnerabilities unique to AR browsers. Finally, we provide the first engineering guidelines for securely implementing AR functionality.
Discovering Meta-Paths in Large Heterogeneous Information Networks BIBAFull-Text 754-764
  Changping Meng; Reynold Cheng; Silviu Maniu; Pierre Senellart; Wangda Zhang
The Heterogeneous Information Network (HIN) is a graph data model in which nodes and edges are annotated with class and relationship labels. Large and complex datasets, such as Yago or DBLP, can be modeled as HINs. Recent work has studied how to make use of these rich information sources. In particular, meta-paths, which represent sequences of node classes and edge types between two nodes in a HIN, have been proposed for such tasks as information retrieval, decision making, and product recommendation. Current methods assume meta-paths are found by domain experts. However, in a large and complex HIN, retrieving meta-paths manually can be tedious and difficult. We thus study how to discover meta-paths automatically. Specifically, users are asked to provide example pairs of nodes that exhibit high proximity. We then investigate how to generate meta-paths that can best explain the relationship between these node pairs. Since this problem is computationally intractable, we propose a greedy algorithm to select the most relevant meta-paths. We also present a data structure to enable efficient execution of this algorithm. We further incorporate hierarchical relationships among node classes in our solutions. Extensive experiments on real-world HINs show that our approach captures important meta-paths in an efficient and scalable manner.
From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search BIBAFull-Text 765-775
  Iris Miliaraki; Roi Blanco; Mounia Lalmas
Consider a user who submits the search query "Shakira" with a specific search goal in mind (such as her age) but who is at the same time willing to explore information about other entities related to her, such as comparable singers. In previous work, a system called Spark was developed to provide such a search experience. Given a query submitted to the Yahoo search engine, Spark provides related entity suggestions for the query, exploiting, among other sources, public knowledge bases from the Semantic Web. We refer to this search scenario as explorative entity search. The effectiveness and efficiency of the approach have been demonstrated in previous work. How users interact with these related entity suggestions, and whether this interaction can be predicted, have however not been studied. In this paper, we perform a large-scale analysis of how users interact with the entity results returned by Spark. We characterize the users, queries and sessions that appear to promote an explorative behavior. Based on this analysis, we develop a set of query- and user-based features that reflect the click behavior of users and explore their effectiveness in the context of a prediction task.
Children Seen But Not Heard: When Parents Compromise Children's Online Privacy BIBAFull-Text 776-786
  Tehila Minkus; Kelvin Liu; Keith W. Ross
Children's online privacy has garnered much attention in media, legislation, and industry. Adults are concerned that children may not adequately protect themselves online. However, relatively little discussion has focused on the privacy breaches that may occur to children at the hands of others, namely, their parents and relatives. When adults post information online, they may reveal personal information about their children to other people, online services, data brokers, or surveillant authorities. This information can be gathered in an automated fashion and then linked with other online and offline sources, creating detailed profiles which can be continually enhanced throughout the children's lives.
   In this paper, we conduct a study to see how widespread these behaviors are among adults on Facebook and Instagram. We use a number of methods. Firstly, we automate a process to examine 2,383 adult users on Facebook for evidence of children in their public photo albums. Using the associated comments in combination with publicly available voter registration records, we are able to infer children's names, faces, birth dates, and addresses. Secondly, in order to understand what additional information is available to Facebook and the users' friends, we survey 357 adult Facebook users about their behaviors and attitudes with regard to posting their children's information online. Thirdly, we analyze 1,089 users on Instagram to infer facts about their children.
   Finally, we make recommendations for privacy-conscious parents and suggest an interface change through which Facebook can nudge parents towards better stewardship of their children's privacy.
TrueView: Harnessing the Power of Multiple Review Sites BIBAFull-Text 787-797
  Amanda J. Minnich; Nikan Chavoshi; Abdullah Mueen; Shuang Luan; Michalis Faloutsos
Online reviews on products and services can be very useful for customers, but they need to be protected from manipulation. So far, most studies have focused on analyzing online reviews from a single hosting site. How could one leverage information from multiple review hosting sites? This is the key question in our work. In response, we develop a systematic methodology to merge, compare, and evaluate reviews from multiple hosting sites. We focus on hotel reviews and use more than 15 million reviews from more than 3.5 million users spanning three prominent travel sites. Our work consists of three thrusts: (a) we develop novel features capable of identifying cross-site discrepancies effectively, (b) we conduct arguably the first extensive study of cross-site variations using real data, and develop a hotel identity-matching method with 93% accuracy, and (c) we introduce the TrueView score, as a proof of concept that cross-site analysis can better inform the end user. Our results show that: (1) we detect 7 times more suspicious hotels by using multiple sites compared to using the three sites in isolation, and (2) we find that 20% of all hotels appearing in all three sites seem to have a low trustworthiness score. Our work is an early effort that explores the advantages and challenges of using multiple review sites towards more informed decision making.
QUOTUS: The Structure of Political Media Coverage as Revealed by Quoting Patterns BIBAFull-Text 798-808
  Vlad Niculae; Caroline Suen; Justine Zhang; Cristian Danescu-Niculescu-Mizil; Jure Leskovec
Given the extremely large pool of events and stories available, media outlets need to focus on a subset of issues and aspects to convey to their audience. Outlets are often accused of exhibiting a systematic bias in this selection process, with different outlets portraying different versions of reality. However, in the absence of objective measures and empirical evidence, the direction and extent of systematicity remains widely disputed. In this paper we propose a framework based on quoting patterns for quantifying and characterizing the degree to which media outlets exhibit systematic bias. We apply this framework to a massive dataset of news articles spanning the six years of Obama's presidency and all of his speeches, and reveal that a systematic pattern does indeed emerge from the outlets' quoting behavior. Moreover, we show that this pattern can be successfully exploited in an unsupervised prediction setting, to determine which new quotes an outlet will select to broadcast. By encoding bias patterns in a low-rank space we provide an analysis of the structure of political media coverage. This reveals a latent media bias space that aligns surprisingly well with political ideology and outlet type. A linguistic analysis exposes striking differences across these latent dimensions, showing how the different types of media outlets portray different realities even when reporting on the same events. For example, outlets mapped to the mainstream conservative side of the latent space focus on quotes that portray a presidential persona disproportionately characterized by negativity.
Energy and Performance of Smartphone Radio Bundling in Outdoor Environments BIBAFull-Text 809-819
  Ana Nika; Yibo Zhu; Ning Ding; Abhilash Jindal; Y. Charlie Hu; Xia Zhou; Ben Y. Zhao; Haitao Zheng
Most of today's mobile devices come equipped with both cellular LTE and WiFi wireless radios, making radio bundling (simultaneous data transfers over multiple interfaces) both appealing and practical. Despite recent studies documenting the benefits of radio bundling with MPTCP, many fundamental questions remain about potential gains from radio bundling, or the relationship between performance and energy consumption in these scenarios. In this study, we seek to answer these questions using extensive measurements to empirically characterize both energy and performance for radio bundling approaches. In doing so, we quantify potential gains of bundling using MPTCP versus an ideal protocol. We study the links between traffic partitioning and bundling performance, and use a novel componentized energy model to quantify the energy consumed by CPUs (and radios) during traffic management. Our results show that MPTCP achieves only a fraction of the total performance gain possible, and that its energy-agnostic design leads to considerable power consumption by the CPU. We conclude that not only is there room for improved bundling performance, but an energy-aware bundling protocol is likely to achieve a much better tradeoff between performance and power consumption.
PriVaricator: Deceiving Fingerprinters with Little White Lies BIBAFull-Text 820-830
  Nick Nikiforakis; Wouter Joosen; Benjamin Livshits
Researchers have shown that, in recent years, unwanted web tracking is on the rise, with browser-based fingerprinting being adopted by more and more websites as a viable alternative to third-party cookies. In this paper we propose PriVaricator, a solution to the problem of browser-based fingerprinting. A key insight is that, when it comes to web tracking, the real problem with fingerprinting is not the uniqueness of a fingerprint but its linkability, i.e., the ability to connect the same fingerprint across multiple visits. Thus, making fingerprints non-deterministic also makes them hard to link across browsing sessions. In PriVaricator we use the power of randomization to "break" linkability by exploring a space of parameterized randomization policies. We evaluate our techniques in terms of their ability to prevent fingerprinting without breaking existing (benign) sites. The best of our randomization policies renders all the fingerprinters we tested ineffective, while causing minimal damage on a set of 1000 Alexa sites on which we tested, with no noticeable performance overhead.
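As a toy illustration of a parameterized randomization policy, jittering numeric fingerprint attributes makes repeated visits yield different fingerprints while keeping each value close to the truth. The attribute names and the noise level below are our own illustrative choices, not the policies evaluated in the paper:

```python
import random

def randomize_fingerprint(attributes, noise=0.05, rng=random):
    """Return a copy of numeric fingerprint attributes with up to
    +/- `noise` relative jitter; repeated calls produce different
    dictionaries, so a hash over them no longer links sessions.
    Attribute names and noise level are illustrative only."""
    return {k: round(v * (1 + rng.uniform(-noise, noise)))
            for k, v in attributes.items()}

# Two "visits" by the same browser produce different fingerprints.
fp = {"offsetWidth": 1280, "offsetHeight": 720, "fontCount": 211}
visit1 = randomize_fingerprint(fp)
visit2 = randomize_fingerprint(fp)
```

A fingerprinter hashing these attributes would see two unrelated values across the two visits, while each reported number stays within 5% of the truth, limiting breakage of benign layout code.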
Diagnoses, Decisions, and Outcomes: Web Search as Decision Support for Cancer BIBAFull-Text 831-841
  Michael J. Paul; Ryen W. White; Eric Horvitz
People diagnosed with a serious illness often turn to the Web for their rising information needs, especially when decisions are required. We analyze the search and browsing behavior of searchers who show a surge of interest in prostate cancer. Prostate cancer is the most common serious cancer in men and is a leading cause of cancer-related death. Diagnoses of prostate cancer typically involve reflection and decision making about treatment based on assessments of preferences and outcomes. We annotated timelines of treatment-related queries from nearly 300 searchers with tags indicating different phases of treatment, including decision making, preparation, and recovery. Using this corpus, we present a variety of analyses toward the goal of understanding search and decision making about treatments. We characterize search queries and the content of accessed pages for different treatment phases, model search behavior during the decision-making phase, and create an aggregate alignment of treatment timelines illustrated with a variety of visualizations. The experiments provide insights about how people who are engaged in intensive searches about prostate cancer over an extended period of time pursue and access information from the Web.
PocketTrend: Timely Identification and Delivery of Trending Search Content to Mobile Users BIBAFull-Text 842-852
  Gennady Pekhimenko; Dimitrios Lymberopoulos; Oriana Riva; Karin Strauss; Doug Burger
Trending search topics cause unpredictable query load spikes that hurt the end-user search experience, particularly the mobile one, by introducing longer delays. To understand how trending search topics form and evolve over time, we analyze 21 million queries submitted during periods where popular events caused search query volume spikes. Based on our findings, we design and evaluate PocketTrend, a system that automatically detects trending topics in real time, identifies the search content associated with the topics, and then intelligently pushes this content to users in a timely manner. In this way, PocketTrend enables a client-side search engine that can instantly answer user queries related to trending events, while at the same time reducing the impact of these trends on the datacenter workload. Our results, using real mobile search logs, show that in the presence of a trending event, up to 13-17% of the overall search traffic can be eliminated from the datacenter, with as many as 19% of all users benefiting from PocketTrend.
Overcoming Relational Learning Biases to Accurately Predict Preferences in Large Scale Networks BIBAFull-Text 853-863
  Joseph J. Pfeiffer, III; Jennifer Neville; Paul N. Bennett
Many individuals on social networking sites provide traits about themselves, such as interests or demographics. Social networking sites can use this information to provide better content to match their users' interests, such as recommending scheduled events or various relevant products. These tasks require accurate probability estimates to determine the correct answer to return. Relational machine learning (RML) is an excellent framework for these problems as it jointly models the user labels given their attributes and the relational structure. Further, semi-supervised learning methods could enable RML methods to exploit the large amount of unlabeled data in networks.
   However, existing RML approaches have limitations that prevent their application in large scale domains. First, semi-supervised methods for RML do not fully utilize all the unlabeled instances in the network. Second, the collective inference procedures necessary to jointly infer the missing labels are generally viewed as too expensive to apply in large scale domains. In this work, we address each of these limitations. We analyze the effect of full semi-supervised RML and find that collective inference methods can introduce considerable bias into predictions. We correct this by implementing a maximum entropy constraint on the inference step, forcing the predictions to have the same distribution as the observed labels. Next, we outline a massively scalable variational inference algorithm for large scale relational network domains. We extend this inference algorithm to incorporate the maximum entropy constraint, proving that it only requires a constant amount of overhead while remaining massively parallel. We demonstrate our method's improvement over a variety of baselines on seven real world datasets, including large scale networks with over five million edges.
Deriving an Emergent Relational Schema from RDF Data BIBAFull-Text 864-874
  Minh-Duc Pham; Linnea Passing; Orri Erling; Peter Boncz
We motivate and describe techniques that allow us to detect an "emergent" relational schema from RDF data. We show that on a wide variety of datasets, the found structure explains well over 90% of the RDF triples. Further, we describe technical solutions to the semantic challenge of giving these emergent tables, columns, and relationships between tables short names that humans find logical. Our techniques can be exploited in many ways, e.g., to improve the efficiency of SPARQL systems, or to use existing SQL-based applications on top of any RDF dataset using an RDBMS.
The Digital Life of Walkable Streets BIBAFull-Text 875-884
  Daniele Quercia; Luca Maria Aiello; Rossano Schifanella; Adam Davies
Walkability has many health, environmental, and economic benefits. That is why web and mobile services have been offering ways of computing walkability scores for individual street segments. Those scores are generally computed from survey data and manual counting (even of trees). However, that is costly, owing to the high time, effort, and financial costs involved. To partly automate the computation of those scores, we explore the possibility of using social media data from Flickr and Foursquare to automatically identify safe and walkable streets. We find that unsafe streets tend to be photographed during the day, while walkable streets are tagged with walkability-related keywords. These results open up practical opportunities (for, e.g., room booking services, urban route recommenders, and real-estate sites) and have theoretical implications for researchers who might resort to the use of social media data to tackle previously unanswered questions in the area of walkability.
Beyond Models: Forecasting Complex Network Processes Directly from Data BIBAFull-Text 885-895
  Bruno Ribeiro; Minh X. Hoang; Ambuj K. Singh
Complex network phenomena -- such as information cascades in online social networks -- are hard to fully observe, model, and forecast. In forecasting, a recent trend has been to forgo the use of parsimonious models in favor of models with increasingly large degrees of freedom that are trained to learn the behavior of a process from historical data. Extrapolating this trend into the future, we would eventually renounce models altogether. But is it possible to forecast the evolution of a complex stochastic process directly from the data, without a model? In this work we show that model-free forecasting is possible. We present SED, an algorithm that forecasts process statistics based on relationships of statistical equivalence, using two general axioms and historical data. To the best of our knowledge, SED is the first method that can perform axiomatic, model-free forecasts of complex stochastic processes. Our simulations using simple and complex evolving processes, together with tests performed on a large real-world dataset, show promising results.
Weakly Supervised Extraction of Computer Security Events from Twitter BIBAFull-Text 896-905
  Alan Ritter; Evan Wright; William Casey; Tom Mitchell
Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced, or when a distributed Denial of Service (DoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach, in which extractors for new categories of events are easy to define and train by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data are available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective which regularizes the label distribution towards a user-provided expectation. Our approach greatly outperforms heuristic negatives, used in most previous work, in experiments on real-world data. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DoS attacks, data breaches and account hijacking. A demonstration of security events extracted by our system is available at: http://kb1.cse.ohio-state.edu:8123/events/hacked
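The key idea in this abstract -- regularizing the predicted label distribution toward a user-provided expectation instead of treating unlabeled data as negative -- can be sketched as a loss function. The names and the quadratic penalty form below are our illustrative stand-ins, not the authors' exact objective:

```python
import math

def pu_loss(pos_probs, unl_probs, expected_pos_rate, lam=10.0):
    """Log-loss on the labeled positive examples, plus a penalty
    pulling the mean predicted positive rate on unlabeled data
    toward a user-supplied expectation (illustrative stand-in for
    label-distribution regularization)."""
    nll = -sum(math.log(p) for p in pos_probs)
    mean_unl = sum(unl_probs) / len(unl_probs)
    return nll + lam * (mean_unl - expected_pos_rate) ** 2
```

A model that labels every unlabeled tweet negative pays the regularization penalty; one whose mean prediction matches the expected positive rate (say 10%) does not, which discourages the degenerate "everything is negative" solution that heuristic-negative training encourages.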
Groupsourcing: Team Competition Designs for Crowdsourcing BIBAFull-Text 906-915
  Markus Rokicki; Sergej Zerr; Stefan Siersdorfer
Many data processing tasks, such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models, require human input and, on a large scale, can only be accurately solved using crowd-based online work. Recent work shows that frameworks where crowd workers compete against each other can drastically reduce crowdsourcing costs, and outperform conventional reward schemes where the payment of online workers is proportional to the number of accomplished tasks ("pay-per-task"). In this paper, we investigate how team mechanisms can be leveraged to further improve the cost efficiency of crowdsourcing competitions. To this end, we introduce strategies for team-based crowdsourcing, ranging from team formation processes where workers are randomly assigned to competing teams, through strategies involving self-organization where workers actively participate in team building, to combinations of team and individual competitions. Our large-scale experimental evaluation with more than 1,100 participants and a total of 5,400 hours of work spent by crowd workers demonstrates that our team-based crowdsourcing mechanisms are well accepted by online workers and lead to substantial performance boosts.
Authentication Melee: A Usability Analysis of Seven Web Authentication Systems BIBAFull-Text 916-926
  Scott Ruoti; Brent Roberts; Kent Seamons
Passwords continue to dominate the authentication landscape in spite of numerous proposals to replace them. Even though usability is a key factor in replacing passwords, very few alternatives have been subjected to formal usability studies, and even fewer have been analyzed using a standard metric. We report the results of four within-subjects usability studies for seven web authentication systems. These systems span federated, smartphone, paper tokens, and email-based approaches. Our results indicate that participants prefer single sign-on systems. We report several insightful findings based on participants' qualitative responses: (1) transparency increases usability but also leads to confusion and a lack of trust, (2) participants prefer single sign-on but wish to augment it with site-specific low-entropy passwords, and (3) participants are intrigued by biometrics and phone-based authentication. We utilize the Systems Usability Scale (SUS) as a standard metric for empirical analysis and find that it produces reliable, replicable results. SUS proves to be an accurate measure of baseline usability. We recommend that new authentication systems be formally evaluated for usability using SUS, and should meet a minimum acceptable SUS score before receiving serious consideration.
Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions BIBAFull-Text 927-937
  Ahmet Erdem Sariyuce; C. Seshadhri; Ali Pinar; Umit V. Catalyurek
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasiclique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any structural relations. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which enables discovering overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-truss decompositions. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
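Nucleus decomposition generalizes the classic k-core decomposition mentioned in the abstract above. For reference, k-cores themselves can be computed by the standard peeling procedure (a textbook sketch, not the paper's nucleus algorithm):

```python
def core_numbers(adj):
    """Classic k-core peeling: repeatedly remove a minimum-degree
    vertex. A vertex's core number is the largest k such that it
    survives in a subgraph where every vertex has degree >= k."""
    degree = {v: len(set(ns)) for v, ns in adj.items()}
    neighbors = {v: set(ns) for v, ns in adj.items()}
    core = {}
    k = 0
    while degree:
        v = min(degree, key=degree.get)  # current minimum-degree vertex
        k = max(k, degree[v])            # running peel level
        core[v] = k
        for u in neighbors[v]:           # remove v from the graph
            if u in degree:
                degree[u] -= 1
                neighbors[u].discard(v)
        del degree[v]
    return core
```

On a triangle with one pendant vertex attached, the three triangle vertices get core number 2 and the pendant gets 1; nuclei refine this picture by peeling on clique containment rather than plain degree.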
Bringing CUPID Indoor Positioning System to Practice BIBAFull-Text 938-948
  Souvik Sen; Dongho Kim; Stephane Laroche; Kyu-Han Kim; Jeongkeun Lee
WiFi-based indoor positioning has recently gained more attention due to the advent of the IEEE 802.11v standard, requirements by the FCC for E911 calls, and increased interest in location-based services. While there exist several indoor localization techniques, we find that these techniques trade off accuracy, scalability, pervasiveness, or cost -- all of which are important requirements for a truly deployable positioning solution. Wireless signal-strength based approaches suffer from location errors, whereas time-of-flight (ToF) based solutions provide good accuracy but are not scalable. Recent solutions address these issues by augmenting WiFi with either smartphone sensing or mobile crowdsourcing. However, they require tight coupling between the WiFi infrastructure and a client device, or they can determine the client's location only if it is mobile. In this paper, we present CUPID2.0, which improves our previously proposed CUPID indoor positioning system to overcome these limitations. We achieve this by addressing the fundamental limitations of Time-of-Flight based localization and by combining ToF with signal strength to address scalability. Experiments in 6 cities using 40 different mobile devices, comprising more than 2.5 million location fixes, demonstrate feasibility. CUPID2.0 is currently in production, and we expect CUPID2.0 to ignite the wide adoption of WLAN-based positioning systems and their services.
Early Detection of Spam Mobile Apps BIBAFull-Text 949-959
  Suranga Seneviratne; Aruna Seneviratne; Mohamed Ali Kaafar; Anirban Mahanti; Prasant Mohapatra
Increased popularity of smartphones has attracted a large number of developers to various smartphone platforms. As a result, app markets are also populated with spam apps, which reduce the users' quality of experience and increase the workload of app market operators. Apps can be "spammy" in multiple ways, including lacking a specific functionality, having an unrelated description or keywords, and being published several times and across diverse categories. Market operators maintain anti-spam policies, and apps are removed through continuous human intervention. Through a systematic crawl of a popular app market and by identifying a set of removed apps, we propose a method to detect spam apps solely using app metadata available at the time of publication. We first propose a methodology to manually label a sample of removed apps according to a set of checkpoint heuristics that reveal the reasons behind removal. This analysis suggests that approximately 35% of the removed apps are very likely to be spam apps. We then map the identified heuristics to several quantifiable features and show how discriminative these features are for spam apps. Finally, we build an Adaptive Boost classifier for early identification of spam apps using only the metadata of the apps. Our classifier achieves an accuracy over 95% with precision varying between 85%-95% and recall varying between 38%-98%. By applying the classifier to a set of apps present at the app market during our crawl, we estimate that at least 2.7% of them are spam apps.
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance BIBAFull-Text 960-970
  Masumi Shirakawa; Takahiro Hara; Shojiro Nishio
This paper first reveals the relationship between Inverse Document Frequency (IDF), a global term weighting scheme, and information distance, a universal metric defined by Kolmogorov complexity. We give a concrete theoretical explanation: the IDF of a term equals the distance between the term and the empty string in the space of information distance, in which Kolmogorov complexity is approximated using Web documents and Shannon-Fano coding. Based on these findings, we propose N-gram IDF, a theoretical extension of IDF for handling words and phrases of any length. By comparing weights among N-grams of any N, N-gram IDF enables us to determine dominant N-grams among overlapping ones and to extract key terms of any length from texts without using any NLP techniques. To efficiently compute the weight of all possible N-grams, we adopt two string processing techniques: maximal substring extraction using an enhanced suffix array and document listing using a wavelet tree. We conducted experiments on key term extraction and Web search query segmentation, and found that N-gram IDF was competitive with state-of-the-art methods that were designed for each application using additional resources and effort. These results exemplify the potential of N-gram IDF.
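For readers unfamiliar with the baseline being extended, here is a toy sketch of plain document-frequency IDF applied uniformly to n-grams of every length. The paper's N-gram IDF is grounded in information distance and computed with suffix arrays and wavelet trees; this stand-in only illustrates the weighting idea, and the corpus is made up:

```python
import math

def ngrams(tokens, max_n):
    # every contiguous n-gram of length 1..max_n
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def ngram_idf(corpus, max_n=3):
    """idf(g) = log(N / df(g)) for every n-gram g up to length max_n."""
    N = len(corpus)
    df = {}
    for doc in corpus:
        for g in set(ngrams(doc.split(), max_n)):
            df[g] = df.get(g, 0) + 1
    return {g: math.log(N / d) for g, d in df.items()}

idf = ngram_idf(["new york city", "new york times", "london city"])
# "new york" appears in 2 of the 3 documents:
print(round(idf[("new", "york")], 3))  # log(3/2) = 0.405
```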
Query Suggestion and Data Fusion in Contextual Disambiguation BIBAFull-Text 971-980
  Milad Shokouhi; Marc Sloan; Paul N. Bennett; Kevyn Collins-Thompson; Siranush Sarkizova
Queries issued to a search engine are often under-specified or ambiguous. The user's search context or background may provide information that disambiguates their information need in order to automatically predict and issue a more effective query. The disambiguation can take place at different stages of the retrieval process. For instance, contextual query suggestions may be computed and recommended to users on the result page when appropriate, an approach that does not require modifying the original query's results. Alternatively, the search engine can attempt to provide efficient access to new relevant documents by injecting these documents directly into search results based on the user's context.
   In this paper, we explore these complementary approaches and how they might be combined. We first develop a general framework for mining context-sensitive query reformulations for query suggestion. We evaluate our context-sensitive suggestions against a state-of-the-art baseline using a click-based metric. The resulting query suggestions generated by our approach outperform the baseline by 13% overall and by 16% on an ambiguous query subset.
   While the query suggestions generated by our approach have higher quality than the existing baselines, we demonstrate that using them naively for injecting new documents into search results can lead to inferior rankings. To remedy this issue, we develop a classifier that decides when to inject new search results using features based on suggestion quality and user context. We show that our context-sensitive result fusion approach (Corfu) improves retrieval quality for ambiguous queries by up to 2.92%. Our approaches can efficiently scale to massive search logs, enabling a data-driven strategy that benefits from observing how users issue and reformulate queries in different contexts.
Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment BIBAFull-Text 981-991
  Anshumali Shrivastava; Ping Li
Minwise hashing (Minhash) is a widely used indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., the inner product between binary vectors) or set containment. Minhash has an inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing (MH-ALSH) as a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditional minhash towards smaller sets, making the final "collision probability" monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieval with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available high-dimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near-neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.
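The bias discussed above can be reproduced in a few lines: vanilla minhash estimates the resemblance |A∩B|/|A∪B|, so a small set fully contained in a large one still scores low. A hedged sketch using Python's built-in hash as a stand-in for random permutations (not the paper's MH-ALSH):

```python
import random

def minhash_signature(s, seeds):
    # one minhash per seed: the minimum of a seeded hash over the set
    return [min(hash((seed, x)) for x in s) for seed in seeds]

def estimated_resemblance(a, b, num_hashes=1000, seed=0):
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(num_hashes)]
    sa = minhash_signature(a, seeds)
    sb = minhash_signature(b, seeds)
    # fraction of matching minhashes estimates the Jaccard resemblance
    return sum(x == y for x, y in zip(sa, sb)) / num_hashes

# A small set fully contained in a large one: the inner product is
# |A| = 5 (maximal containment), yet resemblance is only 5/50 = 0.1,
# so a plain minhash index under-ranks this pair.
A, B = set(range(5)), set(range(50))
print(estimated_resemblance(A, B))  # close to 0.1, not 1.0
```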
Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning BIBAFull-Text 992-1002
  Edwin D. Simpson; Matteo Venanzi; Steven Reece; Pushmeet Kohli; John Guiver; Stephen J. Roberts; Nicholas R. Jennings
Social media has led to the democratisation of opinion sharing. A wealth of information about public opinions, current events, and authors' insights into specific topics can be gained by understanding the text written by users. However, there is wide variation in the language used by different authors in different contexts on the web. This diversity in language makes interpretation an extremely challenging task. Crowdsourcing presents an opportunity to interpret the sentiment, or topic, of free text. However, the subjectivity and bias of human interpreters raise challenges in inferring the semantics expressed by the text. To overcome this problem, we present a novel Bayesian approach to language understanding that relies on aggregated crowdsourced judgements. Our model encodes the relationships between labels and text features in documents, such as tweets, web articles, and blog posts, accounting for the varying reliability of human labellers. It allows inference of annotations that scales to arbitrarily large pools of documents. Our evaluation using two challenging crowdsourcing datasets shows that by efficiently exploiting language models learnt from aggregated crowdsourced labels, we can provide up to 25% improved classifications when only a small portion (less than 4%) of documents has been labelled. Compared to six state-of-the-art methods, we reduce the number of crowd responses required to achieve comparable accuracy by up to 67%. Our method was a joint winner of the CrowdFlower -- CrowdScale 2013 Shared Task challenge at the conference on Human Computation and Crowdsourcing (HCOMP 2013).
HypTrails: A Bayesian Approach for Comparing Hypotheses About Human Trails on the Web BIBAFull-Text 1003-1013
  Philipp Singer; Denis Helic; Andreas Hotho; Markus Strohmaier
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music playlists. Understanding the factors that drive the production of these trails can be useful for, e.g., improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors to the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaptation of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music playlists. Our work expands the repertoire of methods available for studying human trails on the Web.
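The prior-sensitivity idea can be illustrated with the standard Dirichlet-multinomial marginal likelihood for Markov transition counts: a hypothesis whose prior concentrates mass where transitions actually occur earns higher evidence. This is a minimal stand-in, not the HypTrails implementation, and the toy counts and hypothesis priors are made up:

```python
from math import lgamma

def log_evidence(counts, alpha):
    """Log marginal likelihood of Markov transition counts under a
    Dirichlet prior: one Dirichlet-multinomial term per source state."""
    total = 0.0
    for row, a in zip(counts, alpha):
        total += lgamma(sum(a)) - lgamma(sum(a) + sum(row))
        total += sum(lgamma(ai + ni) - lgamma(ai) for ai, ni in zip(a, row))
    return total

# Observed trails over two states strongly favour self-transitions.
counts = [[9, 1], [1, 9]]
h_stay = [[10.0, 1.0], [1.0, 10.0]]   # hypothesis: users stay put
h_uniform = [[1.0, 1.0], [1.0, 1.0]]  # hypothesis: all moves equally likely
print(log_evidence(counts, h_stay) > log_evidence(counts, h_uniform))  # True
```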
Exploiting Collective Hidden Structures in Webpage Titles for Open Domain Entity Extraction BIBAFull-Text 1014-1024
  Wei Song; Shiqi Zhao; Chao Zhang; Hua Wu; Haifeng Wang; Lizhen Liu; Hanshi Wang
We present a novel method for open domain named entity extraction by exploiting the collective hidden structures in webpage titles. Our method uncovers the hidden textual structures shared by sets of webpage titles based on generalized URL patterns and a multiple sequence alignment technique. The highlights of our method include: 1) The boundaries of entities can be identified automatically in a collective way, without any manually designed pattern, seed or class name. 2) The connections between entities are also discovered naturally based on the hidden structures, which makes it easy to incorporate distant or weak supervision. The experiments show that our method can harvest open domain entities at large scale with high precision. A large fraction of the extracted entities are long-tailed and complex and cover diverse topics. Given the extracted entities and their connections, we further show the effectiveness of our method in a weakly supervised setting. Our method produces better domain-specific entities, in both precision and recall, than the state-of-the-art approaches.
ROCKER: A Refinement Operator for Key Discovery BIBAFull-Text 1025-1033
  Tommaso Soru; Edgard Marx; Axel-Cyrille Ngonga Ngomo
The Linked Data principles provide a decentralized approach for publishing structured data in the RDF format on the Web. In contrast to structured data published in relational databases, where a key is often provided explicitly, finding a set of properties that allows identifying a resource uniquely is a non-trivial task. Still, finding keys is of central importance for manifold applications such as resource deduplication, link discovery, logical data compression and data integration. In this paper, we address this research gap by specifying a refinement operator, dubbed ROCKER, which we prove to be finite, proper and non-redundant. We combine the theoretical characteristics of this operator with two monotonicities of keys to obtain a time-efficient approach for detecting keys, i.e., sets of properties that describe resources uniquely. We then utilize a hash index to compute the discriminability score efficiently. Therewith, we ensure that our approach can scale to very large knowledge bases. Results show that ROCKER yields more accurate results, has a comparable runtime, and consumes less memory compared with existing state-of-the-art techniques.
Random Walk TripleRush: Asynchronous Graph Querying and Sampling BIBAFull-Text 1034-1044
  Philip Stutz; Bibek Paudel; Mihaela Verman; Abraham Bernstein
Most Semantic Web applications rely on querying graphs, typically by using SPARQL with a triple store. Increasingly, applications also analyze properties of the graph structure to compute statistical inferences. The current Semantic Web infrastructure, however, does not efficiently support such operations. This forces developers to extract the relevant data for external statistical post-processing. In this paper we propose to rethink query execution in a triple store as a highly parallelized asynchronous graph exploration on an active index data structure. This approach also allows integrating SPARQL querying with the sampling of graph properties.
   To evaluate this architecture, we implemented Random Walk TripleRush, which is built on a distributed graph processing system. Our evaluations show that this architecture enables both competitive graph querying and the execution of various types of random walks with restarts that sample interesting graph properties. Thanks to the asynchronous architecture, first results are sometimes returned in a fraction of the full execution time. We also evaluate scalability and show that the architecture supports fast query times on a dataset with more than a billion triples.
Open Domain Question Answering via Semantic Enrichment BIBAFull-Text 1045-1055
  Huan Sun; Hao Ma; Wen-tau Yih; Chen-Tse Tsai; Jingjing Liu; Ming-Wei Chang
Most recent question answering (QA) systems query large-scale knowledge bases (KBs) to answer a question, after parsing and transforming natural language questions into KB-executable forms (e.g., logical forms). However, KBs are far from complete, so the information required to answer questions may not always exist in them. In this paper, we develop a new QA system that mines answers directly from the Web, while employing KBs as a significant auxiliary to further boost QA performance. Specifically, to the best of our knowledge, we make the first attempt to link answer candidates to entities in Freebase during answer candidate generation. Several remarkable advantages follow: (1) Redundancy among answer candidates is automatically reduced. (2) The types of an answer candidate can be effortlessly determined by those of its corresponding entity in Freebase. (3) Capitalizing on the rich information about entities in Freebase, we can develop semantic features for each answer candidate after linking them to Freebase. In particular, we construct answer-type related features with two novel probabilistic models, which directly evaluate the appropriateness of an answer candidate's types under a given question. Overall, such semantic features turn out to play significant roles in determining the true answers from the large answer candidate pool. The experimental results show that across two testing datasets, our QA system achieves an 18%-54% improvement under the F_1 metric, compared with various existing QA systems.
All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement BIBAFull-Text 1056-1066
  Chenhao Tan; Lillian Lee
Although analyzing user behavior within individual communities is an active and rich research domain, people usually interact with multiple communities both on- and off-line. How do users act in such multi-community environments? Although there are a host of intriguing aspects to this question, it has received much less attention in the research community in comparison to the intra-community case. In this paper, we examine three aspects of multi-community engagement: the sequence of communities that users post to, the language that users employ in those communities, and the feedback that users receive, using longitudinal posting behavior on Reddit as our main data source, and DBLP for auxiliary experiments. We also demonstrate the effectiveness of features drawn from these aspects in predicting users' future level of activity. One might expect that a user's trajectory mimics the "settling-down" process in real life: an initial exploration of sub-communities before settling down into a few niches. However, we find that the users in our data continually post in new communities; moreover, as time goes on, they post increasingly evenly among a more diverse set of smaller communities. Interestingly, it seems that users that eventually leave the community are "destined" to do so from the very beginning, in the sense of showing significantly different "wandering" patterns very early on in their trajectories; this finding has potentially important design implications for community maintainers. Our multi-community perspective also allows us to investigate the "situation vs. personality" debate from language usage across different communities.
LINE: Large-scale Information Network Embedding BIBAFull-Text 1067-1077
  Jian Tang; Meng Qu; Mingzhe Wang; Ming Zhang; Jun Yan; Qiaozhu Mei
This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale to real-world information networks, which usually contain millions of nodes. In this paper, we propose a novel network embedding method called "LINE", which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edge-sampling algorithm is proposed that addresses the limitations of classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments demonstrate the effectiveness of LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient: it is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of LINE is available online: https://github.com/tangjianpku/LINE.
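The edge-sampling step mentioned above amounts to drawing edges with probability proportional to their weights; a common way to do this in O(1) per draw is Walker's alias method. The sketch below illustrates that building block under this assumption and is not the LINE source code:

```python
import random

def build_alias(weights):
    """O(n) setup for O(1) weighted sampling (Walker's alias method)."""
    n, total = len(weights), sum(weights)
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                 # top up the light bucket from a heavy one
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def sample(prob, alias, rng):
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Edge 1 carries 3x the weight of edge 0, so it should be drawn ~75% of the time.
rng = random.Random(42)
prob, alias = build_alias([1.0, 3.0])
frac = sum(sample(prob, alias, rng) for _ in range(10000)) / 10000
print(frac)  # close to 0.75
```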
Leveraging Pattern Semantics for Extracting Entities in Enterprises BIBAFull-Text 1078-1088
  Fangbo Tao; Bo Zhao; Ariel Fuxman; Yang Li; Jiawei Han
Entity Extraction is the process of identifying meaningful entities in text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging in enterprise domains for several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE ineffective: using only high-precision/low-recall patterns, as those systems do, would miss the majority of sparse enterprise entities, while using more low-precision patterns in this sparse setting drastically introduces noise. Second, semantic drift is common in enterprises ("Blue" refers to "Windows Blue"), so that public signals from the web cannot be directly applied to entities. Moreover, many internal entities never appear on the web, and sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, which takes an enterprise corpus and limited seeds as input and generates a high-quality entity collection as output. We introduce the novel concept of the Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.
Density-friendly Graph Decomposition BIBAFull-Text 1089-1099
  Nikolaj Tatti; Aristides Gionis
Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows analyzing a graph beyond its mere degree distribution. More specifically, it is used to identify areas of the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph.
   Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study.
   We start by defining what it means for a subgraph to be locally dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G = (V, E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|^2|E|), but it is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation of the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation; however, as demonstrated by our experimental evaluation, in practice k-cores have a different structure than locally-dense subgraphs, and, as predicted by the theory, k-cores are not always well aligned with graph density.
Crowd Fraud Detection in Internet Advertising BIBAFull-Text 1100-1110
  Tian Tian; Jun Zhu; Fen Xia; Xin Zhuang; Tong Zhang
The rise of crowdsourcing brings new types of malpractice in Internet advertising. One can easily hire web workers through malicious crowdsourcing platforms to attack other advertisers. Such human-generated crowd fraud is hard to detect by conventional fraud detection methods. In this paper, we carefully examine the characteristics of the group behaviors of crowd fraud and identify three persistent patterns: moderateness, synchronicity and dispersivity. We then propose an effective crowd fraud detection method for search engine advertising based on these patterns, which consists of a constructing stage, a clustering stage and a filtering stage. At the constructing stage, we remove irrelevant data and reorganize the click logs into a surfer-advertiser inverted list; at the clustering stage, we define the sync-similarity between surfers' click histories and transform coalition detection into a clustering problem, solved by a nonparametric algorithm; finally, we build a dispersity filter to remove false-alarm clusters. The nonparametric nature of our method ensures that we can find an unbounded number of coalitions with nearly no human interaction. We also provide a parallel solution to make the method scalable to Web data, and we conduct extensive experiments. The empirical results demonstrate that our method is accurate and scalable.
Provably Fast Inference of Latent Features from Networks: with Applications to Learning Social Circles and Multilabel Classification BIBAFull-Text 1111-1121
  Charalampos Tsourakakis
A well known phenomenon in social networks is homophily, the tendency of agents to connect with similar agents. A derivative of this phenomenon is the emergence of communities. Another phenomenon observed in numerous networks is the existence of certain agents that belong simultaneously to multiple communities. An understanding of these phenomena constitutes a central research topic of network science.
   In this work we focus on a fundamental theoretical question related to the above phenomena, with various applications: given an undirected graph G, can we efficiently infer the latent vertex features that explain the observed network structure under the assumption of a generative model that exhibits homophily? We propose a probabilistic generative model with the property that the probability of an edge between two vertices is a non-decreasing function of the common features they possess. This property holds for many real-world networks and, surprisingly, is ignored by many popular overlapping community detection methods, as shown recently by the empirical work of Yang and Leskovec [44]. Our main theoretical contribution is the first provably rapidly mixing Markov chain for inferring latent features. On the experimental side, we verify the efficiency of our method in terms of run times, observing that it significantly outperforms state-of-the-art methods. Our method is more than 2,400 times faster than a state-of-the-art machine learning method [37] and typically provides non-trivial speedups compared to BigClam [43]. Furthermore, we verify on real data with available ground truth that our method efficiently learns high-quality labelings. We use our method to learn social circles from Twitter ego-networks and to perform multilabel classification.
The K-clique Densest Subgraph Problem BIBAFull-Text 1122-1132
  Charalampos Tsourakakis
Numerous graph mining applications rely on detecting subgraphs which are large near-cliques. Since formulations geared towards finding large near-cliques are hard and frequently inapproximable due to connections with the Maximum Clique problem, the poly-time solvable densest subgraph problem, which maximizes the average degree over all possible subgraphs, "lies at the core of large scale data mining" [10]. However, the densest subgraph problem frequently fails to detect large near-cliques in networks.
   In this work, we introduce the k-clique densest subgraph problem, k ≥ 2. This generalizes the well-studied densest subgraph problem, which is obtained as a special case for k=2. For k=3 we obtain a novel formulation, which we refer to as the triangle densest subgraph problem: given a graph G(V,E), find a subset of vertices S* such that τ(S*) = max_{S ⊆ V} t(S)/|S|, where t(S) is the number of triangles induced by the set S.
   On the theory side, we prove that for any constant k, there exists an exact polynomial-time algorithm for the k-clique densest subgraph problem. Furthermore, we propose an efficient 1/k-approximation algorithm which generalizes the greedy peeling algorithm of Asahiro and Charikar [8,18] for k=2. Finally, we show how to implement this peeling framework efficiently on MapReduce for any k ≥ 3, generalizing the work of Bahmani, Kumar and Vassilvitskii for the case k=2 [10]. On the empirical side, our two main findings are that (i) the triangle densest subgraph is consistently closer to being a large near-clique compared to the densest subgraph and (ii) the peeling approximation algorithms for both k=2 and k=3 achieve on real-world networks approximation ratios closer to 1 than the pessimistic 1/k guarantee. An interesting consequence of our work is that triangle counting, a well-studied computational problem in the context of social network analysis, can be used to detect large near-cliques. Finally, we evaluate our proposed method on a popular graph mining application.
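For k=2, the greedy peeling referenced above can be sketched as follows (a minimal illustrative version of the 1/2-approximation for the average-degree densest subgraph, not the authors' MapReduce implementation; the example graph is made up):

```python
import heapq
from collections import defaultdict

def densest_subgraph_peel(edges):
    """Greedy 1/2-approximation for the densest subgraph (edges/vertices
    objective): repeatedly peel the minimum-degree vertex and keep the
    best density seen along the way."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(n) for v, n in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    alive, m = set(adj), len(edges)
    best = m / len(alive)
    while heap:
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue  # stale entry
        alive.remove(v)
        m -= deg[v]
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
        if alive:
            best = max(best, m / len(alive))
    return best

# A 4-clique (density 6/4 = 1.5) with a path attached drags the whole
# graph's density down to 8/6, but peeling recovers the clique's density.
K4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(densest_subgraph_peel(K4 + [(3, 4), (4, 5)]))  # 1.5
```

The paper's k-clique generalization replaces vertex degrees with per-vertex k-clique counts (e.g., triangle counts for k=3) in the same peeling loop.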
GERBIL: General Entity Annotator Benchmarking Framework BIBAFull-Text 1133-1143
  Ricardo Usbeck; Michael Röder; Axel-Cyrille Ngonga Ngomo; Ciro Baron; Andreas Both; Martin Brümmer; Diego Ceccarelli; Marco Cornolti; Didier Cherix; Bernd Eickmann; Paolo Ferragina; Christiane Lemke; Andrea Moro; Roberto Navigli; Francesco Piccinno; Giuseppe Rizzo; Harald Sack; René Speck; Raphaël Troncy; Jörg Waitelonis; Lars Wesemann
We present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, GERBIL provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in a machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Finally, the tool diagnostics provided by GERBIL allow deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to select the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable, objective evaluation results.
An Optimization Framework for Weighting Implicit Relevance Labels for Personalized Web Search BIBAFull-Text 1144-1154
  Yury Ustinovskiy; Gleb Gusev; Pavel Serdyukov
Implicit feedback from users of a web search engine is an essential source of consistent personal relevance labels from the actual population of users. However, previous studies on personalized search employ this source in a rather straightforward manner. Basically, documents that were clicked on get the maximal gain, and the rest of the documents are assigned zero gain. As we demonstrate in our paper, a ranking algorithm trained using these gains directly as the ground-truth relevance labels leads to a suboptimal personalized ranking.
   In this paper we develop a framework for automatically reweighting these labels. Our approach is based on more subtle aspects of user interaction with the result page. We propose an efficient methodology for deriving confidence levels for relevance labels that relies directly on the objective ranking measure. All our algorithms are evaluated on a large-scale query log provided by a major commercial search engine. The results of the experiments show that current state-of-the-art personalization approaches can be significantly improved by enriching relevance grades with weights extracted from post-impression user behavior.
A First Look at Tribal Web Traffic BIBAFull-Text 1155-1165
  Morgan Vigil; Matthew Rantanen; Elizabeth Belding
With broadband penetration rates of less than 10% per capita, Tribal areas in the U.S. represent some of the most underserved communities in terms of Internet access. Although numerous sources have identified this digital divide, there have been no empirical measurements of the performance and usage of services that do exist in these areas. In this paper, we present the characterization of the Tribal Digital Village (TDV) network, a multi-hop wireless network currently connecting 13 reservations in San Diego county. This work represents the first traffic analysis of broadband usage in Tribal lands. After identifying some of the unique purposes of broadband connectivity in indigenous communities, such as language revitalization and cultural development, we focus on the performance of popular applications that enable such activities, including YouTube and Instagram. Though only a fraction of the bandwidth capacity is actually used, 30% of YouTube uploads and 24% of Instagram uploads fail due to packet loss on the relay and access links that connect the reservations to the TDV backbone. Although failure rates are prohibitive to the contribution of locally generated media (particularly videos), our analysis of Instagram media interactions and engagement in the TDV network reveals a high locality of interest. Residents engage with locally created media 8.2 times more than media created by outside sources. Furthermore, locally created media circulates through the network two days longer than non-local media. The results of our analysis point to new directions for increasing content availability on reservations.
A Weighted Correlation Index for Rankings with Ties BIBAFull-Text 1166-1176
  Sebastiano Vigna
Understanding the correlation between two different scores for the same set of items is a common problem in graph analysis and information retrieval. The most commonly used statistic that quantifies this correlation is Kendall's tau; however, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular when comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in social networks and web graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an O(n log n) algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data on social networks and web graphs.
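The idea of a rank-weighted correlation that tolerates ties can be sketched naively in O(n^2); the additive hyperbolic pair weight below is an illustrative choice (not necessarily the paper's exact definition, and the paper's algorithm runs in O(n log n)):

```python
def weighted_kendall(x, y, weight=lambda i, j: 1.0 / (i + 1) + 1.0 / (j + 1)):
    """Naive O(n^2) sketch of a rank-weighted Kendall correlation with ties.

    x and y are score lists for the same n items. Pairs tied in either
    list contribute nothing (one simple convention); the pair weight
    emphasizes disagreements among the top-ranked items.
    """
    n = len(x)
    # Rank items by x so that high scores get small (important) indices.
    by_x = sorted(range(n), key=lambda k: -x[k])
    rank = {item: r for r, item in enumerate(by_x)}
    num = den = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            dx, dy = x[a] - x[b], y[a] - y[b]
            if dx == 0 or dy == 0:
                continue  # tie in at least one ranking: no contribution
            w = weight(rank[a], rank[b])
            num += w if dx * dy > 0 else -w
            den += w
    return num / den if den else 0.0
```

Identical orderings score 1.0, reversed orderings -1.0, and tied pairs simply drop out of both numerator and denominator.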
Gathering Additional Feedback on Search Results by Multi-Armed Bandits with Respect to Production Ranking BIBAFull-Text 1177-1187
  Aleksandr Vorobev; Damien Lefortier; Gleb Gusev; Pavel Serdyukov
Given a repeatedly issued query and a document with a not-yet-confirmed potential to satisfy the users' needs, a search system should place this document on a high position in order to gather user feedback and obtain a more confident estimate of the document utility. On the other hand, the main objective of the search system is to maximize expected user satisfaction over a rather long period, which requires showing more relevant documents on average. The state-of-the-art approaches to solving this exploration-exploitation dilemma rely on strongly simplified settings, making these approaches infeasible in practice. We improve the most flexible and pragmatic of them to handle two practical issues: utilizing prior information about queries and documents, and combining bandit-based learning approaches with a default production ranking algorithm. We show experimentally that our framework significantly improves the ranking of a leading commercial search engine.
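The exploration-exploitation trade-off described above can be sketched with a UCB-style bonus layered on top of a production score; the blending weights and field names here are hypothetical, and the paper's actual bandit framework is more elaborate:

```python
import math

def ucb_rerank(docs, t, c=1.0, alpha=0.5):
    """Order documents by a blend of the production ranker's score and the
    observed click-through rate, plus an exploration bonus that shrinks as
    feedback accumulates.

    docs: list of dicts with hypothetical keys 'prod_score', 'clicks',
    'views'; t is the number of impressions served so far.
    """
    def score(d):
        if d['views'] == 0:
            return float('inf')  # unseen documents are tried first
        ctr = d['clicks'] / d['views']
        bonus = c * math.sqrt(math.log(t) / d['views'])
        return alpha * d['prod_score'] + (1 - alpha) * ctr + bonus
    return sorted(docs, key=score, reverse=True)
```

A document with little feedback is promoted for exploration; as its view count grows, the bonus vanishes and the ranking falls back to the exploitation terms.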
The E-Commerce Market for "Lemons": Identification and Analysis of Websites Selling Counterfeit Goods BIBAFull-Text 1188-1197
  John Wadleigh; Jake Drew; Tyler Moore
We investigate the practice of websites selling counterfeit goods. We inspect web search results for 225 queries across 25 brands. We devise a binary classifier that predicts whether a given website is selling counterfeits by examining automatically extracted features such as WHOIS information, pricing and website content. We then apply the classifier to results collected between January and August 2014. We find that, overall, 32% of search results point to websites selling fakes. For 'complicit' search terms, such as "replica Rolex", 39% of the search results point to fakes, compared to 20% for 'innocent' terms, such as "hermes buy online". Using a linear regression, we find that brands with a higher street price for fakes have higher incidence of counterfeits in search results, but that brands that take active countermeasures such as filing DMCA requests experience lower incidence of counterfeits in search results. Finally, we study how the incidence of counterfeits evolves over time, finding that the fraction of search results pointing to fakes remains remarkably stable.
Concept Expansion Using Web Tables BIBAFull-Text 1198-1208
  Chi Wang; Kaushik Chakrabarti; Yeye He; Kris Ganjam; Zhimin Chen; Philip A. Bernstein
We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the "exclusive" tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.
User Latent Preference Model for Better Downside Management in Recommender Systems BIBAFull-Text 1209-1219
  Jian Wang; David Hardtke
Downside management is an important topic in the field of recommender systems. User satisfaction increases when good items are recommended, but satisfaction drops significantly when bad recommendations are pushed to them. For example, a parent would be disappointed if violent movies are recommended to their kids and may stop using the recommendation system entirely. A vegetarian would find steak-house recommendations useless. A CEO in a mid-sized company would feel offended by receiving intern-level job recommendations. Under circumstances where there is penalty for a bad recommendation, a bad recommendation is worse than no recommendation at all. While most existing work focuses on upside management (recommending the best items to users), this paper emphasizes achieving better downside management (reducing the recommendation of irrelevant or offensive items to users). The approach we propose is general and can be applied to any scenario or domain where downside management is key to the system.
   To tackle the problem, we design a user latent preference model to predict the user preference in a specific dimension, say, the dietary restrictions of the user, the acceptable level of adult content in a movie, or the geographical preference of a job seeker. We propose to use multinomial regression as the core model and extend it with a hierarchical Bayesian framework to address the problem of data sparsity. After the user latent preference is predicted, we leverage it to filter out downside items. We validate the soundness of our approach by evaluating it with an anonymous job application dataset on LinkedIn. The effectiveness of the latent preference model was demonstrated in both offline experiments and online A/B tests. The user latent preference model helps to improve the VPI (views per impression) and API (applications per impression) significantly, which in turn yields higher user satisfaction.
The Role of Data Cap in Optimal Two-part Network Pricing BIBAFull-Text 1220-1230
  Xin Wang; Richard T. B. Ma; Yinlong Xu
Internet services are traditionally priced at flat rates; however, many Internet service providers (ISPs) have recently shifted towards two-part tariffs where a data cap is imposed to restrain data demand from heavy users and usage over the data cap is charged based on a per-unit fee. Although the two-part tariff could generally increase the revenue for ISPs and has been supported by the FCC chairman, the role of the data cap and its revenue-optimal and welfare-optimal pricing structures are not well understood. In this paper, we study the impact of the data cap on optimal two-part pricing schemes for congestion-prone service markets, e.g., broadband or cloud services. We model users' demand and preferences over pricing and congestion alternatives and derive the market share and congestion of service providers under a market equilibrium. Based on the equilibrium model, we characterize the two-part structures of the revenue-optimal and welfare-optimal pricing schemes. Our results reveal that 1) the data cap provides a mechanism for ISPs to transition from flat-rate to pay-as-you-go type schemes, 2) with growing data demand and network capacity, the revenue-optimal pricing moves towards usage-based schemes with diminishing data caps, and 3) the welfare-optimal tariff comprises lower fees and a lower data cap than those of the revenue-optimal counterpart, suggesting that regulators might want to promote usage-based pricing but regulate the per-unit fees. Our results could help providers design revenue-optimal pricing schemes and guide regulatory authorities to legislate desirable regulations.
Tweeting Cameras for Event Detection BIBAFull-Text 1231-1241
  Yuhui Wang; Mohan S. Kankanhalli
We are living in a world of big sensor data. Due to the widespread prevalence of visual sensors (e.g. surveillance cameras) and social sensors (e.g. Twitter feeds), many events are implicitly captured in real-time by such heterogeneous "sensors". Combining these two complementary sensor streams can significantly improve the task of event detection and aid in comprehending evolving situations. However, the different characteristics of these social and sensor data make such information fusion for event detection a challenging problem. To tackle this problem, we propose an innovative multi-layer tweeting camera framework integrating both physical sensors and social sensors to detect various concepts of real-world events. In this framework, visual concept detectors are applied on camera video frames and these concepts can be construed as "camera tweets" posted regularly. These tweets are represented by a unified probabilistic spatio-temporal (PST) data structure which is then aggregated into a concept-based image (Cmage) as the common representation for visualization. To facilitate event analysis, we define a set of operators and analytic functions that can be applied on the PST data by the user to discover occurrences of events and to analyse evolving situations. We further leverage geo-located social media data by mining current topics discussed on Twitter to obtain the high-level semantic meaning of detected events in images. We quantitatively evaluate our framework with a large-scale dataset containing images from 150 New York real-time traffic CCTV cameras, university foodcourt camera feeds and Twitter data, which demonstrates the feasibility and effectiveness of the proposed framework. Results of combining camera tweets and social tweets are shown to be promising for detecting real-world events.
Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia BIBAFull-Text 1242-1252
  Robert West; Ashwin Paranjape; Jure Leskovec
Hyperlinks are an essential feature of the World Wide Web. They are especially important for online encyclopedias such as Wikipedia: an article can often only be understood in the context of related articles, and hyperlinks make it easy to explore this context. But important links are often missing, and several methods have been proposed to alleviate this problem by learning a linking model based on the structure of the existing links. Here we propose a novel approach to identifying missing links in Wikipedia. We build on the fact that the ultimate purpose of Wikipedia links is to aid navigation. Rather than merely suggesting new links that are in tune with the structure of existing links, our method finds missing links that would immediately enhance Wikipedia's navigability. We leverage data sets of navigation paths collected through a Wikipedia-based human-computation game in which users must find a short path from a start to a target article by only clicking links encountered along the way. We harness human navigational traces to identify a set of candidates for missing links and then rank these candidates. Experiments show that our procedure identifies missing links of high quality.
Semantic Annotation of Mobility Data using Social Media BIBAFull-Text 1253-1263
  Fei Wu; Zhenhui Li; Wang-Chien Lee; Hongjian Wang; Zhuojie Huang
Recent developments in sensors, GPS and smart phones have provided us with a large amount of mobility data. At the same time, large-scale crowd-generated social media data, such as geo-tagged tweets, provide rich semantic information about locations and events. Combining the mobility data and surrounding social media data enables us to semantically understand why a person travels to a location at a particular time (e.g., attending a local event or visiting a point of interest). Previous research on mobility data mining has been mainly focused on mining patterns using only the mobility data. In this paper, we study the problem of using social media to annotate mobility data. As social media data is often noisy, the key research problem lies in using the right model to retrieve only the relevant words with respect to a mobility record. We propose a frequency-based method, a Gaussian mixture model, and kernel density estimation (KDE) to tackle this problem. We show that KDE is the most suitable model as it captures the locality of word distribution very well. We test our proposal using a real dataset collected from Twitter and demonstrate the effectiveness of our techniques via both interesting case studies and a comprehensive evaluation.
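Why KDE captures locality can be seen in a toy sketch: score a candidate word at a mobility record's location by the kernel density of the word's geo-tagged occurrences (the coordinates, bandwidth, and data below are illustrative, not the paper's setup):

```python
import math

def kde_score(word_points, query, bandwidth=0.1):
    """Gaussian kernel density of a word's geo-tagged occurrences,
    evaluated at the query location.

    word_points: list of (x, y) positions where the word was observed.
    Words concentrated near the record score high; words spread
    uniformly score low everywhere.
    """
    qx, qy = query
    h2 = bandwidth ** 2
    total = sum(
        math.exp(-((x - qx) ** 2 + (y - qy) ** 2) / (2.0 * h2))
        for x, y in word_points
    )
    return total / (len(word_points) * 2.0 * math.pi * h2)
```

Annotating a record then amounts to ranking candidate words by this score at the record's location; a word used only near the record beats a word used all over the city.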
Automatic Web Content Extraction by Combination of Learning and Grouping BIBAFull-Text 1264-1274
  Shanchan Wu; Jerry Liu; Jian Fan
Web pages consist of not only actual content, but also other elements such as branding banners, navigational elements, advertisements, copyright etc. This noisy content is typically not related to the main subjects of the webpages. Identifying the actual content, or clipping web pages, has many applications, such as high quality web printing, e-reading on mobile devices and data mining. Although there are many existing methods attempting to address this task, most of them either work only on certain types of Web pages, e.g. article pages, or require different models for different websites. We formulate the actual-content identification problem as a DOM tree node selection problem. We develop multiple features by utilizing the DOM tree node properties to train a machine learning model. Then candidate nodes are selected based on the learning model. Based on the observation that the actual content is usually located in a spatially continuous block, we develop a grouping technique to further filter out noisy data and recover missing data for the candidate nodes. We conduct extensive experiments on a real dataset and demonstrate that our solution produces high-quality output and outperforms several baseline methods.
Executing Provenance-Enabled Queries over Web Data BIBAFull-Text 1275-1285
  Marcin Wylot; Philippe Cudre-Mauroux; Paul Groth
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Understanding Malvertising Through Ad-Injecting Browser Extensions BIBAFull-Text 1286-1295
  Xinyu Xing; Wei Meng; Byoungyoung Lee; Udi Weinsberg; Anmol Sheth; Roberto Perdisci; Wenke Lee
Malvertising is a malicious activity that leverages advertising to distribute various forms of malware. Because advertising is the key revenue generator for numerous Internet companies, large ad networks, such as Google, Yahoo and Microsoft, invest considerable effort in keeping malicious ads out of their networks. This drives adversaries to look for alternative methods to deploy malvertising. In this paper, we show that browser extensions that use ads as their monetization strategy often facilitate the deployment of malvertising. Moreover, while some extensions simply serve ads from ad networks that support malvertising, other extensions maliciously alter the content of visited webpages to force users into installing malware. To measure the extent of these behaviors we developed Expector, a system that automatically inspects and identifies browser extensions that inject ads, and then classifies these ads as malicious or benign based on their landing pages. Using Expector, we automatically inspected over 18,000 Chrome browser extensions. We found 292 extensions that inject ads, and detected 56 extensions that participate in malvertising using 16 different ad networks and with a total user base of 602,417.
E-commerce Reputation Manipulation: The Emergence of Reputation-Escalation-as-a-Service BIBAFull-Text 1296-1306
  Haitao Xu; Daiping Liu; Haining Wang; Angelos Stavrou
In online markets, a store's reputation is closely tied to its profitability. Sellers' desire to quickly achieve high reputation has fueled a profitable underground business, which operates as a specialized crowdsourcing marketplace and accumulates wealth by allowing online sellers to harness human laborers to conduct fake transactions for improving their stores' reputations. We term such an underground market a seller-reputation-escalation (SRE) market. In this paper, we investigate the impact of the SRE service on reputation escalation by performing in-depth measurements of the prevalence of the SRE service, the business model and market size of SRE markets, and the characteristics of sellers and offered laborers. To this end, we have infiltrated five SRE markets and studied their operations using daily data collection over a continuous period of two months. We identified more than 11,000 online sellers posting at least 219,165 fake-purchase tasks on the five SRE markets. These transactions earned at least $46,438 in revenue for the five SRE markets, and the total value of merchandise involved exceeded $3,452,530. Our study demonstrates that online sellers using an SRE service can increase their stores' reputations at least 10 times faster than legitimate ones while only 2.2% of them were detected and penalized. Even worse, we found a newly launched service that can, within a single day, boost a seller's reputation to a degree that would require a legitimate seller at least a year to accomplish. Finally, armed with our analysis of the operational characteristics of the underground economy, we offer some insights into potential mitigation strategies.
Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation BIBAFull-Text 1307-1317
  Da Yan; James Cheng; Yi Lu; Wilfred Ng
Massive graphs, such as online social networks and communication networks, have become common today. To efficiently analyze such large graphs, many distributed graph computing systems have been developed. These systems employ the "think like a vertex" programming paradigm, where a program proceeds in iterations and at each iteration, vertices exchange messages with each other. However, using Pregel's simple message passing mechanism, some vertices may send/receive significantly more messages than others due to either the high degree of these vertices or the logic of the algorithm used. This forms the communication bottleneck and leads to imbalanced workload among machines in the cluster. In this paper, we propose two effective message reduction techniques: (1) vertex mirroring with message combining, and (2) an additional request-respond API. These techniques not only reduce the total number of messages exchanged through the network, but also bound the number of messages sent/received by any single vertex. We theoretically analyze the effectiveness of our techniques, and implement them on top of our open-source Pregel implementation called Pregel+. Our experiments on various large real graphs demonstrate that our message reduction techniques significantly improve the performance of distributed graph computation.
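The effect of message combining can be sketched with a Pregel-style combiner that merges all messages bound for the same vertex before they cross the network; the sum combiner here is one illustrative choice (e.g. for PageRank-style aggregation), not the paper's full technique:

```python
def combine_messages(msgs, combine=lambda a, b: a + b):
    """Merge messages destined for the same vertex into one.

    msgs: iterable of (destination_vertex, value) pairs produced on one
    machine during a superstep. After combining, the machine sends at
    most one message per destination vertex instead of one per
    (sender, destination) pair, cutting network traffic and capping the
    messages any single high-degree vertex receives from this machine.
    """
    out = {}
    for dst, val in msgs:
        out[dst] = combine(out[dst], val) if dst in out else val
    return out
```

Combining only applies when the receiving algorithm needs an aggregate (sum, min, max) rather than the individual messages, which is why the paper pairs it with an additional request-respond API for the remaining cases.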
Tackling the Achilles Heel of Social Networks: Influence Propagation based Language Model Smoothing BIBAFull-Text 1318-1328
  Rui Yan; Ian E. H. Yen; Cheng-Te Li; Shiqi Zhao; Xiaohua Hu
Online social networks nowadays enjoy worldwide prosperity, as they have revolutionized the way for people to discover, to share, and to distribute information. With millions of registered users and the proliferation of user-generated contents, social networks have become "giants", seemingly capable of supporting any research task. However, the giants do have their Achilles Heel: extreme data sparsity. Compared with the massive data of the whole collection, individual posting documents (e.g., a microblog of fewer than 140 characters) seem too sparse to be distinguished under various research scenarios, even though they actually differ. In this paper we propose to tackle the Achilles Heel of social networks by smoothing the language model via influence propagation. We formulate a socialized factor graph model, which utilizes both the textual correlations between document pairs and the socialized augmentation networks behind the documents, such as user relationships and social interactions. These factors are modeled as attributes and dependencies among documents and their corresponding users. An efficient algorithm is designed to learn the proposed factor graph model. Finally we propagate term counts to smooth documents based on the estimated influence. Experimental results on Twitter and Weibo datasets validate the effectiveness of the proposed model. By leveraging the smoothed language model with social factors, our approach obtains significant improvement over several alternative methods on both intrinsic and extrinsic evaluations measured in terms of perplexity, nDCG and MAP results.
A Game Theoretic Model for the Formation of Navigable Small-World Networks BIBAFull-Text 1329-1339
  Zhi Yang; Wei Chen
Kleinberg proposed a family of small-world networks to explain the navigability of large-scale real-world social networks. However, the underlying mechanism that drives real networks to be navigable is not yet well understood. In this paper, we present a game theoretic model for the formation of navigable small-world networks. We model the network formation as a game in which people seek both high reciprocity and long-distance relationships. We show that the navigable small-world network is a Nash Equilibrium of the game. Moreover, we prove that the navigable small-world equilibrium tolerates collusions of any size and arbitrary deviations of a large random set of nodes, while non-navigable equilibria do not tolerate small group collusions or random perturbations. Our empirical evaluation further demonstrates that the system always converges to the navigable network even when limited or no information about other players' strategies is available. Our theoretical and empirical analyses provide important new insight into the connection between distance, reciprocity and navigability in social networks.
A Scalable Asynchronous Distributed Algorithm for Topic Modeling BIBAFull-Text 1340-1350
  Hsiang-Fu Yu; Cho-Jui Hsieh; Hyokun Yun; S. V. N. Vishwanathan; Inderjit S. Dhillon
Learning meaningful topic models with massive document collections which contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm F+Nomad LDA which simultaneously tackles both these problems. In order to handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. In order to distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of Yun et al., 2014. We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
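The Fenwick-tree trick from the abstract can be sketched directly: both updating a topic count and drawing a topic proportional to the counts take O(log T). The class below is a minimal illustration of that primitive, not the paper's exact F+ structure:

```python
class FenwickSampler:
    """Fenwick (binary indexed) tree over per-topic weights, giving
    O(log T) update() and O(log T) sample(), the costs quoted in the
    abstract."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        """Add delta to topic i's weight (e.g. when a topic count changes)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Sum of all weights (prefix sum over the whole array)."""
        i, s = self.n, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, u):
        """Map a uniform draw u in [0, 1) to a topic index chosen with
        probability proportional to its weight, by descending the tree."""
        r = u * self.total()
        pos, bit = 0, 1
        while bit << 1 <= self.n:
            bit <<= 1
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < r:
                r -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos  # 0-based topic index
```

The caller supplies the uniform draw (e.g. `random.random()`), which keeps the structure itself deterministic and easy to test.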
LightLDA: Big Topic Models on Modest Computer Clusters BIBAFull-Text 1351-1361
  Jinhui Yuan; Fei Gao; Qirong Ho; Wei Dai; Jinliang Wei; Xun Zheng; Eric Po Xing; Tie-Yan Liu; Wei-Ying Ma
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.
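The standard primitive behind O(1) proposal draws in Metropolis-Hastings samplers of this kind is Walker's alias method: after O(T) preprocessing of a fixed distribution, each draw costs O(1). This is a sketch of that primitive only; the paper's full sampler adds an acceptance test and periodic table refreshes:

```python
def build_alias(probs):
    """Walker's alias tables: O(n) preprocessing of a discrete
    distribution so that each later draw costs O(1)."""
    n = len(probs)
    total = sum(probs)
    scaled = [p * n / total for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    prob = [0.0] * n
    alias = [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]       # column s keeps part of its own mass...
        alias[s] = l              # ...and donates the rest to column l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:       # leftovers are exactly full columns
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, u):
    """O(1) draw from a uniform u in [0, 1): pick a column, then flip a
    biased coin between the column and its alias."""
    n = len(prob)
    i = int(u * n)
    frac = u * n - i
    return i if frac < prob[i] else alias[i]
```

With `probs = [0.1, 0.9]`, draws land on index 0 exactly when the uniform falls in a slice of total width 0.1, so the O(1) lookup reproduces the distribution exactly.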
A Novelty-Seeking based Dining Recommender System BIBAFull-Text 1362-1372
  Fuzheng Zhang; Kai Zheng; Nicholas Jing Yuan; Xing Xie; Enhong Chen; Xiaofang Zhou
The rapid growth of location-based services provides the potential to understand people's mobility patterns at an unprecedented level, which can also enable the food-service industry to accurately predict consumers' dining behavior. In this paper, by leveraging users' historical dining patterns, socio-demographic characteristics and restaurants' attributes, we aim at generating the top-K restaurants for a user's next dining occasion. Compared to previous studies in location prediction which mainly focus on regular mobility patterns, we present a novelty-seeking based dining recommender system, termed NDRS, in consideration of both exploration and exploitation. First, we apply a Conditional Random Field (CRF) with additional constraints to infer users' novelty-seeking statuses by considering both spatial-temporal-historical features and users' socio-demographic characteristics. On the one hand, when a user is predicted to be novelty-seeking, by incorporating the influence of restaurants' contextual factors such as price and service quality, we propose a context-aware collaborative filtering method to recommend restaurants she has never visited before. On the other hand, when a user is predicted not to be novelty-seeking, we present a Hidden Markov Model (HMM) considering temporal regularity to recommend previously visited restaurants. To evaluate the performance of each component as well as the whole system, we conduct extensive experiments with a large dataset we collected covering dining-related check-ins, users' demographics, and restaurants' attributes. The results reveal that our system is effective for dining recommendation.
Daily-Aware Personalized Recommendation based on Feature-Level Time Series Analysis BIBAFull-Text 1373-1383
  Yongfeng Zhang; Min Zhang; Yi Zhang; Guokun Lai; Yiqun Liu; Honghui Zhang; Shaoping Ma
Frequently changing user preferences and/or item profiles make the dynamic modeling of users and items essential in personalized recommender systems. However, due to the insufficiency of per user/item records when splitting the already sparse data across the time dimension, previous methods had to restrict the drifting purchasing patterns to pre-assumed distributions, and could hardly model them directly with, for example, time series analysis. Integrating content information helps to alleviate the problem in practical systems, but the domain-dependent content knowledge is expensive to obtain due to the large amount of manual effort required.
   In this paper, we make use of the large volume of textual reviews for the automatic extraction of domain knowledge, namely, the explicit features/aspects in a specific product domain. We thus shift from product-level modeling of user preferences, which suffers from a lack of data, to feature-level modeling, which not only grants us the ability to predict user preferences through direct time series analysis, but also allows us to know the essence beneath the surface of product-level changes in purchasing patterns. Besides, the expanded feature space also helps to make cold-start recommendations for users with few purchasing records.
   Technically, we develop the Fourier-assisted Auto-Regressive Integrated Moving Average (FARIMA) process to tackle the year-long seasonal period of purchasing data and achieve daily-aware preference predictions, and we leverage the conditional opportunity models for daily-aware personalized recommendation. Extensive experimental results on real-world cosmetic purchasing data from a major e-commerce website (JD.com) in China verified both the effectiveness and efficiency of our approach.
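The "Fourier-assisted" idea can be illustrated with the standard trick of encoding a year-long cycle as sinusoidal regressors; a linear fit on these absorbs the seasonality, and the de-seasonalized residuals can then feed an ARIMA-style model. The harmonic count and period here are illustrative choices, not the paper's exact FARIMA specification:

```python
import math

def fourier_features(day, period=365.25, harmonics=2):
    """Seasonal regressors for a daily series: an intercept plus one
    sin/cos pair per harmonic of the yearly cycle. Regressing the
    series on these features captures smooth year-long seasonality
    with just 2 * harmonics + 1 coefficients."""
    feats = [1.0]
    for h in range(1, harmonics + 1):
        ang = 2.0 * math.pi * h * day / period
        feats.extend([math.sin(ang), math.cos(ang)])
    return feats
```

At day 0 every sine term is 0 and every cosine term is 1, so the feature vector is easy to sanity-check by hand.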
Automatic Detection of Information Leakage Vulnerabilities in Browser Extensions BIBAFull-Text 1384-1394
  Rui Zhao; Chuan Yue; Qing Yi
A large number of extensions exist in browser vendors' online stores for millions of users to download and use. Many of those extensions process sensitive information from user inputs and webpages; however, it remains a big question whether those extensions may accidentally leak such sensitive information out of the browsers without protection. In this paper, we present a framework, LvDetector, that combines static and dynamic program analysis techniques for automatic detection of information leakage vulnerabilities in legitimate browser extensions. Extension developers can use LvDetector to locate and fix the vulnerabilities in their code; browser vendors can use LvDetector to decide whether the corresponding extensions can be hosted in their online stores; advanced users can also use LvDetector to determine if certain extensions are safe to use. The design of LvDetector is not bound to specific browsers or JavaScript engines, and can adopt other program analysis techniques. We implemented LvDetector and evaluated it on 28 popular Firefox and Google Chrome extensions. LvDetector identified 18 previously unknown information leakage vulnerabilities in 13 extensions with an 87% accuracy rate. The evaluation results and the feedback to our responsible disclosure demonstrate that LvDetector is useful and effective.
Enquiring Minds: Early Detection of Rumors in Social Media from Enquiry Posts BIBAFull-Text 1395-1405
  Zhe Zhao; Paul Resnick; Qiaozhu Mei
Many previous techniques identify trending topics in social media, even topics that are not pre-defined. We present a technique to identify trending rumors, which we define as topics that include disputed factual claims. Putting aside any attempt to assess whether the rumors are true or false, it is valuable to identify trending rumors as early as possible. It is extremely difficult to accurately classify whether every individual post is or is not making a disputed factual claim. We are able to identify trending rumors by recasting the problem as finding entire clusters of posts whose topic is a disputed factual claim.
   The key insight is that when there is a rumor, even though most posts do not raise questions about it, there may be a few that do. If we can find signature text phrases that are used by a few people to express skepticism about factual claims and are rarely used to express anything else, we can use those as detectors for rumor clusters. Indeed, we have found a few phrases that seem to be used exactly that way, including: "Is this true?", "Really?", and "What?". Relatively few posts related to any particular rumor use any of these enquiry phrases, but lots of rumor diffusion processes have some posts that do and have them quite early in the diffusion.
   We have developed a technique based on searching for the enquiry phrases, clustering similar posts together, and then collecting related posts that do not contain these simple phrases. We then rank the clusters by their likelihood of really containing a disputed factual claim. The detector, which searches for the very rare but very informative phrases, combined with clustering and a classifier on the clusters, yields surprisingly good performance. On a typical day of Twitter, about a third of the top 50 clusters were judged to be rumors, a high enough precision that human analysts might be willing to sift through them.
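A minimal sketch of the enquiry-phrase idea (illustrative only; the phrase patterns, stop list, and greedy clustering here are hypothetical stand-ins, not the authors' detector):

```python
import re

# Hypothetical enquiry phrases; the real detector's phrase list is curated.
ENQUIRY_PATTERNS = [r"\bis (this|that|it) true\b", r"\breally\?", r"\bwhat\?"]

def find_enquiry_posts(posts):
    """Keep only posts containing a skepticism-signalling phrase."""
    regexes = [re.compile(p, re.IGNORECASE) for p in ENQUIRY_PATTERNS]
    return [p for p in posts if any(r.search(p) for r in regexes)]

def cluster_by_keywords(posts, min_shared=2):
    """Greedy clustering: posts sharing enough content words go together;
    a full system would then pull in related posts lacking the phrases."""
    stop = {"is", "this", "that", "it", "true", "really", "what", "the", "a"}
    clusters = []
    for post in posts:
        words = set(re.findall(r"[a-z]+", post.lower())) - stop
        for cluster in clusters:
            if len(words & cluster["words"]) >= min_shared:
                cluster["posts"].append(post)
                cluster["words"] |= words
                break
        else:
            clusters.append({"posts": [post], "words": words})
    return clusters
```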
Improving User Topic Interest Profiles by Behavior Factorization BIBAFull-Text 1406-1416
  Zhe Zhao; Zhiyuan Cheng; Lichan Hong; Ed H. Chi
Many recommenders aim to provide relevant recommendations to users by building personal topic interest profiles and then using these profiles to find interesting contents for the user. In social media, recommender systems build user profiles by directly combining users' topic interest signals from a wide variety of consumption and publishing behaviors, such as social media posts they authored, commented on, +1'd or liked. Here we propose to separately model users' topical interests that come from these various behavioral signals in order to construct better user profiles. Intuitively, since publishing a post requires more effort, the topic interests coming from publishing signals should be a more accurate indicator of a user's central interest than, say, a simple gesture such as a +1. By separating a single user's interest profile into several behavioral profiles, we obtain better and cleaner topic interest signals, and also enable topic prediction for different types of behavior, such as topics that the user might +1 or comment on but never write a post about.
   To do this at large scales in Google+, we employed matrix factorization techniques to model each user's behaviors as a separate example entry in the input user-by-topic matrix. Using this technique, which we call "behavioral factorization", we implemented and built a topic recommender predicting user's topical interests using their actions within Google+. We experimentally showed that we obtained better and cleaner signals than baseline methods, and are able to more accurately predict topic interests as well as achieve better coverage.
Predicting Pinterest: Automating a Distributed Human Computation BIBAFull-Text 1417-1426
  Changtao Zhong; Dmytro Karamshuk; Nishanth Sastry
Every day, millions of users save content items for future use on sites like Pinterest, by "pinning" them onto carefully categorised personal pinboards, thereby creating personal taxonomies of the Web. This paper seeks to understand Pinterest as a distributed human computation that categorises images from around the Web. We show that despite images being categorised onto personal pinboards by individual actions, there is generally global agreement in implicitly assigning images into a coarse-grained global taxonomy of 32 categories, and furthermore, users tend to specialise in a handful of categories. By exploiting these characteristics, and augmenting with image-related features drawn from a state-of-the-art deep convolutional neural network, we develop a cascade of predictors that together automate a large fraction of Pinterest actions. Our end-to-end model is able to both predict whether a user will repin an image onto her own pinboard, and also which pinboard she might choose, with an accuracy of 0.69 (Accuracy@5 of 0.75).

WWW 2015-05-18 Volume 2


Ads Keyword Rewriting Using Search Engine Results BIBAFull-Text 3-4
  Javad Azimi; Adnan Alam; Ruofei Zhang
Paid Search (PS) ads are one of the main revenue sources of online advertising companies, where the goal is to return a set of relevant ads for a query searched on search engine websites such as Bing. Typical PS algorithms return the ads whose Bid Keywords (BKs) are a subset of the searched query or relevant to it. However, there is a huge gap between BKs and searched queries, as a considerable number of BKs are rarely searched by users. This is mostly due to the rare BKs provided by advertisers. In this paper, we propose an approach to rewrite rare BKs into more commonly searched keywords, without compromising the original BKs' intent, which increases the coverage and depth of PS ads and thus delivers higher monetization power. In general, we first find the relevant web documents pertaining to the BKs and then extract common keywords using the web document titles and their summary snippets. Experimental results show the effectiveness of the proposed algorithm in rewriting rare BKs, yielding a significant improvement in recall and thereby revenue.
Abstractive Meeting Summarization Using Dependency Graph Fusion BIBAFull-Text 5-6
  Siddhartha Banerjee; Prasenjit Mitra; Kazunari Sugiyama
Automatic summarization techniques on meeting conversations developed so far have been primarily extractive, resulting in poor summaries. To improve this, we propose an approach to generate abstractive summaries by fusing important content from several utterances. A meeting generally comprises several discussion topic segments. For each topic segment within a meeting conversation, we aim to generate a one-sentence summary from the most important utterances using an integer linear programming-based sentence fusion approach. Experimental results show that our method can generate more informative summaries than the baselines.
Towards Semantic Retrieval of Hashtags in Microblogs BIBAFull-Text 7-8
  Piyush Bansal; Somay Jain; Vasudeva Varma
On various microblogging platforms like Twitter, users post short text messages ranging from news and information to thoughts and daily chatter. These messages often contain keywords called Hashtags, which are semantico-syntactic constructs that enable topical classification of the microblog posts. In this poster, we propose and evaluate a novel method of semantic enrichment of microblogs for a particular type of entity search -- retrieving a ranked list of the top-k hashtags relevant to a user's query Q. Such a list can help users track posts of their general interest. We show that our technique significantly improves microblog retrieval as well. We tested our approach on the publicly available Stanford sentiment analysis tweet corpus. We observed an improvement of more than 10% in NDCG for the microblog retrieval task, and around 11% in mean average precision for the hashtag retrieval task.
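For reference, the NDCG metric cited above can be computed as follows (a standard textbook formulation with a log2 discount, not code from the paper):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items:
    each relevance is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """Normalize DCG by the DCG of the ideal (sorted) ranking,
    so a perfect ranking scores 1.0."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```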
Modeling and Predicting Popularity Dynamics of Microblogs using Self-Excited Hawkes Processes BIBAFull-Text 9-10
  Peng Bao; Hua-Wei Shen; Xiaolong Jin; Xue-Qi Cheng
The ability to model and predict the popularity dynamics of individual user-generated items on online media has important implications in a wide range of areas. In this paper, we propose a probabilistic model using a Self-Excited Hawkes Process (SEHP) to characterize the process through which individual microblogs gain their popularity. This model explicitly captures the triggering effect of each forwarding, distinguishing itself from the reinforced Poisson process based model, where all previous forwardings are simply aggregated as a single triggering effect. We validate the proposed model by applying it to Sina Weibo, the most popular microblogging network in China. Experimental results demonstrate that the SEHP model consistently outperforms the model based on the reinforced Poisson process.
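The per-event triggering effect the SEHP captures can be illustrated with the standard Hawkes intensity, where every past forwarding contributes its own decaying kick rather than being aggregated into one term (an illustrative sketch assuming an exponential kernel; the paper's kernel and parameters may differ):

```python
import math

def sehp_intensity(t, history, mu, alpha, beta):
    """Self-excited Hawkes intensity at time t: a background rate mu
    plus a separate, exponentially decaying contribution
    alpha * exp(-beta * (t - t_i)) from each earlier forwarding t_i."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)
```

With no history the intensity reduces to the background rate, and each new forwarding temporarily raises the chance of further forwardings.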
Evaluating User Targeting Policies: Simulation Based on Randomized Experiment Data BIBAFull-Text 11-12
  Joel Barajas; Ram Akella; Marius Holtan
We propose a user targeting simulator for online display advertising. Based on the responses of 37 million visiting users (targeted and non-targeted) and their features, we simulate different user targeting policies. We provide evidence that the standard conversion optimization policy shows effectiveness similar to that of random targeting, and is significantly inferior to other causally optimized targeting policies.
A Comparison of Supervised Keyphrase Extraction Models BIBAFull-Text 13-14
  Florin Bulgarov; Cornelia Caragea
Keyphrases for a document provide a high-level topic description of the document. Given that the number of documents on the Web has grown exponentially in recent years, accurate methods for extracting keyphrases from such documents are greatly needed. In this study, we provide a comparison of existing supervised approaches to this task to determine the current best-performing model. We use research articles on the Web as the case study.
ControVol: Let Yesterday's Data Catch Up with Today's Application Code BIBAFull-Text 15-16
  Thomas Cerqueus; Eduardo Cunha de Almeida
In building software-as-a-service applications, a flexible development environment is key to shipping early and often. Therefore, schema-flexible data stores are becoming more and more popular. They can store data with heterogeneous structure, allowing new releases to be pushed frequently without having to migrate legacy data first. However, the current application code must continue to work with any legacy data that has already been persisted in production. To let legacy data structurally "catch up" with the latest application code, developers commonly employ object mapper libraries with life-cycle annotations. Yet when used without caution, they can cause runtime errors and even data loss. We present ControVol, an IDE plugin that detects evolutionary changes to the application code that are incompatible with legacy data. ControVol warns developers at development time, and even suggests automatic fixes for lazily migrating legacy data when it is loaded into the application. Thus, ControVol ensures that the structure of legacy data can catch up with the structure expected by the latest software release.
Dataset Descriptions for Optimizing Federated Querying BIBAFull-Text 17-18
  Angelos Charalambidis; Stasinos Konstantopoulos; Vangelis Karkaletsis
Dataset description vocabularies focus on provenance, versioning, licensing, and similar metadata. VoID is a notable exception: it provides some expressivity for describing subsets and their contents, and can, to some extent, be used for discovering relevant resources and for optimizing querying. In this poster we describe an extension of VoID that provides the expressivity needed to support the query planning methods typically used in federated querying.
Online Learning to Rank: Absolute vs. Relative BIBAFull-Text 19-20
  Yiwei Chen; Katja Hofmann
Online learning to rank holds great promise for learning personalized search result rankings. The first algorithms have been proposed: absolute feedback approaches, based on contextual bandit learning, and relative feedback approaches, based on gradient methods and inferred preferences between complete result rankings. Both types of approaches have shown promise, but they have not previously been compared to each other. It is therefore unclear which type of approach is the most suitable for which online learning to rank problems. In this work we present the first empirical comparison of absolute and relative online learning to rank approaches.
Mouse Clicks Can Recognize Web Page Visitors! BIBAFull-Text 21-22
  Daniela Chuda; Peter Kratky; Jozef Tvarozek
Behavioral biometrics based on mouse usage can be used to recognize one's identity, with special applications in anonymous Web browsing. Out of the many features that describe browsing behavior, mouse clicks (or touches), as the most basic navigation actions, provide a stable stream of behavioral data. The paper describes a method to recognize a Web user according to three click features. The distance-based classification, comparing cumulative distribution functions, achieves high recognition accuracy even with hundreds of users.
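A plausible sketch of such distance-based CDF comparison (illustrative only; the paper's three actual click features and exact distance measure are not specified here, so a Kolmogorov-Smirnov-style statistic over a single feature stands in for them):

```python
import bisect

def empirical_cdf(samples):
    """Return F(x) = fraction of samples <= x."""
    xs = sorted(samples)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

def cdf_distance(a, b):
    """Kolmogorov-Smirnov-style distance: the largest gap between the
    two empirical CDFs, evaluated at every observed value."""
    ca, cb = empirical_cdf(a), empirical_cdf(b)
    return max(abs(ca(x) - cb(x)) for x in sorted(set(a) | set(b)))

def identify(session, enrolled):
    """Attribute a browsing session to the enrolled user whose stored
    click-feature distribution is closest to the session's."""
    return min(enrolled, key=lambda user: cdf_distance(session, enrolled[user]))
```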
Geo Data Annotator: a Web Framework for Collaborative Annotation of Geographical Datasets BIBAFull-Text 23-24
  Stefano Cresci; Davide Gazzè; Angelica Lo Duca; Andrea Marchetti; Maurizio Tesconi
In this paper we illustrate the Geo Data Annotator (GDA), a framework which helps a user build a ground-truth dataset starting from two separate geographical datasets. GDA exploits two kinds of indices to ease the task of manual annotation: geographical-based and string-based. GDA also provides a mechanism to evaluate the quality of the built ground-truth dataset. This is achieved through a collaborative platform which allows many users to work on the same project. The quality evaluation is based on annotator agreement, computed using Fleiss' kappa statistic.
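Fleiss' kappa, the agreement statistic mentioned above, can be computed as follows (a standard formulation; the count-matrix input layout is an assumption, not GDA's interface):

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators assigning item i to category j;
    every row must sum to the same number of annotators n.
    Returns (P_bar - P_e) / (1 - P_e): observed agreement beyond chance."""
    N = len(ratings)          # number of items
    n = sum(ratings[0])       # annotators per item
    k = len(ratings[0])       # number of categories
    # proportion of all assignments falling in each category
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # per-item observed agreement
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P) / N
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```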
Online View Maintenance for Continuous Query Evaluation BIBAFull-Text 25-26
  Soheila Dehghanzadeh; Alessandra Mileo; Daniele Dell'Aglio; Emanuele Della Valle; Shen Gao; Abraham Bernstein
In Web stream processing, there are queries that integrate Web data of various velocity, categorized broadly as streaming (i.e., fast changing) and background (i.e., slow changing) data. The introduction of local views on the background data speeds up the query answering process, but requires maintenance processes to keep the replicated data up-to-date. In this work, we study the problem of maintaining local views in a Web setting, where background data are usually stored remotely, are exposed through services with constraints on data access (e.g., invocation rate limits and data access patterns) and, contrary to the database setting, do not provide streams of changes over their content. We then propose an initial solution: WBM, a method to maintain the content of the view with regard to query and user-defined constraints on accuracy and responsiveness.
FedWeb Greatest Hits: Presenting the New Test Collection for Federated Web Search BIBAFull-Text 27-28
  Thomas Demeester; Dolf Trieschnigg; Dong Nguyen; Djoerd Hiemstra; Ke Zhou
This paper presents 'FedWeb Greatest Hits', a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.
Hate Speech Detection with Comment Embeddings BIBAFull-Text 29-30
  Nemanja Djuric; Jing Zhou; Robin Morris; Mihajlo Grbovic; Vladan Radosavljevic; Narayan Bhamidipati
We address the problem of hate speech detection in online user comments. Hate speech, defined as an "abusive speech targeting specific group characteristics, such as ethnicity, religion, or gender", is an important problem plaguing websites that allow users to leave feedback, having a negative impact on their online business and overall user experience. We propose to learn distributed low-dimensional representations of comments using recently proposed neural language models that can then be fed as inputs to a classification algorithm. Our approach addresses issues of high-dimensionality and sparsity that impact the current state-of-the-art, resulting in highly efficient and effective hate speech detectors.
What's Hot in The Theme: Query Dependent Emerging Topic Extraction from Social Streams BIBAFull-Text 31-32
  Yuki Endo; Hiroyuki Toda; Yoshimasa Koike
Analyzing emerging topics from social media enables users to get an overview of social movements and enables web services to adopt current trends. Although existing studies mainly focus on extracting global emerging topics, efficiently extracting local ones related to a specific theme remains a challenging and unavoidable problem in social media analysis. We focus on extracting emerging social topics related to user-specified query words, and propose an extraction framework that uses non-negative matrix factorization (NMF) modified to detect temporal concentration and reduce noise. We conduct preliminary experiments verifying our method on a Twitter dataset.
Fast Search for Distance Dependent Chinese Restaurant Processes BIBAFull-Text 33-34
  Weiwei Feng; Peng Wang; Chuan Zhou; Li Guo; Peng Zhang
The distance dependent Chinese Restaurant Process (dd-CRP), a nonparametric Bayesian model, can model distance-sensitive data. Existing inference algorithms for dd-CRP, such as Markov Chain Monte Carlo (MCMC) and variational algorithms, are inefficient and unable to handle massive online data, because posterior distributions of dd-CRP are not marginally invariant. To solve this problem, we present a fast inference algorithm for dd-CRP based on A-star search. Experimental results show that the new search algorithm is faster than existing dd-CRP inference algorithms with comparable results.
ASIM: A Scalable Algorithm for Influence Maximization under the Independent Cascade Model BIBAFull-Text 35-36
  Sainyam Galhotra; Akhil Arora; Srinivas Virinchi; Shourya Roy
The steady growth of graph data from social networks has resulted in wide-spread research in finding solutions to the influence maximization problem. Although TIM is one of the fastest existing algorithms, it cannot be deemed scalable owing to its exorbitantly high memory footprint. In this paper, we address the scalability aspects -- memory consumption and running time -- of the influence maximization problem. We propose ASIM, a scalable algorithm capable of running within practical compute times on commodity hardware. Empirically, ASIM is 6-8 times faster than CELF++ with similar memory consumption, while its memory footprint is ≈200 times smaller than TIM's.
Search Retargeting using Directed Query Embeddings BIBAFull-Text 37-38
  Mihajlo Grbovic; Nemanja Djuric; Vladan Radosavljevic; Narayan Bhamidipati
Determining the user audience for online ad campaigns is a critical problem for companies competing in the online advertising space. One of the most popular strategies is search retargeting, which involves targeting users that issued search queries related to the advertiser's core business, commonly specified by advertisers themselves. However, advertisers often fail to include many relevant queries, which results in suboptimal campaigns and negatively impacts revenue for both advertisers and publishers. To address this issue, we use recently proposed neural language models to learn low-dimensional, distributed query embeddings, which can be used to expand query lists with related queries through simple nearest neighbor searches in the embedding space. Experiments on a real-world data set strongly suggest the benefits of the approach.
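The nearest-neighbor expansion step can be sketched in a few lines (illustrative only; real query embeddings would be learned by the neural language model, whereas the toy 2-d vectors here are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(seed, embeddings, top_k=2):
    """Rank all other queries by cosine similarity to the seed query's
    embedding and return the closest top_k as expansion candidates."""
    seed_vec = embeddings[seed]
    scored = sorted(((cosine(seed_vec, vec), q)
                     for q, vec in embeddings.items() if q != seed),
                    reverse=True)
    return [q for _, q in scored[:top_k]]

# Toy usage with made-up vectors standing in for learned embeddings:
toy = {"cheap flights": [1.0, 0.0],
       "airline tickets": [0.9, 0.1],
       "pizza": [0.0, 1.0]}
```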
Identifying Successful Investors in the Startup Ecosystem BIBAFull-Text 39-40
  Srishti Gupta; Robert Pienta; Acar Tamersoy; Duen Horng Chau; Rahul C. Basole
Who can spot the next Google, Facebook, or Twitter? Who can discover the next billion-dollar startups? Measuring investor success is a challenging task, as investment strategies can vary widely. We propose InvestorRank, a novel method for identifying successful investors by analyzing how an investor's collaboration network changes over time. InvestorRank captures the intuition that a successful investor achieves increasing success in spotting great startups, or is able to keep doing so persistently. Our results show potential in discovering relatively unknown investors that may be the success stories of tomorrow.
Towards Serving "Delicious" Information within Its Freshness Date BIBAFull-Text 41-42
  Hao Han; Takashi Nakayama; Junxia Guo; Keizo Oyama
Like the freshness date of food, Web information also has its "shelf life". In this paper, we exploratively study how the shelf life of information is reflected in browsing behaviors. Our analysis shows that satisfaction with browsing could be improved if search engines took the shelf life of information into account.
FluTCHA: Using Fluency to Distinguish Humans from Computers BIBAFull-Text 43-44
  Kotaro Hara; Mohammad T. Hajiaghayi; Benjamin B. Bederson
Improvements in image understanding technologies are making it possible for computers to pass traditional CAPTCHA tests with high probability. This suggests the need for new kinds of tasks that are easy to accomplish for humans but remain difficult for computers. In this paper, we introduce Fluency CAPTCHA (FluTCHA), a novel method to distinguish humans from computers using the fact that humans are better than machines at improving the fluency of sentences. We propose a way to let users work on FluTCHA tests and simultaneously complete useful linguistic tasks. Evaluation demonstrates the feasibility of using FluTCHA to distinguish humans from computers.
Online Event Recommendation for Event-based Social Networks BIBAFull-Text 45-46
  Xiancai Ji; Zhi Qiao; Mingze Xu; Peng Zhang; Chuan Zhou; Li Guo
With the rapid growth of event-based social networks, the demand for event recommendation becomes increasingly important. However, existing event recommendation approaches work in a batch learning fashion. Such approaches are impractical for real-world recommender systems where training data often arrive sequentially. Hence, we present an online event recommendation method. Experimental results on several real-world datasets demonstrate the utility of our method.
Entity-driven Type Hierarchy Construction for Freebase BIBAFull-Text 47-48
  Jyun-Yu Jiang; Pu-Jen Cheng; Chin-Yew Lin
The hierarchical structure of a knowledge base system can lead to various valuable applications; however, many knowledge base systems do not have such a property. In this paper, we propose an entity-driven approach to automatically construct the hierarchical structure of entities for knowledge base systems. By deriving type dependencies from entity information, an initial graph of types is constructed, and then modified into a hierarchical structure by several graph algorithms. Experimental results show the effectiveness of our method in constructing a reasonable type hierarchy for knowledge base systems.
Multi-Aspect Collaborative Filtering based on Linked Data for Personalized Recommendation BIBAFull-Text 49-50
  Han-Gyu Ko; Joo-Sik Son; In-Young Ko
Since users often consider more than one aspect when they choose an item, previous research introduced multi-criteria recommender systems and showed that multi-criteria ratings add value to existing CF-based recommender systems, providing more accurate recommendation results to users. However, all the previous works require multi-criteria ratings given explicitly by users, while most existing datasets such as Netflix and MovieLens are single-criterion. Therefore, to take advantage of multi-criteria recommendation, there must be a way to extract the necessary aspects and analyze users' preferences on those aspects from a given single-criterion dataset. In this paper, we propose an approach that utilizes semantic information of items to extract essential aspects and perform multi-aspect collaborative filtering, recommending items to users in a personalized manner.
Extracting Taxonomies from Bipartite Graphs BIBAFull-Text 51-52
  Tobias Kötter; Stephan Günnemann; Christos Faloutsos; Michael R. Berthold
Given a large bipartite graph that represents objects and their properties, how can we automatically extract semantic information that provides an overview of the data and -- at the same time -- enables us to drill down to specific parts for an in-depth analysis? In this work in-progress paper, we propose extracting a taxonomy that models the relation between the properties via an is-a hierarchy. The extracted taxonomy arranges the properties from general to specific providing different levels of abstraction.
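One simple way to picture an is-a hierarchy emerging from a bipartite object-property graph is set containment between property extents: a property whose object set is contained in another's is the more specific one (a hedged illustration of the general-to-specific arrangement; the paper's actual extraction method may differ):

```python
def isa_edges(bipartite):
    """bipartite maps each property to the set of objects carrying it.
    Emit (specific, general) pairs whenever the specific property's
    object set is strictly contained in the general property's set."""
    return [(a, b)
            for a, objs_a in bipartite.items()
            for b, objs_b in bipartite.items()
            if a != b and objs_a < objs_b]  # strict subset => a is-a b
```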
Tweet-Recommender: Finding Relevant Tweets for News Articles BIBAFull-Text 53-54
  Ralf Krestel; Thomas Werkmeister; Timur Pratama Wiradarma; Gjergji Kasneci
Twitter has become a prime source for disseminating news and opinions. However, the length of tweets prohibits detailed descriptions; instead, tweets sometimes contain URLs that link to detailed news articles. In this paper, we devise generic techniques for recommending tweets for any given news article. To evaluate and compare the different techniques, we collected tens of thousands of tweets and news articles and conducted a user study on the relevance of recommendations.
Temporality in Online Food Recipe Consumption and Production BIBAFull-Text 55-56
  Tomasz Kusmierczyk; Christoph Trattner; Kjetil Nørvåg
In this paper, we present work-in-progress of a recently started research effort that aims at understanding the hidden temporal dynamics in online food communities. In this context, we have mined and analyzed temporal patterns in recipe production and consumption on a large German community platform. As our preliminary results reveal, there is indeed a range of hidden temporal patterns in food preferences, and in particular in consumption and production. We believe that this kind of research can be important for future work in personalized Web-based information access and in particular recommender systems.
Finding the Differences between the Perceptions of Experts and the Public in the Field of Diabetes BIBAFull-Text 57-58
  Dahee Lee; Won Chul Kim; Min Song
Automatic information extraction techniques such as named entity recognition and relation extraction have been developed, but they have rarely been applied to diverse document types. In this paper, we apply them to academic literature and social media content in the field of diabetes to find distinctions between the perceptions of biomedical experts and the public. We analyze and compare the experts' and the public's networks constituted by the extracted entities and relations. The results confirm that there are some differences in their views, i.e., in the biomedical entities that interest them and the relations within their knowledge range.
A Graph-Based Recommendation Framework for Price-Comparison Services BIBAFull-Text 59-60
  Sang-Chul Lee; Sang-Wook Kim; Sunju Park
In this paper, we propose a set of recommendation strategies and develop a graph-based framework for recommendation in online price-comparison services. We verify the superiority of the proposed framework by comparing it with existing methods using real-world data.
A Descriptive Analysis of a Large-Scale Collection of App Management Activities BIBAFull-Text 61-62
  Huoran Li; Xuanzhe Liu; Wei Ai; Qiaozhu Mei; Feng Feng
Smartphone users have adopted an increasing number of mobile applications (a.k.a. apps) in recent years. Investigating how people manage mobile apps in their everyday lives creates a unique opportunity to understand the behaviors and preferences of mobile users. Existing literature provides very limited understanding of app management activities, due to the lack of user behavioral data at scale. This paper analyzes a very large collection of app management logs from users of a leading Android app marketplace in China. The data set covers one month of detailed activities of how users download, update, and uninstall the apps on their smart devices, involving 8,306,181 anonymized users and 394,661 apps. We characterize how these users manage the apps on their devices and identify behavioral patterns that correlate with users' online ratings of the apps.
Feature Selection for Sentiment Classification Using Matrix Factorization BIBAFull-Text 63-64
  Jiguang Liang; Xiaofei Zhou; Li Guo; Shuo Bai
Feature selection is a critical task in both sentiment classification and topical text classification. However, most existing feature selection algorithms ignore a significant contextual difference between the two: sentiment classification commonly depends more on the words conveying sentiment. Based on this observation, a new feature selection method based on matrix factorization is proposed to identify the words with strong inter-sentiment distinguishability and intra-sentiment similarity. Furthermore, experiments show that our models require fewer features while still maintaining reasonable classification accuracy.
Inferring and Exploiting Categories for Next Location Prediction BIBAFull-Text 65-66
  Ankita Likhyani; Deepak Padmanabhan; Srikanta Bedathur; Sameep Mehta
Predicting the next location of a user based on their previous visiting pattern is one of the primary tasks over data from location-based social networks (LBSNs) such as Foursquare. Many different aspects of these so-called "check-in" profiles of a user have been made use of in this task, including spatial and temporal information of check-ins as well as the social network information of the user. Building more sophisticated prediction models by enriching these check-in data with information from other sources is challenging due to the limited data that LBSNs expose owing to privacy concerns. In this paper, we propose a framework that uses the location data from LBSNs and combines it with data from maps to associate a set of venue categories with these locations. For example, if the user is found to be checking in at a mall that has cafes, cinemas and restaurants according to the map, all this information is associated. This category information is then leveraged to predict the user's next check-in location. Our experiments with a publicly available check-in dataset show that this approach improves on the state-of-the-art methods for location prediction.
A Word Vector and Matrix Factorization Based Method for Opinion Lexicon Extraction BIBAFull-Text 67-68
  Zheng Lin; Weiping Wang; Xiaolong Jin; Jiguang Liang; Dan Meng
Automatic opinion lexicon extraction has attracted much attention and many methods have been proposed. However, most existing methods depend on dictionaries (e.g., WordNet), which confines their applicability. For instance, dictionary-based methods are unable to find domain-dependent opinion words, because the entries in a dictionary are usually domain-independent. There also exist corpus-based methods that directly extract opinion lexicons from reviews. However, they rely heavily on sentiment seed words, which carry limited sentiment information, and do not fully consider context information. To overcome these problems, this paper presents a word vector and matrix factorization based method for automatically extracting opinion lexicons from reviews of different domains and further identifying the sentiment polarities of the words. Experiments on real datasets demonstrate that the proposed method is effective and performs better than the state-of-the-art methods.
Collaborative Datasets Retrieval for Interlinking on Web of Data BIBAFull-Text 69-70
  Haichi Liu; Jintao Tang; Dengping Wei; Peilei Liu; Hong Ning; Ting Wang
Dataset interlinking is an important problem in Linked Data. In this paper, we consider it from an information retrieval perspective and propose a learning-to-rank framework that combines various similarity measures to retrieve the datasets relevant to a given dataset. Specifically, inspired by the idea of collaborative filtering, we propose an effective similarity measure called collaborative similarity. Experimental results show that the collaborative similarity measure is effective for dataset interlinking and that the learning-to-rank framework significantly increases performance.
Contextual Query Intent Extraction for Paid Search Selection BIBAFull-Text 71-72
  Pengqi Liu; Javad Azimi; Ruofei Zhang
Paid search algorithms play an important role in online advertising, where a set of related ads is returned for a searched query. These algorithms mostly consist of two main steps. First, a given query is converted into different sub-queries or similar phrases that preserve the core intent of the query. Second, the generated sub-queries are matched against the bid keywords of ads in the data set, and the set of ads with the highest utility, measuring relevance to the original query, is returned. This paper focuses on optimizing the first step by proposing a contextual query intent extraction algorithm that generates, online, the sub-queries that best preserve the intent of the original query. Experimental results on a very large real-world data set demonstrate the superior performance of the proposed approach in optimizing both relevance and monetization metrics compared with one of the existing successful algorithms in our system.
Towards Hierarchies of Search Tasks & Subtasks BIBAFull-Text 73-74
  Rishabh Mehrotra; Emine Yilmaz
Current search systems do not adequately support users tackling complex tasks, so the cognitive burden of keeping track of such tasks is placed on the searcher. In contrast to recent approaches to search task extraction, a more naturalistic viewpoint treats query logs as hierarchies of tasks, with each search task decomposed into more focused sub-tasks. In this work, we propose an efficient Bayesian nonparametric model for extracting hierarchies of such tasks & subtasks. The proposed approach makes use of the multi-relational aspect of query associations, which is important in identifying query-task associations. We describe a greedy agglomerative model selection algorithm based on the Gamma-Poisson conjugate mixture that takes just one pass through the data to learn a fully probabilistic, hierarchical model of trees, capable of learning trees with arbitrary branching structures rather than the more common binary trees. We evaluate our method on real-world query log data using query term prediction. To the best of our knowledge, this work is the first to consider hierarchies of search tasks and subtasks.
On Topology of Baidu's Association Graph Based on General Recommendation Engine and Users' Behavior BIBAFull-Text 75-76
  Cong Men; Wanwan Tang; Po Zhang; Junqi Hou
To better meet users' underlying navigational needs, search engines like Baidu have developed general recommendation engines that provide related entities on the right side of the search engine results page (SERP). However, users' behavior after the association of individual queries in the search engine has not been well investigated. To better understand users' navigational activities, we propose a new method that maps users' behavior to an association graph and analyzes that graph. We find interesting properties, such as clustering and assortativity, in this association graph. This study provides a new perspective on research into semantic networks and users' navigational behavior on the SERP.
Join Size Estimation on Boolean Tensors of RDF Data BIBAFull-Text 77-78
  Saskia Metzler; Pauli Miettinen
The Resource Description Framework (RDF) represents information as subject-predicate-object triples. These triples are commonly interpreted as a directed labelled graph. We instead interpret the data as a 3-way Boolean tensor, on which standard SPARQL queries can be expressed using elementary Boolean algebra operations. We show how this representation helps to estimate the size of joins. Such estimates are valuable for query handling, and our approach might yield more efficient implementations of SPARQL query processors.
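To make the tensor view concrete, here is an illustrative sketch (not the paper's method): each predicate gives one frontal slice of the Boolean tensor, and the number of bindings of a two-step join pattern is a sum over products of entries of two slices. The paper estimates this quantity; the sketch computes it exactly on toy data with hypothetical triples.

```python
from collections import defaultdict

def predicate_slice(triples, predicate):
    """One frontal slice of the Boolean tensor: a subject x object 0/1
    matrix, stored sparsely as subject -> set of objects."""
    m = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            m[s].add(o)
    return m

def join_size(slice_a, slice_b):
    """Number of (x, y, z) bindings for the pattern ?x p1 ?y . ?y p2 ?z:
    a sum over entry-wise products of the two Boolean slices."""
    return sum(len(slice_b.get(y, ())) for objs in slice_a.values() for y in objs)
```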
Navigation Leads Selection Considering Navigational Value of Keywords BIBAFull-Text 79-80
  Robert Moro; Maria Bielikova
Searching a vast information space such as the Web is a challenging task, and even more so if the domain is unknown and the task is thus exploratory in nature. We have proposed a method of exploratory navigation based on navigation leads, i.e., terms that help users filter the information space of a digital library. In this paper, we focus on selecting leads according to their navigational value. We employ clustering based on topic modeling using LDA (Latent Dirichlet Allocation). We present results of a preliminary evaluation on the Annota dataset, which contains more than 50,000 research papers.
A Recommender System for Connecting Patients to the Right Doctors in the HealthNet Social Network BIBAFull-Text 81-82
  Fedelucio Narducci; Cataldo Musto; Marco Polignano; Marco de Gemmis; Pasquale Lops; Giovanni Semeraro
In this work we present a semantic recommender system able to suggest the doctors and hospitals that best fit a specific patient profile. The recommender system is the core component of the social network HealthNet (HN). The recommendation algorithm first computes similarities among patients and then generates a ranked list of doctors and hospitals suitable for a given patient profile, exploiting health data shared by the community. Accordingly, an HN user can find the patients most similar to her, see how they treated their diseases, and receive suggestions for solving her own problem. Currently, the alpha version of HN is available only for Italian users, but in the near future we plan to extend the platform to other languages. We organized three focus groups with patients, practitioners, and health organizations to obtain comments and suggestions. All of them proved to be very enthusiastic about using the HN platform.
The Importance of Pronouns to Sentiment Analysis: Online Cancer Survivor Network Case Study BIBAFull-Text 83-84
  Nir Ofek; Lior Rokach; Cornelia Caragea; John Yen
Online health communities are a major source for patients and their informal caregivers when gathering information and seeking social support. The Cancer Survivors Network of the American Cancer Society has many users and hosts a large number of user interactions regarding coping with cancer. Sentiment analysis is an important step in understanding members' needs and concerns and the impact of users' responses on other members: it aims to determine participants' subjective attitudes and reflect their emotions. Analyzing the sentiment of posts in online health communities enables the investigation of factors such as what drives sentiment change, and the discovery of sentiment change patterns. Since each writer has his or her own personality and temporal emotional state, behavioral traits can be reflected in the writer's writing style. Pronouns are function words that often convey unique styling patterns in text. Drawing on a lexical approach to emotions, we conduct factor analysis on the use of pronouns in self-description texts. Our analysis shows that the usage of pronouns affects sentiment classification. Moreover, we evaluated the use of pronouns in our domain and found it to differ from standard English usage.
A Semantic Hybrid Approach for Sound Recommendation BIBAFull-Text 85-86
  Vito Claudio Ostuni; Tommaso Di Noia; Eugenio Di Sciascio; Sergio Oramas; Xavier Serra
In this work we describe a hybrid approach for recommending sounds to users by exploiting and semantically enriching textual information such as tags and sound descriptions. As a case study we used Freesound, a popular site for sharing sound samples with more than 4 million registered users. Tags and textual sound descriptions are exploited to extract and link entities to external ontologies such as WordNet and DBpedia. The enriched data are then merged with a domain-specific tagging ontology to form a knowledge graph, on which recommendations are computed using a semantic version of the feature-combination hybrid approach. An evaluation on historical data shows improvements with respect to state-of-the-art collaborative algorithms.
Exploring Communities for Effective Location Prediction BIBAFull-Text 87-88
  Jun Pang; Yang Zhang
Humans are social animals: they interact with different communities to conduct different activities. The literature has shown that human mobility is constrained by social relations. In this work, we investigate the social influence of a user's communities on his mobility in order to predict locations effectively. Through analysis of a real-life dataset, we demonstrate that (1) a user is influenced more by his communities than by all his friends; (2) his mobility is influenced by only a small subset of his communities; and (3) the influence from communities depends on social context. We further use an SVM to predict a user's future location based on his community information. Experimental results show that the community-based model yields more effective predictions than the friend-based one.
Investigating Factors Affecting Personal Data Disclosure BIBAFull-Text 89-90
  Christos Perentis; Michele Vescovi; Bruno Lepri
Mobile devices, sensors and social networks have dramatically increased the collection and sharing of personal and contextual information about individuals. Hence, users constantly make disclosure decisions on the basis of a difficult trade-off between using services and protecting their data. Understanding the factors linked to the disclosure of personal information is a step toward assisting users in these decisions. In this paper, we model the disclosure of personal information and investigate its relationship not only with demographic and self-reported individual characteristics, but also with real behavior inferred from mobile phone usage. Preliminary results show that real behavior captured from mobile data correlates with actual sharing behavior, providing the basis for future predictive models.
Exact Age Prediction in Social Networks BIBAFull-Text 91-92
  Bryan Perozzi; Steven Skiena
Predicting accurate demographic information about the users of information systems is a problem of interest in personalized search, ad targeting, and related fields. Despite such broad applications, most existing work treats age prediction as a classification problem, typically with only a few broad categories.
   Here, we instead treat exact age prediction in social networks as a regression problem. Our proposed method learns social representations that capture community information for use as covariates. In preliminary experiments on a large real-world social network, it predicts age within 4.15 years on average, strongly outperforming standard network regression techniques when labeled data is sparse.
Aligning Multi-Cultural Knowledge Taxonomies by Combinatorial Optimization BIBAFull-Text 93-94
  Natalia Prytkova; Gerhard Weikum; Marc Spaniol
Large collections of digital knowledge have become valuable assets for search and recommendation applications. The taxonomic type systems of such knowledge bases are often highly heterogeneous, as they reflect different cultures, languages, and intended usages. We present a novel method for the problem of multi-cultural knowledge alignment, which maps each node of a source taxonomy onto a ranked list of the most suitable nodes in the target taxonomy. We model this task as a combinatorial optimization problem, using integer linear programming and quadratic programming. The quality of the computed alignments is evaluated using large heterogeneous taxonomies of book categories.
Exploring Heterogeneity for Multi-Domain Recommendation with Decisive Factors Selection BIBAFull-Text 95-96
  Shuang Qiu; Jian Cheng; Xi Zhang; Hanqing Lu
To address recommendation problems in multi-domain scenarios, we propose in this paper a novel method, HMRec, which models both the consistency and the heterogeneity of users' multiple behaviors in a unified framework. Moreover, our approach successfully captures the decisive factors of each domain. Experiments on a real multi-domain dataset demonstrate the effectiveness of our model.
Crossing the Boundaries of Communities via Limited Link Injection for Information Diffusion In Social Networks BIBAFull-Text 97-98
  Dimitrios Rafailidis; Alexandros Nanopoulos
We propose a new link-injection method aimed at boosting the overall diffusion of information in social networks. Our approach is based on a diffusion-coverage score of each user's ability to spread information over the network. Candidate links for injection are identified by a matrix factorization technique, and links are injected by attaching them to users according to their score. We additionally perform clustering to identify communities, so as to inject links that cross community boundaries. In experiments with five real-world networks, we demonstrate that our method can significantly boost information diffusion with limited link injection, which is essential for real-world applications.
Repeat Consumption Recommendation Based on Users Preference Dynamics and Side Information BIBAFull-Text 99-100
  Dimitrios Rafailidis; Alexandros Nanopoulos
We present a coupled tensor factorization model to recommend items with repeat consumption over time. We introduce a measure that captures the rate at which each user's preferences shift over time. Repeat consumption recommendations are generated by factorizing the coupled tensor, weighting the importance of past user preferences according to the captured rate. We also propose a variant that accounts for the diversity of side information by weighting more heavily users with rarer side information. Our experiments with real-world datasets from last.fm and MovieLens demonstrate that the proposed models outperform several baselines.
Spread it Good, Spread it Fast: Identification of Influential Nodes in Social Networks BIBAFull-Text 101-102
  Maria-Evgenia G. Rossi; Fragkiskos D. Malliaros; Michalis Vazirgiannis
Understanding and controlling spreading dynamics in networks presupposes identifying the influential nodes that trigger efficient information diffusion. It has been shown that the best spreaders are the nodes located in the core of the network, as produced by the k-core decomposition. In this paper we further refine the set of the most influential nodes, showing that the nodes belonging to the best K-truss subgraph, as identified by the K-truss decomposition of the network, perform even better, leading to faster and wider epidemic spreading.
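For readers unfamiliar with the k-core decomposition this abstract builds on, here is a minimal peeling-based sketch (the K-truss refinement of the paper is analogous but operates on edges and triangles; the graph below is hypothetical).

```python
def core_numbers(adj):
    """k-core decomposition by iterative peeling: repeatedly remove a
    minimum-degree node; the running maximum of removal degrees gives
    each node's core number."""
    degree = {v: len(ns) for v, ns in adj.items()}
    core, k = {}, 0
    while degree:
        v = min(degree, key=degree.get)  # lowest remaining degree
        k = max(k, degree[v])
        core[v] = k
        del degree[v]
        for u in adj[v]:                 # peel v away from its neighbours
            if u in degree:
                degree[u] -= 1
    return core
```

Nodes with the highest core number form the innermost core, the region where the best spreaders are found.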
Probabilistic Deduplication of Anonymous Web Traffic BIBAFull-Text 103-104
  Rishiraj Saha Roy; Ritwik Sinha; Niyati Chhaya; Shiv Saini
Cookies and login-based authentication often provide incomplete data for stitching website visitors across multiple sources, necessitating probabilistic deduplication. We address this challenge by formulating the problem as a binary classification task over pairs of anonymous visitors. We compute visitor proximity vectors by converting categorical variables with very high cardinalities, such as IP addresses, product search keywords and URLs, into continuous numeric variables using the Jaccard coefficient for each attribute. Our method achieves about 90% AUC and F-scores in identifying whether two cookies map to the same visitor, while providing insights into the relative importance of the features available in Web analytics for deduplication.
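The proximity-vector construction described above can be sketched as follows; the attribute names are hypothetical, and the real system would feed these vectors into a binary classifier rather than use them directly.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (defined as 0 when both are empty)."""
    return len(a & b) / len(a | b) if a or b else 0.0

def proximity_vector(v1, v2, attributes=("ips", "keywords", "urls")):
    """One continuous feature per high-cardinality categorical attribute:
    the Jaccard overlap of the two visitors' observed values."""
    return [jaccard(v1[attr], v2[attr]) for attr in attributes]
```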
Pushing the Limits of Instance Matching Systems: A Semantics-Aware Benchmark for Linked Data BIBAFull-Text 105-106
  Tzanina Saveta; Evangelia Daskalaki; Giorgos Flouris; Irini Fundulaki; Melanie Herschel; Axel-Cyrille Ngonga Ngomo
The architectural choices behind the Data Web have led to the publication of large interrelated data sets that contain different descriptions of the same real-world objects. Due to the sheer size of current online datasets, such duplicate instances are most commonly detected (semi-)automatically using instance matching frameworks. Choosing the right framework for this purpose remains tedious, as current instance matching benchmarks fail to provide end users and developers with the necessary insights into how current frameworks behave when dealing with real data. In this poster, we present the Semantic Publishing Instance Matching Benchmark (SPIMBENCH), which allows the benchmarking of instance matching systems not only against structure-based and value-based test cases, but also against semantics-aware test cases based on OWL axioms. SPIMBENCH features a scalable data generator and a weighted gold standard that can be used for debugging instance matching systems and for reporting how well they perform on various matching tasks.
Propagating Expiration Decisions in a Search Engine Result Cache BIBAFull-Text 107-108
  Fethi Burak Sazoglu; Özgür Ulusoy; Ismail Sengor Altingovde; Rifat Ozcan; Berkant Barla Cambazoglu
Detecting stale queries in a search engine result cache is an important problem. In this work, we propose a mechanism that propagates the expiration decision for a query to similar queries in the cache to re-adjust their time-to-live values.
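A toy sketch of the propagation idea in this abstract (not the authors' mechanism): when a query is detected as stale, cached queries whose terms overlap it sufficiently get their time-to-live reduced. The similarity measure, threshold and decay factor here are all illustrative assumptions.

```python
def term_overlap(q1, q2):
    """Jaccard similarity over query terms (one plausible similarity choice)."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)

def propagate_expiration(cache, stale_query, threshold=0.5, decay=0.5):
    """Evict the stale query and shrink the TTL of sufficiently similar
    cached queries so that their results are refreshed earlier."""
    cache.pop(stale_query, None)
    for q in cache:
        if term_overlap(stale_query, q) >= threshold:
            cache[q] *= decay
    return cache
```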
Semantics-Driven Implicit Aspect Detection in Consumer Reviews BIBAFull-Text 109-110
  Kim Schouten; Nienke de Boer; Tjian Lam; Marijtje van Leeuwen; Ruud van Luijk; Flavius Frasincar
With consumer reviews becoming a mainstream part of e-commerce, a good method of detecting the product or service aspects that are discussed is desirable. This work focuses on detecting aspects that are not literally mentioned in the text, i.e., implicit aspects. To this end, a co-occurrence matrix of WordNet synsets and implicit aspects is constructed. The semantic relations between synsets in WordNet are exploited to enrich the co-occurrence matrix with more contextual information. Comparing this method with a similar method that is not semantics-driven clearly shows the benefit of the proposed method. Corpora of limited size especially seem to benefit from the added semantic context.
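The co-occurrence core of this approach can be sketched as below. For simplicity the sketch counts words rather than WordNet synsets and omits the semantic enrichment step; the training examples are hypothetical.

```python
from collections import defaultdict

def build_cooccurrence(annotated_sentences):
    """Count how often each word (standing in for a WordNet synset here)
    co-occurs with each annotated implicit aspect."""
    co = defaultdict(lambda: defaultdict(int))
    for words, aspects in annotated_sentences:
        for w in words:
            for a in aspects:
                co[w][a] += 1
    return co

def detect_implicit_aspect(co, sentence_words):
    """Score each aspect by its summed co-occurrence with the sentence's
    words; return the best-scoring aspect, or None if nothing matches."""
    scores = defaultdict(int)
    for w in sentence_words:
        for a, n in co[w].items():
            scores[a] += n
    return max(scores, key=scores.get) if scores else None
```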
AutoRec: Autoencoders Meet Collaborative Filtering BIBAFull-Text 111-112
  Suvash Sedhain; Aditya Krishna Menon; Scott Sanner; Lexing Xie
This paper proposes AutoRec, a novel autoencoder framework for collaborative filtering (CF). Empirically, AutoRec's compact and efficiently trainable model outperforms state-of-the-art CF techniques (biased matrix factorization, RBM-CF and LLORMA) on the Movielens and Netflix datasets.
Generating Quiz Questions from Knowledge Graphs BIBAFull-Text 113-114
  Dominic Seyler; Mohamed Yahya; Klaus Berberich
We propose an approach to generate natural language questions from knowledge graphs such as DBpedia and YAGO. We stage this in the setting of a quiz game, though our approach is general enough to be applicable in other settings. Given a topic of interest (e.g., Soccer) and a difficulty (e.g., hard), our approach selects a query answer, generates a SPARQL query having the answer as its sole result, and finally verbalizes the question.
Measuring and Characterizing Nutritional Information of Food and Ingestion Content in Instagram BIBAFull-Text 115-116
  Sanket S. Sharma; Munmun De Choudhury
Social media sites like Instagram have emerged as popular platforms for sharing ingestion and dining experiences. However, research on characterizing the nutritional information embedded in such content is limited. In this paper, we develop a computational method to extract nutritional information, specifically calorific content, from Instagram food posts. Next, we explore how the community reacts to healthy versus non-healthy food postings. Based on a crowdsourced evaluation, our method detects calorific content in posts with 89% accuracy. We further show that Instagram is a platform where sharing moderately healthy food content is common, and that such content also receives the most support from the community.
Helping Users Understand Their Web Footprints BIBAFull-Text 117-118
  Lisa Singh; Hui Yang; Micah Sherr; Yifang Wei; Andrew Hian-Cheong; Kevin Tian; Janet Zhu; Sicong Zhang; Tavish Vaidya; Elchin Asgarli
To help users better understand the potential risks of publishing data publicly, and the types of data that can be inferred by combining data from multiple online sources, we introduce a novel information-exposure detection framework that generates and analyzes the web footprints users leave across the social web. We use probabilistic operators, free-text attribute extraction, and a population-based inference engine to generate the web footprints. Evaluation over public profiles from multiple sites shows that our framework successfully detects and quantifies information exposure from a small amount of non-sensitive initial knowledge.
Detecting Concept-level Emotion Cause in Microblogging BIBAFull-Text 119-120
  Shuangyong Song; Yao Meng
In this paper, we propose a Concept-level Emotion Cause Model (CECM), instead of the mere word-level models, to discover the causes of microblogging users' diversified emotions about specific hot events. A modified topic-supervised biterm topic model is used in CECM to detect "emotion topics" in event-related tweets, and context-sensitive topical PageRank is then used to detect meaningful multi-word expressions as emotion causes. Experimental results on a dataset from Sina Weibo, one of the largest microblogging websites in China, show that CECM detects emotion causes better than baseline methods.
Topical Word Importance for Fast Keyphrase Extraction BIBAFull-Text 121-122
  Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
We propose an improvement on a state-of-the-art keyphrase extraction algorithm, Topical PageRank (TPR), which incorporates topical information from topic models. While the original algorithm requires a random walk for each topic in the topic model, ours is independent of the topic model, computing only a single PageRank for each text regardless of the number of topics in the model. This increases the speed drastically and enables use on large collections of text with vast topic models, without altering the performance of the original algorithm.
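One way to picture a single-run variant of TPR, as a hedged sketch rather than the authors' exact formulation: fold per-word topical importance into the reset (personalization) distribution of one PageRank over the word graph, so only one run is needed per text. The toy graph and importance scores below are hypothetical.

```python
def single_run_topical_pagerank(graph, word_importance, damping=0.85, iters=50):
    """Personalized PageRank over a word graph in which the reset
    distribution is proportional to each word's topical importance:
    one run per document instead of one run per topic."""
    total = sum(word_importance.values())
    reset = {w: word_importance[w] / total for w in graph}
    rank = dict(reset)
    for _ in range(iters):
        rank = {w: (1 - damping) * reset[w]
                + damping * sum(rank[u] / len(graph[u])
                                for u in graph if w in graph[u])
                for w in graph}
    return rank
```

Topically unimportant words (e.g. stopwords) receive little reset mass and therefore low rank, which is what pushes keyphrase candidates to the top.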
When Topic Models Disagree: Keyphrase Extraction with Multiple Topic Models BIBAFull-Text 123-124
  Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
We explore how the unsupervised extraction of topic-related keywords benefits from combining multiple topic models. We show that averaging multiple topic models, inferred from different corpora, leads to more accurate keyphrases than using a single topic model or other state-of-the-art techniques. The experiments confirm the intuitive idea that a prerequisite for a significant benefit from combining multiple models is that the models be sufficiently different, i.e., that they provide distinct contexts in terms of topical word importance.
Modeling User Activities on the Web using Paragraph Vector BIBAFull-Text 125-126
  Yukihiro Tagami; Hayato Kobayashi; Shingo Ono; Akira Tajima
Modeling user activities on the Web is a key problem for various Web services, such as news article recommendation and ad click prediction. In this paper, we propose an approach that summarizes each sequence of user activities using the Paragraph Vector, treating users and activities as paragraphs and words, respectively. The learned user representations are shared across the user-related prediction tasks. We evaluate this approach on two data sets based on logs from Web services of Yahoo! JAPAN. Experimental results demonstrate the effectiveness of our proposed methods.
Lights, Camera, Action: Knowledge Extraction from Movie Scripts BIBAFull-Text 127-128
  Niket Tandon; Gerhard Weikum; Gerard de Melo; Abir De
With the success of large knowledge graphs, research on automatically acquiring commonsense knowledge has been revived. One kind of knowledge that has not received attention is that of human activities. This paper presents an information extraction pipeline for systematically distilling activity knowledge from a corpus of movie scripts. Our semantic frames capture activities together with their participating agents and their typical spatial, temporal and sequential contexts. The resulting knowledge base comprises about 250,000 activities with links to the specific movie scenes where they occur.
Assessing the Reliability of Facebook User Profiling BIBAFull-Text 129-130
  Thomas Theodoridis; Symeon Papadopoulos; Yiannis Kompatsiaris
User profiling is an essential component of most modern online services offered upon user registration. Profiling typically involves tracking and processing users' online traces (e.g., page views/clicks) with the goal of inferring attributes of interest about them. The primary motivation behind profiling is to improve the effectiveness of advertising by targeting users with ads selected according to their profile attributes, e.g., interests, demographics, etc. Yet, there has been an increasing number of cases where the advertising content users are exposed to is either irrelevant or impossible to explain based on their online activities. More disturbingly, automatically inferred user attributes are often used to make real-world decisions (e.g., job candidate selection) without users' knowledge. We argue that many of these errors are inherent in the underlying user profiling process. To this end, we attempt to quantify the extent of such errors, focusing on a dataset of Facebook users and their likes, and conclude that profiling-based targeting is highly unreliable for a sizeable subset of users.
Modelling Time-aware Search Tasks for Search Personalisation BIBAFull-Text 131-132
  Thanh Tien Vu; Alistair Willis; Dawei Song
Recent research has shown that mining and modelling search tasks helps improve the performance of search personalisation. Some approaches model a search task using topics discussed in relevant documents, where the topics are usually obtained from a human-generated online ontology such as the Open Directory Project. A limitation of these approaches is that many documents may not contain the topics covered in the ontology. Moreover, previous studies have largely ignored the dynamic nature of search tasks: over time, the search intent and user interests may also change.
   This paper addresses these problems by modelling search tasks with time-awareness, using latent topics that are automatically extracted from the task's relevant documents by an unsupervised topic modelling method (Latent Dirichlet Allocation). In the experiments, we use the time-aware search tasks to re-rank the result lists returned by a commercial search engine and demonstrate a significant improvement in ranking quality.
Rethink Targeting: Detect 'Smart Cheating' in Online Advertising through Causal Inference BIBAFull-Text 133-134
  Pengyuan Wang; Dawei Yin; Jian Yang; Yi Chang; Marsha Meytlis
In online advertising, a central question of ad campaign assessment is whether the ad truly adds value for the advertisers. To measure the incremental effect of ads, the ratio of the success rates of users who were and were not exposed to ads is usually taken to represent ad effectiveness. Many existing campaigns simply target users with a high predicted success (e.g., purchase, search) rate, neglecting the fact that even without ad exposure the targeted users might still perform the success actions, and hence show a higher ratio than the true ad effectiveness. We call this phenomenon 'smart cheating'. Failing to discount smart cheating when assessing ad campaigns may favor the targeting plan that cheats hardest, but such targeting does not maximize incremental success actions and results in wasted budget. In this paper we define and quantify smart cheating with a smart cheating ratio (SCR) obtained through causal inference. Applying our approach to multiple real ad campaigns, we find that smart cheating exists extensively and can be rather severe in the current advertising industry.
Questions vs. Queries in Informational Search Tasks BIBAFull-Text 135-136
  Ryen W. White; Matthew Richardson; Wen-tau Yih
Search systems traditionally require searchers to formulate information needs as keywords rather than in a more natural form, such as questions. Recent studies have found that Web search engines are observing an increase in the fraction of queries phrased in natural language. As part of building better search engines, it is important to understand the nature and prevalence of question intents and the impact of this increase on search engine performance. In this work, we show that while 10.3% of queries issued to a search engine have direct question intent, only 3.2% of them are formulated as natural language questions. We investigate whether search engines perform better when search intent is stated as a query or as a question, and find that they perform equally well on both.
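The distinction the abstract draws between question intent and question form can be illustrated with a crude surface heuristic (an assumption of this sketch, not the paper's method, which measures intent from labeled data): a query is in question *form* if it starts with a question word or ends with a question mark, regardless of its underlying intent.

```python
# Hypothetical surface heuristic: flag queries formulated as natural
# language questions, as opposed to keyword queries.
QUESTION_STARTERS = {"who", "what", "when", "where", "why", "how",
                     "is", "are", "can", "does", "do", "did"}

def is_question_form(query):
    tokens = query.lower().rstrip("?").split()
    return bool(tokens) and (tokens[0] in QUESTION_STARTERS
                             or query.endswith("?"))

def question_share(queries):
    """Fraction of a query log formulated as natural language questions."""
    return sum(map(is_question_form, queries)) / len(queries)
```

A query like "mount everest height" carries question intent but not question form, which is exactly the gap between the paper's 10.3% and 3.2% figures.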
Why Do You Follow Him?: Multilinear Analysis on Twitter BIBAFull-Text 137-138
  Yuto Yamaguchi; Mitsuo Yoshida; Christos Faloutsos; Hiroyuki Kitagawa
Why does Smith follow Johnson on Twitter? In most cases, the reason why users follow other users is unavailable. In this work, we answer this question by proposing TagF, which analyzes the who-follows-whom network (a matrix) and the who-tags-whom network (a tensor) simultaneously. Concretely, our method decomposes a coupled tensor constructed from this matrix and tensor. Experimental results on million-scale Twitter networks show that TagF uncovers different, but explainable, reasons why users follow other users.
Topic-aware Social Influence Minimization BIBAFull-Text 139-140
  Qipeng Yao; Ruisheng Shi; Chuan Zhou; Peng Wang; Li Guo
In this paper, we address the problem of minimizing the negative influence of undesirable things in a network by blocking a limited number of nodes, from a topic modeling perspective. When an undesirable thing such as a rumor or an infection emerges in a social network and some users have already been infected, our goal is to minimize the number of ultimately infected users by blocking k nodes outside the infected set. We first employ HDP-LDA and KL divergence to analyze influence and relevance from a topic modeling perspective. We then propose two topic-aware heuristics, based on betweenness and out-degree, for finding approximate solutions to this problem. Using two real networks, we demonstrate experimentally the high performance of the proposed models and learning schemes.
Topic-aware Source Locating in Social Networks BIBAFull-Text 141-142
  Wenyu Zang; Chuan Zhou; Li Guo; Peng Zhang
In this paper we address the problem of source locating in social networks from a topic modeling perspective. From the observation that the topic factor can help infer the propagation paths, we propose a topic-aware source locating method based on topic analysis of propagation items and participants. We evaluate our algorithm on both generated and real-world datasets. The experimental results show significant improvement over existing popular methods.
Towards Entity Correctness, Completeness and Emergence for Entity Recognition BIBAFull-Text 143-144
  Lei Zhang; Yunpeng Dong; Achim Rettinger
Linking words or phrases in unstructured text to entities in knowledge bases is the problem of entity recognition and disambiguation. In this paper, we focus on the task of entity recognition in Web text to address the challenges of entity correctness, completeness and emergence, from which existing approaches mainly suffer. Experimental results show that our approach significantly outperforms the state-of-the-art approaches in terms of precision, F-measure, micro-accuracy and macro-accuracy, while still preserving high recall.
Identifying Regrettable Messages from Tweets BIBAFull-Text 145-146
  Lu Zhou; Wenbo Wang; Keke Chen
Inappropriate tweets may cause severe damage to the author's reputation or privacy. However, many users do not realize the potential damage when publishing such tweets. Published tweets have lasting effects that may not be completely eliminated by simple deletion, because other users may have read them or third-party tweet analysis platforms may have cached them. In this paper, we study the problem of identifying regrettable tweets from normal individual users, with the ultimate goal of reducing the occurrence of regrettable tweets. We explore the contents of a set of tweets deleted by a sample of normal users to understand regrettable tweets. With a set of features describing the identifiable reasons, we develop classifiers that effectively distinguish regrettable tweets from normal tweets.


Who are the American Vegans related to Brad Pitt?: Exploring Related Entities BIBAFull-Text 151-154
  Nitish Aggarwal; Kartik Asooja; Housam Ziad; Paul Buitelaar
In this demo, we present the Entity Relatedness Graph (EnRG), a focused related-entities explorer, which provides users with a dynamic set of filters and facets. It gives a ranked list of entities related to a given entity, and clusters them using the different filters. For instance, using EnRG, one can easily find the American vegans related to Brad Pitt or the Irish universities related to the Semantic Web. Moreover, EnRG helps a user discover the provenance of implicit relations between two entities. EnRG uses distributional semantics to obtain relatedness scores between two entities.
TeMex: The Web Template Extractor BIBAFull-Text 155-158
  Julián Alarte; David Insa; Josep Silva; Salvador Tamarit
This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic and can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed); more importantly, it does not need a predefined set of webpages to perform the analysis. TeMex only needs a URL. Contrary to previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.
Roomba: Automatic Validation, Correction and Generation of Dataset Metadata BIBAFull-Text 159-162
  Ahmad Assaf; Aline Senart; Raphaël Troncy
Data is being published by both the public and private sectors and covers a diverse set of domains, ranging from life sciences to media and government data. An example is the Linked Open Data (LOD) cloud, which is potentially a gold mine for organizations and individuals trying to leverage external data sources in order to make more informed business decisions. Considering the significant variation in size, the languages used, and the freshness of the data, one realizes that spotting spam datasets or simply finding useful datasets without prior knowledge is increasingly complicated. In this paper, we propose Roomba, a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. While Roomba is generic, we target CKAN-based data portals and validate our approach against a set of open data portals, including the Linked Open Data (LOD) cloud as viewed on the DataHub. The results demonstrate that the general state of various datasets and groups, including the LOD cloud group, needs more attention, as most datasets suffer from poor-quality metadata and lack informative metrics that are required to facilitate dataset search.
AutoTag 'n Search My Photos: Leveraging the Social Graph for Photo Tagging BIBAFull-Text 163-166
  Shobana Balakrishnan; Surajit Chaudhuri; Vivek Narasayya
Personal photo collections are large and growing rapidly. Today, it is difficult to search such a collection for the people who appear in the photos, since it is tedious to manually associate face tags with photos. Our key idea is to learn face models for the user's friends and family, using tagged photos in a social graph such as Facebook as training examples. These face models are then used to automatically tag photos in the collection, thereby making it more searchable and easier to organize. To illustrate this idea, we have developed a Windows app called AutoTag 'n Search My Photos. In this demo paper we describe the architecture, user interaction and controls, and our initial learnings from deploying the app.
Two New Gestures to Zoom: Enhancing Online Maps Services BIBAFull-Text 167-170
  Alessio Bellino
Online services such as Google Maps or OpenStreetMap allow the exploration of maps on smartphones and tablets. The gestures used are the pinch, to adjust the zoom level, and the drag, to move the map. In this paper, two new gestures to adjust the zoom level of maps are presented. Both gestures, with slight differences, allow the identification of a target area to zoom, which is automatically enlarged to cover the whole map container. The proposed gestures are added to the traditional ones (drag, pinch and flick) without any overlap. Therefore, users do not need to change their regular practices; they simply have two more options to control the zoom level. One of the most relevant and appreciated advantages has to do with the gesture for smartphones (Tap&Tap), which allows users to control the zoom level with just one hand, whereas the traditional pinch gesture needs two. According to the test results comparing the new gestures with the traditional pinch, 30% of the time is saved on tablets (Two-Finger-Tap gesture) and 14% on smartphones (Tap&Tap gesture).
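The paper does not give the zoom computation, but assuming the usual web-map convention that tile scale doubles per zoom level, the zoom-level change needed to make a tapped target area fill the container could be sketched as:

```python
import math

def zoom_for_target(container_w, container_h, target_w, target_h):
    """Zoom-level change so a selected target area grows to fill
    the map container, assuming scale doubles per zoom level.
    The floor keeps the whole target visible after zooming."""
    scale = min(container_w / target_w, container_h / target_h)
    return math.floor(math.log2(scale))
```

For example, a 256x128 target in a 1024x1024 container can be magnified 4x before its wider dimension overflows, i.e. two zoom levels.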
Champagne: A Web Tool for the Execution of Crowdsourcing Campaigns BIBAFull-Text 171-174
  Carlo Bernaschina; Ilio Catallo; Piero Fraternali; Davide Martinenghi; Marco Tagliasacchi
We present Champagne, a web tool for the execution of crowdsourcing campaigns. Through Champagne, task requesters can model crowdsourcing campaigns as a sequence of choices regarding different, independent crowdsourcing design decisions. Such decisions include, e.g., the possibility of qualifying some workers as expert reviewers, or of combining different quality assurance techniques to be used during campaign execution. In this regard, a walkthrough example showcasing the capabilities of the platform is reported. Moreover, we show that our modular approach to the design of campaigns overcomes many of the limitations exhibited by the major platforms available on the market.
Social Glass: A Platform for Urban Analytics and Decision-making Through Heterogeneous Social Data BIBAFull-Text 175-178
  Stefano Bocconi; Alessandro Bozzon; Achilleas Psyllidis; Christiaan Titos Bolivar; Geert-Jan Houben
This demo presents Social Glass, a novel web-based platform that supports the analysis, valorisation, integration, and visualisation of large-scale and heterogeneous urban data in the domains of city planning and decision-making. The platform systematically combines publicly available social datasets from municipalities with social media streams (e.g. Twitter, Instagram and Foursquare) and resources from knowledge repositories. It further enables the mapping of demographic information, human movement patterns, place popularity, traffic conditions, as well as citizens' and visitors' opinions and preferences with regard to specific venues in the city. Social Glass will be demonstrated through several real-world case studies that exemplify the framework's conceptual properties and its potential value as a solution for urban analytics and city-scale event monitoring and assessment.
Browser Record and Replay as a Building Block for End-User Web Automation Tools BIBAFull-Text 179-182
  Sarah Chasins; Shaon Barman; Rastislav Bodik; Sumit Gulwani
To build a programming by demonstration (PBD) web scraping tool for end users, one needs two central components: a list finder, and a record and replay tool. A list finder extracts logical tables from a webpage. A record and replay (R+R) system records a user's interactions with a webpage, and replays them programmatically. The research community has invested substantial work in list finding -- variously called wrapper induction, structured data extraction, and template detection. In contrast, researchers largely considered the browser R+R problem solved until recently, when webpage complexity and interactivity began to rise. We argue that the increase in interactivity necessitates the use of new, more robust R+R approaches, which will facilitate the PBD web tools of the future. Because robust R+R is difficult to build and understand, we argue that tool developers need an R+R layer that they can treat as a black box. We have designed an easy-to-use API that allows programmers to use and even customize R+R, without having to understand R+R internals. We have instantiated our API in Ringer, our robust R+R tool. We use the API to implement WebCombine, a PBD scraping tool. A WebCombine user demonstrates how to collect the first row of a relational dataset, and the tool collects all remaining rows. WebCombine uses the Ringer API to handle navigation between pages, enabling users to scrape from modern, interaction-heavy pages. We demonstrate WebCombine by collecting a 3,787,146 row dataset from Google Scholar that allows us to explore the relationship between researchers' years of experience and their papers' citation counts.
DiagramFlyer: A Search Engine for Data-Driven Diagrams BIBAFull-Text 183-186
  Zhe Chen; Michael Cafarella; Eytan Adar
A large amount of data is available only through data-driven diagrams such as bar charts and scatterplots. These diagrams are stylized mixtures of graphics and text and are the result of complicated data-centric production pipelines. Unfortunately, neither text nor image search engines exploit these diagram-specific properties, making it difficult for users to find relevant diagrams in a large corpus. In response, we propose DiagramFlyer, a search engine for finding data-driven diagrams on the web. By recovering the semantic roles of diagram components (e.g., axes, labels, etc.), we provide faceted indexing and retrieval for various statistical diagrams. A unique feature of DiagramFlyer is that it is able to "expand" queries to include not only exactly matching diagrams, but also diagrams that are likely to be related in terms of their production pipelines. We demonstrate the resulting search system by indexing over 300k images pulled from over 150k PDF documents.
Smith Search: Opinion-Based Restaurant Search Engine BIBAFull-Text 187-190
  Jaehoon Choi; Donghyeon Kim; Donghee Choi; Sangrak Lim; Seongsoon Kim; Jaewoo Kang; Youngjae Choi
Search engines have become an important decision-making tool today. Unfortunately, they still need to improve in answering complex queries. The answers to complex decision-making queries such as "best burgers and fries" and "good restaurants for anniversary dinner" are often subjective. The most relevant answer to such a query can be obtained only by collecting people's opinions about the query, which are expressed in various venues on the Web. Collected opinions are converted into a "consensus" list. All of this should be processed at query time, which is impossible under the current search paradigm. To address this problem, we introduce Smith, a novel opinion-based restaurant search engine. Smith actively processes opinions on the Web, blogs, review boards, and other forms of social media at index time, and produces consensus answers from opinions at query time. The Smith search app (iOS) is available for download at http://www.smithsearches.com/introduction/.
whoVIS: Visualizing Editor Interactions and Dynamics in Collaborative Writing Over Time BIBAFull-Text 191-194
  Fabian Flöck; Maribel Acosta
The visualization of editor interaction dynamics and of the provenance of content in revisioned, collaboratively written documents has the potential to allow for more transparency and intuitive understanding of the intricate mechanisms inherent to collective content production. Although approaches exist to build editor interactions from individual word changes in Wikipedia articles, they do not allow users to inquire into individual interactions, and have yet to be implemented as usable end-user tools.
   We thus present whoVIS, a web tool to mine and visualize editor interactions in Wikipedia over time. whoVIS integrates novel features with existing methods, tailoring them to the use case of understanding intra-article disagreement between editors. Using real Wikipedia examples, our system demonstrates the combination of various visualization techniques to identify different social dynamics and explore the evolution of an article that would be particularly hard for end-users to investigate otherwise.
VizCurator: A Visual Tool for Curating Open Data BIBAFull-Text 195-198
  Bahar Ghadiri Bashardoost; Christina Christodoulakis; Soheil Hassas Yeganeh; Renée J. Miller; Kelly Lyons; Oktie Hassanzadeh
VizCurator permits the exploration, understanding and curation of open RDF data, its schema, and how it has been linked to other sources. We provide visualizations that enable one to seamlessly navigate through the RDFS and RDF layers and quickly understand the open data, how it has been mapped or linked, how it has been structured (and could be restructured), and how deeply it has been related to other open data sources. More importantly, VizCurator provides a rich set of tools for data curation. It suggests possible improvements to the structure of the data and enables curators to make informed decisions about enhancements to the exploration and exploitation of the data. Moreover, VizCurator facilitates the mining of temporal resources and the definition of temporal constraints through which the curator can identify conflicting facts. Finally, VizCurator can be used to create new binary temporal relations by reifying base facts and linking them to temporal resources. We will demonstrate VizCurator using LinkedCT.org, a five-star open dataset mapped from the XML NIH clinical trials data (clinicaltrials.gov), which we have been maintaining and curating for several years.
queryCategorizr: A Large-Scale Semi-Supervised System for Categorization of Web Search Queries BIBAFull-Text 199-202
  Mihajlo Grbovic; Nemanja Djuric; Vladan Radosavljevic; Narayan Bhamidipati; Jordan Hawker; Caleb Johnson
Understanding the interests expressed through a user's search queries is a task of critical importance for many internet applications. To help identify user interests, web search engines commonly classify queries into one or more pre-defined interest categories. However, the majority of queries are noisy short texts, making accurate classification a challenging task. In this demonstration, we present queryCategorizr, a novel semi-supervised learning system that embeds queries into a low-dimensional vector space using a neural language model applied to search log sessions, and classifies them into general interest categories while relying on only a small set of labeled queries. Empirical results on large-scale data show that queryCategorizr outperforms current state-of-the-art approaches. In addition, we describe a Graphical User Interface (GUI) that allows users to query the system and explore classification results interactively.
DIVINA: Discovering Vulnerabilities of Internet Accounts BIBAFull-Text 203-206
  Ziad Ismail; Danai Symeonidou; Fabian Suchanek
Internet users typically have several online accounts -- such as mail accounts, cloud storage accounts, or social media accounts. The security of these accounts is often intricately linked: The password of one account can be reset by sending an email to another account; the data of one account can be backed up on another account; one account can only be accessed by two-factor authentication through a second account; and so forth. This poses three challenges: First, if a user loses one or several of his passwords, can he still access his data? Second, how many passwords does an attacker need in order to access the data? And finally, how many passwords does an attacker need in order to irreversibly delete the user's data? In this paper, we model the dependencies of online accounts in order to help the user discover security weaknesses. We have implemented our system and invite users to try it out on their real accounts.
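The account dependencies described above can be pictured as reachability over a graph of "can reset or access" edges. The following is a minimal sketch of that idea (the edge semantics are simplified from the paper's richer model, which also covers backups and two-factor authentication):

```python
def accessible(reset_edges, known):
    """Accounts an attacker (or a locked-out user) can reach:
    start from accounts whose passwords are known and follow
    'can reset the password of' edges transitively.

    reset_edges: dict account -> list of accounts it can reset
    known: set of accounts whose passwords are in hand
    """
    reached = set(known)
    frontier = list(known)
    while frontier:
        a = frontier.pop()
        for b in reset_edges.get(a, ()):
            if b not in reached:
                reached.add(b)
                frontier.append(b)
    return reached
```

Running this from each single account answers questions like "how many passwords suffice to take over everything?": any account from which all others are reachable is a single point of failure.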
SmartComposition: Enhanced Web Components for a Better Future of Web Development BIBAFull-Text 207-210
  Michael Krug; Martin Gaedke
In this paper, we introduce the use of enhanced Web Components to create web applications with multi-device capabilities by composition. By using the latest developments in the family of W3C standards called "Web Components", which we extend with dedicated communication and synchronization functionality, web developers are enabled to create web applications with ease. We enhance Web Components with an event-based communication channel that is not limited to a single browser window. With our approach, applications using the extended SmartComponents and an additional synchronization service also support multi-device scenarios. In contrast to other widget-based approaches (W3C Widgets, OpenSocial containers), the use of SmartComponents does not require a dedicated platform, such as Apache Rave. SmartComponents are based on standard web technologies, are natively supported by recent web browsers, and are loosely coupled using our extension. This ensures a high level of reuse. We show how SmartComponents are structured, and how they can be created and used. Furthermore, we explain how the communication aspect is integrated and how multi-device communication is achieved. Finally, we describe our demonstration by outlining two example applications.
Ajax API Self-adaptive Framework for End-to-end User BIBAFull-Text 211-214
  Xiang Li; Zhiyong Feng; Keman Huang; Shizhan Chen
Web developers often use Ajax APIs to build rich Internet applications (RIAs). Due to the uncertainty of the environment, automatically switching among different Ajax APIs with similar functionality is important to guarantee end-to-end performance. However, this is challenging and time-consuming because it requires manually modifying code based on the API documentation. In this paper, we propose a framework to address self-adaptation and the difficulty of invoking Ajax APIs. The Ajax API wrapping model, consisting of specific and abstract components, is proposed to automatically construct the grammatical and functional semantic relations between Ajax APIs. A switching module is then introduced to support automatic switching among different Ajax APIs, according to user preferences and the QoS of the Ajax APIs. Taking map APIs, i.e. Google Map, Baidu Map, Gaode Map, 51 Map and Tencent Map, as an example, the demo shows that the framework can facilitate the construction of RIAs and improve the adaptability of the application. The process of selecting and switching among the different Ajax APIs is automatic and transparent to users.
GalaxyExplorer: Influence-Driven Visual Exploration of Context-Specific Social Media Interactions BIBAFull-Text 215-218
  Xiaotong Liu; Srinivasan Parthasarathy; Han-Wei Shen; Yifan Hu
The ever-increasing size and complexity of social networks pose a fundamental challenge to visual exploration and analysis tasks. In this paper, we present GalaxyExplorer, an influence-driven visual analysis system for exploring users of various influence and analyzing how they influence others in a social network. GalaxyExplorer reduces the size and complexity of a social network by dynamically retrieving theme-based graphs, and by analyzing users' influence and passivity regarding specific themes and dynamics in response to disaster events. In GalaxyExplorer, a galaxy-based visual metaphor is introduced to simplify the visual complexity of a large graph with a focus+context view. Various interactions are supported for visual exploration. We present experimental results on real-world datasets that show the effectiveness of GalaxyExplorer in theme-aware influence analysis.
CubeViz: Exploration and Visualization of Statistical Linked Data BIBAFull-Text 219-222
  Michael Martin; Konrad Abicht; Claus Stadler; Axel-Cyrille Ngonga Ngomo; Tommaso Soru; Sören Auer
CubeViz is a flexible exploration and visualization platform for statistical data represented according to the RDF Data Cube vocabulary. If statistical data is provided adhering to the Data Cube vocabulary, CubeViz exhibits a faceted browsing widget that allows users to interactively filter the observations to be visualized in charts. Based on the selected structural part, CubeViz offers suitable chart types and options for users to configure the visualization. In this demo we present the CubeViz visualization architecture and components, sketch its underlying API, and describe the libraries used to generate the desired output. By employing advanced introspection, analysis and visualization bootstrapping techniques, CubeViz hides the schema complexity of the encoded data in order to support a user-friendly exploration experience.
EXPOSÉ: EXploring Past news fOr Seminal Events BIBAFull-Text 223-226
  Arunav Mishra; Klaus Berberich
Recent increases in digitization and archiving efforts on news data have led to overwhelming amounts of online information for general users, making it difficult for them to look back on past events. One dimension along which past events can be effectively organized is time. Motivated by this idea, we introduce EXPOSÉ, an exploratory search system that explicitly uses the temporal information associated with events to link different kinds of information sources for effective exploration of past events. In this demonstration, we use Wikipedia and news articles as two orthogonal sources. Wikipedia is viewed as an event directory that systematically lists the seminal events of a year; news articles are viewed as a source of detailed information on each of these events. To this end, our demo includes several time-aware retrieval approaches that a user can employ for retrieving relevant news articles, as well as a timeline tool for temporal analysis and entity-based facets for filtering results.
A Serious Game Powered by Semantic Web technologies BIBAFull-Text 227-230
  Bernardo Pereira Nunes; Terhi Nurmikko-Fuller; Giseli Rabello Lopes; Chiara Renso
ISCOOL is an interactive educational platform that helps users develop their skills in objective text analysis and interpretation. The tools incorporated into ISCOOL bridge various disparate sources, including reference datasets for people and organizations, as well as gazetteers, dictionaries and collections of historical facts. This data serves as the basis for educating learners about the processes of evaluating the implicit and implied content of written material, whilst also providing a wider context in which this information is accessed, interpreted and understood. In the course of gameplay, the user is prompted to choose the images that best capture the content of a passage they have read. The interactive features of the game simultaneously test the user's existing knowledge and their ability to critically analyse the text. Results can be saved and shared, allowing the players to continue to interact with the data through conversations with their peers, friends, and family members, and to disseminate information throughout their communities. Users will be able to draw connections between the information they encounter in ISCOOL and their daily realities -- participants are empowered, informed and educated.
Geosocial Search: Finding Places based on Geotagged Social-Media Posts BIBAFull-Text 231-234
  Barak Pat; Yaron Kanza; Mor Naaman
Geographic search -- where the user provides keywords and receives relevant locations depicted on a map -- is a popular web application. Typically, such search is based on static geographic data. However, the abundant geotagged posts in microblogs such as Twitter and in social networks such as Instagram provide contemporary information that can be used to support geosocial search -- geographic search based on user activities in social media. Such search can point out where people talk (or tweet) about different topics. For example, the search results may show where people refer to "jogging", to indicate popular jogging places. The difficulty in implementing such search is that there is no natural partition of the space into "documents" as in ordinary web search. Thus, it is not always clear how to present results and how to rank and filter results effectively. In this paper, we demonstrate a two-step process: first, quickly finding the relevant areas by using an arbitrary indexed partition of the space, and second, applying clustering to the discovered areas to present more accurate results. We introduce a system that utilizes geotagged posts in geographic search and illustrate how different ranking methods can be used, based on the proposed two-step search process. The system demonstrates the effectiveness and usefulness of the approach.
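A toy version of the two-step process can be sketched as follows, with a fixed grid standing in for the arbitrary indexed partition and a naive greedy pass standing in for the system's actual clustering:

```python
from collections import defaultdict

CELL = 0.1  # grid resolution in degrees (illustrative value)

def build_index(posts):
    """Step 1 support: bucket geotagged posts into grid cells, keyed
    by (keyword, cell). posts: iterable of (lat, lon, text)."""
    index = defaultdict(list)
    for lat, lon, text in posts:
        cell = (int(lat // CELL), int(lon // CELL))
        for word in text.lower().split():
            index[(word, cell)].append((lat, lon))
    return index

def search(index, keyword, radius=0.05):
    """Step 2: collect matching points from the index, then greedily
    cluster points that fall within `radius` of a cluster seed."""
    points = [p for (w, _), pts in index.items() if w == keyword
              for p in pts]
    clusters = []
    for lat, lon in points:
        for c in clusters:
            clat, clon = c[0]
            if abs(lat - clat) <= radius and abs(lon - clon) <= radius:
                c.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])
    return clusters
```

Each returned cluster approximates one "place" where the keyword is discussed; ranking methods (e.g. by cluster size) would then order these clusters for display.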
Extracting knowledge from text using SHELDON, a Semantic Holistic framEwork for LinkeD ONtology data BIBAFull-Text 235-238
  Diego Reforgiato Recupero; Andrea Giovanni Nuzzolese; Sergio Consoli; Valentina Presutti; Misael Mongiovì; Silvio Peroni
SHELDON is the first true hybridization of NLP machine reading and the Semantic Web. It extracts RDF data from text using a machine reader: the extracted RDF graphs are compliant with the Semantic Web and Linked Data. It goes further and applies Semantic Web practices and technologies to extend the current human-readable web. The input is a sentence in any language. SHELDON includes different capabilities in order to extend machine reading to Semantic Web data: frame detection, topic extraction, named entity recognition, resolution and coreference, terminology extraction, sense tagging and disambiguation, taxonomy induction, semantic role labeling, type induction, sentiment analysis, citation inference, relation and event extraction, and visualization tools that make use of the JavaScript InfoVis Toolkit and RelFinder. A demo of SHELDON can be seen and used at http://wit.istc.cnr.it/stlab-tools/sheldon.
Cloud WorkBench: Benchmarking IaaS Providers based on Infrastructure-as-Code BIBAFull-Text 239-242
  Joel Scheuner; Jürgen Cito; Philipp Leitner; Harald Gall
Optimizing the deployment of applications in Infrastructure-as-a-Service clouds requires evaluating the costs and performance of different combinations of cloud configurations, which is unfortunately a cumbersome and error-prone process. In this paper, we present Cloud WorkBench (CWB), a concrete implementation of a cloud benchmarking Web service, which fosters the definition of reusable and representative benchmarks. We demonstrate the complete cycle of benchmarking an IaaS service with the sample benchmark SysBench. In contrast to existing work, our system is based on the notion of Infrastructure-as-Code, a state-of-the-art concept for defining IT infrastructure in a reproducible, well-defined, and testable way.
An Overview of Microsoft Academic Service (MAS) and Applications BIBAFull-Text 243-246
  Arnab Sinha; Zhihong Shen; Yang Song; Hao Ma; Darrin Eide; Bo-June (Paul) Hsu; Kuansan Wang
In this paper we describe a new release of a Web-scale entity graph that serves as the backbone of Microsoft Academic Service (MAS), a major production effort that broadens the scope of the namesake vertical search engine, which has been publicly available since 2008 as a research prototype. At the core of MAS is a heterogeneous entity graph comprising six types of entities that model scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from publisher feeds as in the previous effort, this version includes data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine. As a result of the Bing integration, the new MAS graph sees a significant increase in size, with fresh information streaming in automatically as it is discovered by the search engine. In addition, the rich entity relations included in the knowledge base provide additional signals for disambiguating and enriching the entities within and beyond the academic domain. The number of papers indexed by MAS, for instance, has grown from the low tens of millions to 83 million, while maintaining above 95% accuracy based on test data sets derived from academic activities at Microsoft Research. Based on this data set, we demonstrate two scenarios in this work: a knowledge-driven, highly interactive dialog that seamlessly combines a reactive search and proactive suggestion experience, and proactive heterogeneous entity recommendation.
Time-travel Translator: Automatically Contextualizing News Articles BIBAFull-Text 247-250
  Nam Khanh Tran; Andrea Ceroni; Nattiya Kanhabua; Claudia Niederée
Fully understanding an older news article requires context knowledge from the time of the article's creation. Finding information about such context is a tedious and time-consuming task that distracts the reader. Simple contextualization via Wikification is not sufficient here: the retrieved context information has to be time-aware, concise (not full Wikipages) and focused on the coherence of the article topic. In this paper, we present Contextualizer, a web-based system that acquires additional information to support the interpretation of a news article of interest. This requires a mapping, a kind of time-travel translation, between present context knowledge and context knowledge at the time of text creation. For a given article, the system provides a GUI that allows users to highlight keywords of interest, which are then used to construct appropriate queries for retrieving contextualization candidates. Contextualizer exploits different kinds of information, such as temporal similarity and textual complementarity, to re-rank the candidates, and presents them to users in a friendly and interactive web-based interface.
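One simple way to combine textual and temporal signals for re-ranking, purely as an illustration (the exact scoring used by Contextualizer is not specified here; the `tau` decay constant and the candidate fields are assumptions):

```python
import math

def rerank(candidates, article_time, tau=30.0):
    """Re-rank contextualization candidates by combining a text
    relevance score with temporal proximity to the article.

    candidates: list of dicts with 'text' (relevance score) and
                'time' (publication time, in days on some epoch)
    article_time: publication time of the article being read
    tau: decay constant in days -- an assumed parameter
    """
    def score(c):
        temporal = math.exp(-abs(c["time"] - article_time) / tau)
        return c["text"] * temporal
    return sorted(candidates, key=score, reverse=True)
```

Under such a scheme, a moderately relevant document from the article's own period outranks a highly relevant one written years later, which matches the time-aware goal described above.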
Kvasir: Seamless Integration of Latent Semantic Analysis-Based Content Provision into Web Browsing BIBAFull-Text 251-254
  Liang Wang; Sotiris Tasoulis; Teemu Roos; Jussi Kangasharju
The Internet is overloading its users with excessive information flows, so effective content-based filtering becomes crucial for improving user experience and work efficiency. We build Kvasir, a semantic recommendation system, atop latent semantic analysis and other state-of-the-art technologies to seamlessly integrate an automated and proactive content provision service into web browsing. We utilize the power of Apache Spark to scale Kvasir up to a practical Internet service. Herein we present the architecture of Kvasir, along with our solutions to the technical challenges in the actual system implementation.
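The latent-semantic-analysis core that Kvasir builds on can be sketched in a few lines: a plain truncated-SVD projection with cosine ranking, not Kvasir's Spark-based implementation:

```python
import numpy as np

def lsa_similar(term_doc, query_col, rank=2):
    """Project documents into a rank-k latent space via truncated SVD
    and return document indices ranked by cosine similarity to the
    document in column `query_col` of the term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = (np.diag(s[:rank]) @ Vt[:rank]).T   # one row per document
    q = docs[query_col]
    sims = docs @ q / (np.linalg.norm(docs, axis=1)
                       * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)
```

Given the page a user is currently reading as the query document, the top-ranked other documents are the proactive recommendations.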
SemMobi: A Semantic Annotation System for Mobility Data BIBAFull-Text 255-258
  Fei Wu; Hongjian Wang; Zhenhui Li; Wang-Chien Lee; Zhuojie Huang
The wide adoption of mobile devices embedded with modern positioning technology enables the collection of valuable mobility data from users. At the same time, large-scale user-generated data from social media, such as geo-tagged tweets, provide rich semantic information about events and locations. The combination of mobility data and social media data brings opportunities to study the semantics behind people's movement, i.e., to understand why a person travels to a location at a particular time. Previous work has used maps or POI (point of interest) databases as sources of semantics. However, those semantics are static and thus miss important dynamic event information. To provide dynamic semantic annotation, we propose to use contextual social media. More specifically, the semantics could be landmark information (e.g., a museum or an arena) or event information (e.g., sports games or concerts). The SemMobi system implements our annotation method, which has been accepted to the WWW 2015 conference. The method annotates words to each mobility record based on the local density of words, estimated by a kernel density estimation model. The annotated mobility data contain rich and interpretable information, and can therefore benefit applications such as personalized recommendation, targeted advertisement, and movement prediction. Our system is built upon large-scale tweet datasets. A user-friendly interface is designed to support interactive exploration of the results.
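The density-based annotation step can be sketched with a plain Gaussian kernel density estimator. This is a minimal illustration, not the paper's actual estimator: the bandwidth, coordinates and candidate words below are all invented.

```python
import math

def gaussian_kde_density(point, samples, bandwidth=0.01):
    """Estimate local density at `point` from 2-D tweet coordinates
    using a Gaussian kernel with a shared bandwidth."""
    if not samples:
        return 0.0
    px, py = point
    norm = 1.0 / (2 * math.pi * bandwidth ** 2)
    total = 0.0
    for sx, sy in samples:
        d2 = (px - sx) ** 2 + (py - sy) ** 2
        total += norm * math.exp(-d2 / (2 * bandwidth ** 2))
    return total / len(samples)

def annotate(record, word_locations, top_k=2):
    """Rank candidate words for one mobility record (x, y)
    by their locally estimated density."""
    scores = {w: gaussian_kde_density(record, locs)
              for w, locs in word_locations.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy example: tweets mentioning "arena" cluster near the record.
word_locs = {
    "arena":  [(0.0, 0.0), (0.001, 0.001), (-0.001, 0.0)],
    "museum": [(0.5, 0.5), (0.51, 0.49)],
}
print(annotate((0.0, 0.0), word_locs))  # "arena" ranks first
```

A record is labeled with the words whose geo-tagged mentions are densest around it, which is how dynamic event terms can outrank static POI labels.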
Towards An Interactive Keyword Search over Relational Databases BIBAFull-Text 259-262
  Zhong Zeng; Zhifeng Bao; Mong Li Lee; Tok Wang Ling
Keyword search over relational databases has been widely studied for the exploration of structured data in a user-friendly way. However, users typically have limited domain knowledge or are unable to precisely specify their search intention. Existing methods find the minimal units that contain all the query keywords, and largely ignore the interpretation of users' possible search intentions. As a result, users are often overwhelmed with many irrelevant answers. Moreover, without a visually pleasing way to present the answers, users often have difficulty understanding them because of their complex structures. Therefore, we design an interactive and visually pleasing search paradigm called ExpressQ. ExpressQ extends the keyword query language to include keywords that match meta-data, e.g., the names of relations and attributes. These keywords are used to infer the user's search intention. Each possible search intention is represented as a query pattern, whose meaning is described in natural language. Through a series of user interactions, ExpressQ determines the search intention of the user and translates the corresponding query patterns into SQL queries to retrieve answers. The ExpressQ prototype is available at http://expressq.comp.nus.edu.sg.

WebSci Track Papers & Posters

A Study of Distinctiveness in Web Results of Two Search Engines BIBAFull-Text 267-273
  Rakesh Agrawal; Behzad Golshan; Evangelos Papalexakis
Google and Bing have emerged as the diarchy that arbitrates which documents are seen by Web searchers, particularly those desiring English-language documents. We seek to study how distinctive the top results presented to users by the two search engines are. A recent eye-tracking study has shown that web searchers decide whether to look at a document primarily based on the snippet and secondarily on the title of the document on the search result page, and rarely based on the URL of the document. Given that the snippet and title generated by different search engines for the same document are often syntactically different, we first develop tools appropriate for conducting this study. Our empirical evaluation using these tools shows a surprising agreement in the results produced by the two engines for a wide variety of queries. Thus, this study raises the open question of whether it is feasible to design a search engine that would produce results distinct from those of Google and Bing that users would find helpful.
Correlation of Node Importance Measures: An Empirical Study through Graph Robustness BIBAFull-Text 275-281
  Mirza Basim Baig; Leman Akoglu
Graph robustness is a measure of resilience to failures and targeted attacks. A large body of research on robustness focuses on how to attack a given network by deleting a few nodes so as to maximally disrupt its connectedness. As a result, the literature contains a myriad of attack strategies that rank nodes by their relative importance for this task. How different are these strategies? Do they pick similar sets of target nodes, or do they differ significantly in their choices? In this paper, we perform the first large-scale empirical correlation analysis of attack strategies, i.e., the node importance measures that they employ, for graph robustness. We approach this task in three ways: by analyzing similarities based on (i) their overall ranking of the nodes, (ii) the characteristics of the top nodes that they pick, and (iii) the dynamics of disruption that they cause on the network. Our study of 15 different (randomized, local, distance-based, and spectral) strategies on 68 real-world networks reveals surprisingly high correlations among node-attack strategies, consistent across all three types of analysis, and identifies groups of comparable strategies. These findings suggest that some computationally complex strategies can be closely approximated by simpler ones, and a few strategies can be used as a close proxy of the consensus among all of them.
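The first of the three comparisons, overall node rankings, can be sketched as a rank correlation between two importance measures. The two measures and the toy graph below are stand-ins for illustration, not the paper's 15 strategies or 68 networks.

```python
from collections import defaultdict

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two node-importance maps
    over the same node set (ties broken by node id, no tie correction)."""
    nodes = sorted(scores_a)
    def ranks(scores):
        order = sorted(nodes, key=lambda v: -scores[v])
        return {v: r for r, v in enumerate(order)}
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(nodes)
    d2 = sum((ra[v] - rb[v]) ** 2 for v in nodes)
    return 1 - 6 * d2 / (n * (n * n - 1))

def degree(adj):
    return {v: len(nb) for v, nb in adj.items()}

def neighbor_degree_sum(adj):
    """A second local importance measure: sum of neighbors' degrees."""
    deg = degree(adj)
    return {v: sum(deg[m] for m in nb) for v, nb in adj.items()}

# Toy star-plus-path graph.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

rho = spearman(degree(adj), neighbor_degree_sum(adj))
print(rho)  # 1.0: the two measures agree perfectly on this toy graph
```

High rank correlation like this is exactly the kind of evidence the study uses to argue that a cheap measure can substitute for an expensive one.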
Investigating Similarity Between Privacy Policies of Social Networking Sites as a Precursor for Standardization BIBAFull-Text 283-289
  Emma Cradock; David Millard; Sophie Stalla-Bourdillon
The current execution of privacy policies, as a mode of communicating information to users, is unsatisfactory. Social networking sites (SNS) exemplify this issue, attracting growing concerns regarding their use of personal data and its effect on user privacy. This demonstrates the need for more informative policies. However, SNS lack the incentives required to improve policies, which is exacerbated by the difficulties of creating a policy that is both concise and compliant. Standardization addresses many of these issues, providing benefits for users and SNS, although it is only possible if policies share attributes which can be standardized. This investigation used thematic analysis and cross-document structure theory to assess the similarity of attributes between the privacy policies (as available in August 2014) of the six most frequently visited SNS globally. Using the Jaccard similarity coefficient, two types of attribute were measured: the clauses used by SNS and the coverage of forty recommendations made by the UK Information Commissioner's Office. Analysis showed that while similarity in the clauses used was low, similarity in the recommendations covered was high, indicating that SNS use different clauses to convey similar information. The analysis also showed that the low similarity in clauses was largely due to differences in semantics, elaboration and functionality between SNS. Therefore, this paper proposes that the policies of SNS already share attributes, indicating the feasibility of standardization, and five recommendations are made to begin facilitating it, based on the findings of the investigation.
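The similarity measure named in the abstract is straightforward to compute over sets of attributes. A minimal sketch follows; the two attribute sets are invented for illustration, not drawn from the actual policies.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical recommendation coverage for two policies.
policy_a = {"data collected", "retention period", "third parties"}
policy_b = {"data collected", "retention period", "cookies"}
print(jaccard(policy_a, policy_b))  # 0.5
```

Computing this coefficient once over clause sets and once over recommendation-coverage sets yields the two similarity types the study compares.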
TopChurn: Maximum Entropy Churn Prediction Using Topic Models Over Heterogeneous Signals BIBAFull-Text 291-297
  Manirupa Das; Micha Elsner; Arnab Nandi; Rajiv Ramnath
With the rise of social media and news aggregators on the Web, the newspaper industry is faced with a declining subscriber base. To retain customers both online and in print, it is therefore critical to predict and mitigate customer churn. Newspapers typically have heterogeneous sources of valuable data: circulation data, customer subscription information, news content, and search click log data. An ensemble of predictive models over multiple sources faces unique challenges -- ascertaining short-term versus long-term effects of features on churn, and determining mutual information properties across multiple data sources. We present TopChurn, a novel system that uses topic models as a means of extracting dominant features from user complaints and Web data for churn prediction. TopChurn uses a maximum entropy-based approach to identify the features most indicative of subscribers likely to drop their subscription within a specified period of time. We conduct temporal analyses to determine the long-term versus short-term effects of status changes on subscriber accounts, which are included in our temporal models of churn, and topic and sentiment analyses on news and click logs, which are included in our Web models of churn. We then validate our insights via experiments on real data from The Columbus Dispatch, a mainstream daily newspaper, and demonstrate that our churn models significantly outperform baselines for various prediction windows.
Deep Feelings: A Massive Cross-Lingual Study on the Relation between Emotions and Virality BIBAFull-Text 299-305
  Marco Guerini; Jacopo Staiano
This article provides a comprehensive investigation of the relations between the virality of news articles and the emotions they evoke. Virality, in our view, is a phenomenon with many facets, i.e., this generic term comprises several different effects of persuasive communication. By exploiting a high-coverage, bilingual corpus of documents containing metrics of their spread on social networks, as well as massive affective annotation provided by readers, we present a thorough analysis of the interplay between evoked emotions and viral facets. We highlight and discuss our findings in light of a cross-lingual approach: while we discover differences in evoked emotions and corresponding viral effects, we provide preliminary evidence of a generalized explanatory model rooted in the deep structure of emotions: the Valence-Arousal-Dominance (VAD) circumplex. We find that viral facets appear to be consistently affected by particular VAD configurations, and these configurations indicate a clear connection with distinct phenomena underlying persuasive communication.
User Behavior Characterization of a Large-scale Mobile Live Streaming System BIBAFull-Text 307-313
  Zhenyu Li; Gaogang Xie; Mohamed Ali Kaafar; Kave Salamatian
Streaming live content to mobile terminals has become prevalent. While there are extensive measurement studies of non-mobile live streaming (in particular P2P live streaming) and of video-on-demand (both mobile and non-mobile), user behavior in mobile live streaming systems is yet to be explored. This paper relies on over 4 million access logs collected from the PPTV live streaming system to study viewing behavior and user activity patterns, with emphasis on the discrepancies that may arise when users access the live streaming catalog from mobile versus non-mobile terminals. We observe high rates of abandoned viewing sessions for mobile users and identify different reasons for that behavior in 3G- and WiFi-based views. We further examine the structure of sessions abandoned due to connection performance issues from the perspectives of time of day and mobile device type. To understand user patterns, we analyze the user activity distribution, the user geographical distribution, and user arrival/departure rates.
Identity Management and Mental Health Discourse in Social Media BIBAFull-Text 315-321
  Umashanthi Pavalanathan; Munmun De Choudhury
Social media is increasingly being adopted in health discourse. We examine the role played by identity in supporting discourse on socially stigmatized conditions. Specifically, we focus on mental health communities on reddit. We investigate the characteristics of mental health discourse manifested through reddit's characteristic 'throwaway' accounts, which are used as proxies of anonymity. To this end, we propose affective, cognitive, social, and linguistic style measures, drawing from the psychology literature. We observe that mental health discourse from throwaways is considerably disinhibited and exhibits increased negativity, cognitive bias and self-attentional focus, and lowered self-esteem. Throwaways also appear to be six times more prevalent as an identity choice on mental health forums than in other reddit communities. We discuss the implications of our work in guiding mental health interventions, and in the design of online communities that can better cater to the needs of vulnerable populations. We conclude with thoughts on the role of identity manifestation on social media in behavioral therapy.
"Roles for the Boys?": Mining Cast Lists for Gender and Role Distributions over Time BIBAFull-Text 323-329
  Will Radford; Matthias Gallé
Film and television play an important role in popular culture. However, studies that require watching and annotating video are time-consuming and expensive to run at scale. We mine media database cast lists to explore the evolution of different roles over time, focusing on the gender distribution of those roles and how it changes. Finally, we compare real-life census gender distributions to our web-mediated onscreen gender data. We propose that these methodologies are a useful adjunct to traditional analysis, allowing researchers to explore the relationship between online and onscreen gender depictions.
Improving Productivity in Citizen Science through Controlled Intervention BIBAFull-Text 331-337
  Avi Segal; Ya'akov (Kobi) Gal; Robert J. Simpson; Victoria Homsy; Mark Hartswood; Kevin R. Page; Marina Jirotka
The majority of volunteers participating in citizen science projects perform only a few tasks each before leaving the system. We designed an intervention strategy to reduce disengagement in 16 different citizen science projects. Targeted users who had left the system received emails that directly addressed motivational factors that affect their engagement. Results show that participants receiving the emails were significantly more likely to return to productive activity when compared to a control group.
Attention Please! A Hybrid Resource Recommender Mimicking Attention-Interpretation Dynamics BIBAFull-Text 339-345
  Paul Seitlinger; Dominik Kowald; Simone Kopeinik; Ilire Hasani-Mavriqi; Elisabeth Lex; Tobias Ley
Classic resource recommenders like Collaborative Filtering (CF) treat users as being just another entity, neglecting non-linear user-resource dynamics shaping attention and interpretation. In this paper, we propose a novel hybrid recommendation strategy that refines CF by capturing these dynamics. The evaluation results reveal that our approach substantially improves CF and, depending on the dataset, successfully competes with a computationally much more expensive Matrix Factorization variant.
Crowdsourcing the Annotation of Rumourous Conversations in Social Media BIBAFull-Text 347-353
  Arkaitz Zubiaga; Maria Liakata; Rob Procter; Kalina Bontcheva; Peter Tolmie
Social media are frequently rife with rumours, and studying the conversational aspects of rumours can provide valuable knowledge about how rumours evolve over time and are discussed by others who support or deny them. In this work, we present a new annotation scheme for capturing rumour-bearing conversational threads, as well as the crowdsourcing methodology used to create high-quality, human-annotated datasets of rumourous conversations from social media. The rumour annotation scheme is validated through comparison between crowdsourced and reference annotations. We also find that only a third of the tweets in rumourous conversations contribute towards determining the veracity of rumours, which reinforces the need to develop methods that extract the relevant pieces of information automatically.
Viral Misinformation: The Role of Homophily and Polarization BIBFull-Text 355-356
  Alessandro Bessi; Fabio Petroni; Michela Del Vicario; Fabiana Zollo; Aris Anagnostopoulos; Antonio Scala; Guido Caldarelli; Walter Quattrociocchi
Modelling Question Selection Behaviour in Online Communities BIBAFull-Text 357-358
  Grégoire Burel; Paul Mulholland; Yulan He; Harith Alani
The value of online Question Answering (Q&A) communities is driven by the question-answering behaviour of their members. Finding the questions that members are willing to answer is therefore vital to the efficient operation of such communities. In this paper, we aim to identify the parameters that correlate with such behaviours. We train different models and construct effective predictions using various user, question and thread feature sets. We show that answering behaviour can be predicted with a high level of success.
Linked Ethnographic Data: From Theory to Practice BIBAFull-Text 359-360
  Dominic DiFranzo; Marie Joan Kristine Gloria; James Hendler
As Web Science continues to mix methods from the many disciplines that study the web, we must begin to seriously look at mixing and linking data across the Qualitative and Quantitative divide. A large difficulty in this is in modeling and archiving Qualitative data. In this paper, we outline what these difficulties are in detail with a focus on the data practices of Ethnography. We describe how linked data technologies can address these issues. We demonstrate this with a case study in modeling data from audio interviews that were taken in an ethnographic study conducted in our lab. We conclude with a discussion on future work that needs to be done to better equip researchers with these tools and methods.
Social Networking by Proxy: Analysis of Dogster, Catster and Hamsterster BIBAFull-Text 361-362
  Daniel Dünker; Jérôme Kunegis
Online pet social networks provide a unique opportunity to study an online social network in which a single user manages multiple user profiles, i.e. one for each pet they own. These types of multi-profile networks allow us to investigate two questions: (1) What is the relationship between the pet-level and human-level network, and (2) what is the relationship between friendship links and family ties? Concretely, we study the online pet social networks Catster, Dogster and Hamsterster, and show how the networks on the two levels interact, and perform experiments to find out whether knowledge about friendships on a profile-level alone can be used to predict which users are behind which profile.
Web as Corpus Supporting Natural Language Generation for Online River Information Communication BIBAFull-Text 363-364
  Xiwu Han; Antonio A. R. Ioris; Chenghua Lin
Using the web as a corpus for NLP has become popular; we now employ the web as a corpus for NLG, making the online communication of tailored river information more effective and efficient. Evaluation and analysis show that our generated texts are comparable to those written by domain experts and experienced users.
Modeling the Evolution of User-generated Content on a Large Video Sharing Platform BIBAFull-Text 365-366
  Rishabh Mehrotra; Prasanta Bhattacharya
Video sharing and entertainment websites have rapidly grown in popularity and now constitute some of the most visited websites on the Internet. Despite the high usage and user engagement, most recent research on online media platforms has restricted itself to networking-based social media sites like Facebook or Twitter. The current study is among the first to perform a large-scale empirical analysis using longitudinal video upload data from one of the largest online video sites. Unlike previous studies in the online media space that have focused exclusively on demand-side research questions, we model the supply side of the crowd-contributed video ecosystem on this platform. The modeling and subsequent prediction of video uploads is complicated by the heterogeneity of video types (e.g., popular vs. niche video genres) and by inherent time-trend effects. We identify distinct genre clusters in our dataset and employ a self-exciting Hawkes point-process model on each cluster to fully specify and estimate the video upload process. Our findings show that with a relatively parsimonious point-process model, we achieve a higher model fit and predict video uploads to the platform more accurately than competing models.
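The self-exciting dynamics of a Hawkes process can be illustrated with the standard exponential-kernel intensity function; all parameter values and event times below are hypothetical, not fitted to the paper's data.

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of a self-exciting Hawkes process with an
    exponential kernel:
        lambda(t) = mu + sum_{t_i < t} alpha * beta * exp(-beta * (t - t_i))
    where mu is the base rate, alpha the branching ratio, and beta the decay."""
    return mu + sum(alpha * beta * math.exp(-beta * (t - ti))
                    for ti in events if ti < t)

# Uploads in one genre cluster excite further uploads shortly after.
uploads = [0.0, 0.1, 0.2]
busy = hawkes_intensity(0.3, uploads, mu=0.5, alpha=0.8, beta=2.0)
quiet = hawkes_intensity(5.0, uploads, mu=0.5, alpha=0.8, beta=2.0)
print(busy > quiet)  # True: intensity spikes right after a burst, then decays
```

Fitting (mu, alpha, beta) per genre cluster, as the abstract describes, captures both the baseline upload rate and the burstiness of each genre.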
Remix in 3D Printing: What your Sources say About You BIBAFull-Text 367-368
  Spiros Papadimitriou; Evangelos Papalexakis; Bin Liu; Hui Xiong
Concurrently with the recent, rapid adoption of 3D printing technologies, online sharing of 3D-printable designs is growing equally rapidly, even though it has received far less attention. We study remix relationships on Thingiverse, the dominant online repository and social network for 3D printing. We collected data of designs published over five years, and we find that remix ties exhibit both homophily and inverse-homophily across numerous key metrics, which is stronger compared to other kinds of social and content links. This may have implications on graph prediction tasks, as well as on the design of 3D-printable content repositories.
Using WikiProjects to Measure the Health of Wikipedia BIBAFull-Text 369-370
  Ramine Tinati; Markus Luczak-Roesch; Nigel Shadbolt; Wendy Hall
In this paper we examine WikiProjects, an emergent, community-driven feature of Wikipedia. We analysed 3.2 million Wikipedia articles associated with 618 active WikiProjects. The dataset contained the logs of over 115 million article revisions and 15 million talk entries, together representing the activity of 15 million unique Wikipedians. Our analysis revealed that per WikiProject, the number of article and talk contributions is increasing, as is the number of new Wikipedians contributing to individual WikiProjects. Based on these findings we consider how studying Wikipedia at the sub-community level may provide a means to measure Wikipedia activity.
Self Curation, Social Partitioning, Escaping from Prejudice and Harassment: The Many Dimensions of Lying Online BIBAFull-Text 371-372
  Max Van Kleek; Daniel Smith; Nigel R. Shadbolt; Dave Murray-Rust; Amy Guy
Portraying matters as other than they truly are is an important part of everyday human communication. In this paper, we use a survey to examine ways in which people fabricate, omit or alter the truth online. Many reasons are found, including creative expression, hiding sensitive information, role-playing, and avoiding harassment or discrimination. The results suggest lying is often used for benign purposes, and we conclude that its use may be essential to maintaining a humane online society.

Industrial Track

Constrained Optimization for Homepage Relevance BIBAFull-Text 375-384
  Deepak Agarwal; Shaunak Chatterjee; Yang Yang; Liang Zhang
This paper considers an application of showing promotional widgets to web users on the homepage of a major professional social network site. The types of widgets include address book invitation, group join, friends' skill endorsement and so forth. The objective is to optimize user engagement under certain business constraints. User actions on each widget may have very different downstream utilities, and quantification of such utilities can sometimes be quite difficult. Since there are multiple widgets to rank when a user visits, launching a personalized model to simply optimize user engagement such as clicks is often inappropriate. In this paper we propose a scalable constrained optimization framework to solve this problem. We consider several different types of constraints according to the business needs for this application. We show through both offline experiments and online A/B tests that our optimization framework can lead to significant improvement in user engagement while satisfying the desired set of business objectives.
The World Conversation: Web Page Metadata Generation From Social Sources BIBAFull-Text 385-395
  Omar Alonso; Sushma Bannur; Kartikay Khandelwal; Shankar Kalyanaraman
Over the past couple of years, social networks such as Twitter and Facebook have become a primary source for consuming information on the Internet. One of the main differentiators of this content from traditional information sources available on the Web is the fact that these social networks surface individuals' perspectives. When social media users post and share updates with friends and followers, some of those short fragments of text contain a link and a personal comment about the web page, image or video. We are interested in mining the text around those links for a better understanding of what people are saying about the object they refer to. The salient keywords captured from the crowd are rich metadata that we can use to augment a web page, with applications such as ranking signals, query augmentation, indexing, and organizing and categorizing content. In this paper, we present a technique called social signatures that, given a link to a web page, pulls the most important keywords from the social chatter around it -- a high-level representation of the web page from a social media perspective. Our findings indicate that the content of a social signature differs from that of the web page itself and therefore provides new insights. This difference becomes more prominent as the number of link shares increases. To showcase our work, we present the results of processing a dataset that contains around 1 billion unique URLs shared on Twitter and Facebook over a two-month period. We also provide data points that shed some light on the dynamics of content sharing in social media.
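A plain frequency count gives a minimal stand-in for social-signature extraction; the real system is more sophisticated, and the stop-word list and chatter below are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (a real system would use a fuller one).
STOP = {"the", "a", "an", "this", "is", "so", "i", "of", "and", "to",
        "was", "from", "with"}

def social_signature(comments, k=3):
    """Top-k salient keywords from the social chatter around one link,
    ranked by simple term frequency across all comments."""
    counts = Counter()
    for text in comments:
        for tok in re.findall(r"[a-z']+", text.lower()):
            if tok not in STOP:
                counts[tok] += 1
    return [w for w, _ in counts.most_common(k)]

chatter = ["So proud of this marathon result",
           "Marathon photos from the race",
           "The race was amazing, marathon done!"]
print(social_signature(chatter))  # 'marathon' ranks first, then 'race'
```

The resulting keyword list describes the page from the crowd's perspective, which is what makes it complementary to metadata extracted from the page itself.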
Got Many Labels?: Deriving Topic Labels from Multiple Sources for Social Media Posts using Crowdsourcing and Ensemble Learning BIBAFull-Text 397-406
  Shuo Chang; Peng Dai; Jilin Chen; Ed H. Chi
Online search and item recommendation systems are often based on being able to correctly label items with topical keywords. Typically, topical labelers analyze the main text associated with the item, but social media posts are often multimedia in nature and contain content beyond the main text. Topic labeling for social media posts is therefore an important open problem for supporting effective social media search and recommendation. In this work, we present a novel solution to this problem for Google+ posts, in which we integrated a number of different entity extractors and annotators, each responsible for a part of the post (e.g. text body, embedded picture, video, or web link). To account for the varying quality of different annotator outputs, we first utilized crowdsourcing to measure the accuracy of individual entity annotators, and then used supervised machine learning to combine the annotators based on their relative accuracy. Evaluating on a ground-truth data set, we found that our approach substantially outperforms topic labels obtained from the main text alone, as well as naive combinations of the individual annotators. By accurately applying topic labels according to their relevance to social media posts, our results enable better search and item recommendation.
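As a simplified stand-in for the accuracy-aware combination (the paper trains a supervised model; here an accuracy-weighted vote illustrates the underlying idea, with hypothetical annotators and accuracy estimates):

```python
from collections import defaultdict

def weighted_vote(annotations, accuracy):
    """Combine topic labels from several per-part annotators, weighting
    each annotator's votes by its crowdsourced accuracy estimate."""
    scores = defaultdict(float)
    for annotator, labels in annotations.items():
        for label in labels:
            scores[label] += accuracy[annotator]
    return max(scores, key=scores.get)

# Hypothetical outputs of three annotators on one multimedia post.
annotations = {"text": ["soccer"],
               "image": ["soccer", "stadium"],
               "link": ["music"]}
accuracy = {"text": 0.9, "image": 0.6, "link": 0.5}
print(weighted_vote(annotations, accuracy))  # "soccer"
```

A reliable text annotator thus outvotes a weaker link annotator, while agreement across parts (text and image both saying "soccer") further reinforces a label.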
Question Classification by Approximating Semantics BIBAFull-Text 407-417
  Guangyu Feng; Kun Xiong; Yang Tang; Anqi Cui; Jing Bai; Hang Li; Qiang Yang; Ming Li
A central task of computational linguistics is to decide whether two pieces of text have similar meanings. Ideally, this depends on an intuitive notion of semantic distance. While this semantic distance is most likely undefinable and uncomputable, in practice it is approximated heuristically, consciously or unconsciously. In this paper, we present a theory, and its implementation, to approximate the elusive semantic distance by the well-defined information distance. It is mathematically proven that any computable approximation of the intuitive concept of semantic distance is "covered" by our theory. We have applied our theory to question answering (QA) and performed experiments based on data extracted from over 35 million question-answer pairs. The experiments demonstrate that our initial implementation produces convincingly fewer classification errors than other academic models and commercial systems.
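Information distance is commonly made computable as the normalized compression distance (NCD), substituting a real compressor for Kolmogorov complexity. The sketch below uses zlib and invented example questions; it illustrates the general technique, not the authors' implementation.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, a computable stand-in for the
    (uncomputable) information distance:
        NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    where C(.) is the compressed length under a real compressor (zlib)."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

q1 = b"how do i reset my password"
q2 = b"how can i reset my password"
q3 = b"best pizza recipe with fresh basil"
print(ncd(q1, q2) < ncd(q1, q3))  # similar questions are closer
```

Because the compressor exploits shared substrings, paraphrased questions compress well together and land closer than unrelated ones, which is the intuition behind distance-based question classification.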
Leveraging Careful Microblog Users for Spammer Detection BIBAFull-Text 419-429
  Hao Fu; Xing Xie; Yong Rui
Microblogging websites, e.g. Twitter and Sina Weibo, have become a popular platform for socializing and sharing information in recent years. Spammers have also discovered this new opportunity to unfairly overpower normal users with unsolicited content, namely social spam. While it is intuitive for everyone to follow legitimate users, recent studies show that both legitimate users and spammers follow spammers, for different reasons. Evidence of users seeking out spammers on purpose is also observed. We regard this behavior as useful information for spammer detection. In this paper, we approach the problem of spammer detection by leveraging the "carefulness" of users, which indicates how careful a user is when she is about to follow a potential spammer. We propose a framework to measure this carefulness, and develop a supervised learning algorithm to estimate it based on known spammers and legitimate users. We then illustrate how spammer detection can be improved with the aid of the proposed measure. An evaluation on a real dataset with millions of users and an online test are performed on Sina Weibo. The results show that our approach indeed captures carefulness, and that it is effective at detecting spammers. In addition, we find that the proposed measure is also beneficial for other applications, e.g. link prediction.
Topological Properties and Temporal Dynamics of Place Networks in Urban Environments BIBAFull-Text 431-441
  Anastasios Noulas; Blake Shaw; Renaud Lambiotte; Cecilia Mascolo
Understanding the spatial networks formed by the trajectories of mobile users can be beneficial to applications ranging from epidemiology to local search. Despite the potential for impact in a number of fields, several aspects of human mobility networks remain largely unexplored due to the lack of large-scale data at a fine spatiotemporal resolution. Using a longitudinal dataset from the location-based service Foursquare, we perform an empirical analysis of the topological properties of place networks and note their resemblance to online social networks in terms of heavy-tailed degree distributions, triadic closure mechanisms and the small world property. Unlike social networks however, place networks present a mixture of connectivity trends in terms of assortativity that are surprisingly similar to those of the web graph. We take advantage of additional semantic information to interpret how nodes that take on functional roles such as 'travel hub', or 'food spot' behave in these networks. Finally, motivated by the large volume of new links appearing in place networks over time, we formulate the classic link prediction problem in this new domain. We propose a novel variant of gravity models that brings together three essential elements of inter-place connectivity in urban environments: network-level interactions, human mobility dynamics, and geographic distance. We evaluate this model and find it outperforms a number of baseline predictors and supervised learning algorithms on a task of predicting new links in a sample of one hundred popular cities.
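The gravity-model intuition, popular and nearby places are most likely to become connected, can be sketched with a generic scoring function. The functional form, exponent and check-in counts below are illustrative assumptions, not the paper's exact variant.

```python
def gravity_score(w_i, w_j, dist_km, gamma=2.0):
    """Generic gravity-model link score for two places: the product of
    their popularities (e.g., check-in counts) divided by their
    geographic distance raised to a decay exponent gamma."""
    return (w_i * w_j) / (dist_km ** gamma)

# Candidate links from a 'travel hub' to two 'food spots'
# (hypothetical check-in counts and distances).
near_food_spot = gravity_score(1000, 300, dist_km=1.0)
far_food_spot = gravity_score(1000, 300, dist_km=10.0)
print(near_food_spot > far_food_spot)  # True: nearby place scores higher
```

Ranking all unconnected place pairs by such a score and predicting the top-scoring pairs as new links is the basic setup of the link prediction task the abstract describes; the paper's variant additionally folds in network-level interactions and mobility dynamics.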
Synonym Discovery for Structured Entities on Heterogeneous Graphs BIBAFull-Text 443-453
  Xiang Ren; Tao Cheng
With the increasing use of entities in serving people's daily information needs, recognizing synonyms -- different ways people refer to the same entity -- has become a crucial task for many entity-leveraging applications. Previous work often takes a "literal" view of the entity, i.e., its string name. In this work, we propose adopting a "structured" view of each entity by considering not only its string name, but also other important structured attributes. Unlike existing query log-based methods, we delve deeper to explore sub-queries, and exploit tail synonyms and tail web pages for harvesting more synonyms. A general, heterogeneous graph-based data model which encodes our problem insights is designed by capturing three key concepts (synonym candidate, web page and keyword) and different types of interactions between them. We cast the synonym discovery problem as a graph-based ranking problem and demonstrate the existence of a closed-form optimal solution for outputting entity synonym scores. Experiments on several real-life domains demonstrate the effectiveness of our proposed method.
Temporal Multi-View Inconsistency Detection for Network Traffic Analysis BIBAFull-Text 455-465
  Houping Xiao; Jing Gao; Deepak S. Turaga; Long H. Vu; Alain Biem
In this paper, we investigate the problem of identifying inconsistent hosts in large-scale enterprise networks by mining multiple views of temporal data collected from the networks. The time-varying behavior of hosts is typically consistent across multiple views, and thus hosts that exhibit inconsistent behavior are possible anomalous points to be further investigated. To achieve this goal, we develop an effective approach that extracts common patterns hidden in multiple views and detects inconsistency by measuring the deviation from these common patterns. Specifically, we first apply various anomaly detectors on the raw data and form a three-way tensor (host, time, detector) for each view. We then develop a joint probabilistic tensor factorization method to derive the latent tensor subspace, which captures common time-varying behavior across views. Based on the extracted tensor subspace, an inconsistency score is calculated for each host that measures the deviation from common behavior. We demonstrate the effectiveness of the proposed approach on two enterprise-wide network-based anomaly detection tasks. An enterprise network consists of multiple hosts (servers, desktops, laptops) and each host sends/receives a time-varying number of bytes across network protocols (e.g., TCP, UDP, ICMP) or sends URL requests to DNS under various categories. The inconsistent behavior of a host is often a leading indicator of potential issues (e.g., instability, malicious behavior, or hardware malfunction). We perform experiments on real-world data collected from IBM enterprise networks, and demonstrate that the proposed method can find hosts with inconsistent behavior that are important to cybersecurity applications.
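The paper's joint probabilistic tensor factorization is not reproduced here; a much simpler cross-view disagreement score conveys the underlying intuition (a host is suspicious when its per-view time series disagree). The correlation-based proxy and all names below are illustrative assumptions, not the authors' method:

```python
import numpy as np

def inconsistency_scores(views):
    """views: list of arrays of shape (hosts, time), one anomaly-score
    matrix per view. A host whose time series disagree across views gets
    a high score. This correlation-based proxy stands in for the paper's
    joint probabilistic tensor factorization."""
    n_hosts = views[0].shape[0]
    scores = np.zeros(n_hosts)
    for h in range(n_hosts):
        series = [v[h] for v in views]
        # average pairwise correlation of this host's behavior across views
        corrs = [np.corrcoef(a, b)[0, 1]
                 for i, a in enumerate(series) for b in series[i + 1:]]
        scores[h] = 1.0 - np.mean(corrs)  # low agreement -> high score
    return scores
```

A host behaving identically in every view scores near 0; one whose views move in opposite directions scores near 2.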

PhD Symposium

A Hybrid Approach to Perform Efficient and Effective Query Execution Against Public SPARQL Endpoints BIBAFull-Text 469-473
  Maribel Acosta
Linked Open Data initiatives have fostered the publication of Linked Data sets, as well as the deployment of publicly available SPARQL endpoints as client-server querying infrastructures to access these data sets. However, recent studies reveal that SPARQL endpoints may exhibit significant limitations in supporting real-world applications, and public linked data sets can suffer from quality issues, e.g., data can be incomplete or incorrect. We tackle these problems and propose a novel hybrid architecture that relies on shipping policies to improve the performance of SPARQL endpoints, and exploits human and machine query processing computation to enhance the quality of Linked Data sets. We report on initial empirical results that suggest that the proposed techniques overcome current drawbacks, and may provide a novel solution to make these promising infrastructures available for real-world applications.
A Taxonomy of Crowdsourcing Campaigns BIBAFull-Text 475-479
  Majid Ali AlShehry; Bruce Walker Ferguson
Crowdsourcing serves different needs of different sets of users. Most existing definitions and taxonomies of crowdsourcing address platform purpose while paying little attention to other parameters of this novel social phenomenon. In this paper, we analyze 41 crowdsourcing campaigns on 21 crowdsourcing platforms to derive 9 key parameters of successful crowdsourcing campaigns and introduce a comprehensive taxonomy of crowdsourcing. Using this taxonomy, we identify crowdsourcing trends in two parameters, platform purpose and contributor motivation. The paper highlights important advantages of using this conceptual model in planning crowdsourcing campaigns and concludes with a discussion of emerging challenges to such campaigns.
Discovering Credible Events in Near Real Time from Social Media Streams BIBAFull-Text 481-485
  Cody Buntain
My proposed research addresses fundamental deficiencies in social media-based event detection by discovering high-impact moments and evaluating their credibility rapidly. Results from my preliminary work demonstrate one can discover compelling moments by leveraging machine learning to characterize and detect bursts in keyword usage. Though this early work focused primarily on language-agnostic discovery in sporting events, it also showed promising results in adapting this work to earthquake detection. My dissertation will extend this research by adapting models to other types of high-impact events, exploring events with different temporal granularities, and finding methods to connect contextually related events into timelines. To ensure applicability of this research, I will also port these event discovery algorithms to stream processing platforms and evaluate their performance in the real-time context. To address issues of trust, my dissertation will also include developing algorithms that integrate the vast array of social media features to evaluate information credibility in near real time. Such features include structural signatures of information dissemination, the location from which a social media message was posted relative to the location of the event it describes, and metadata from related multimedia (e.g., pictures and video) shared about the event. My preliminary work also suggests methods that could be applied to social networks for stimulating trustworthy behavior and enhancing information quality. Contributions from my dissertation will primarily be practical algorithms for discovering events from various social media streams and algorithms for evaluating and enhancing the credibility of these events in near real time.
Ontology Search: Finding the Right Ontologies on the Web BIBAFull-Text 487-491
  Anila Sahar Butt
With the recent growth of Linked Data on the Web, there is an increased need for knowledge engineers to find ontologies to describe their data. Only limited work exists that addresses the problem of searching and ranking ontologies based on keyword queries. In this proposal we introduce the main challenges to find appropriate ontologies, and preliminary solutions to address these challenges. Our evaluation shows that the proposed solution performs significantly better than existing solutions on a benchmark ontology collection for the majority of the sample queries defined in the benchmark.
Make Hay While the Crowd Shines: Towards Efficient Crowdsourcing on the Web BIBAFull-Text 493-497
  Ujwal Gadiraju
Within the scope of this PhD proposal, we set out to investigate two pivotal aspects that influence the effectiveness of crowdsourcing: (i) microtask design, and (ii) worker behavior. Leveraging the dynamics of tasks that are crowdsourced on the one hand, and accounting for the behavior of workers on the other hand, can help in designing tasks efficiently. To help understand the intricacies of microtasks, we identify the need for a taxonomy of typically crowdsourced tasks. Based on an extensive study of 1000 workers on CrowdFlower, we propose a two-level categorization scheme for tasks. We present insights into the task affinity of workers, the effort exerted by workers to complete tasks of various types, and their satisfaction with the monetary incentives. We also analyze the prevalent behavior of trustworthy and untrustworthy workers. Next, we propose behavioral metrics that can be used to measure and counter malicious activity in crowdsourced tasks. Finally, we present guidelines for the effective design of crowdsourced surveys and set important precedents for future work.
Mining Scholarly Communication and Interaction on the Social Web BIBAFull-Text 499-503
  Asmelash Teka Hadgu
The explosion of Web 2.0 platforms, including social networking sites such as Twitter, blogs and wikis, affects all web users, scholars included. As a result, there is a need for a comprehensive approach to gain a broader understanding of, and timely signals about, scientific communication and how researchers interact on the social web. Most current work in this area either deals with a small number of researchers and relies heavily on manual annotation, or performs large-scale analysis without a deep understanding of the underlying researcher population. In this proposal, we present a holistic approach to solving these problems. This research proposes novel methods to collect, filter, analyze and make sense of scholars and scholarly communication by integrating heterogeneous data sources from fast social media streams as well as the academic web. By applying reproducible research and contributing applications and data sets, this thesis proposal strives to add value by mining the social web for social good.
Modeling Cognitive Processes in Social Tagging to Improve Tag Recommendations BIBAFull-Text 505-509
  Dominik Kowald
With the emergence of Web 2.0, tag recommenders have become important tools, which aim to support users in finding descriptive tags for their bookmarked resources. Although current algorithms provide good results in terms of tag prediction accuracy, they are often designed in a data-driven way and thus, lack a thorough understanding of the cognitive processes that play a role when people assign tags to resources. This thesis aims at modeling these cognitive dynamics in social tagging in order to improve tag recommendations and to better understand the underlying processes. As a first attempt in this direction, we have implemented an interplay between individual micro-level (e.g., categorizing resources or temporal dynamics) and collective macro-level (e.g., imitating other users' tags) processes in the form of a novel tag recommender algorithm. The preliminary results for datasets gathered from BibSonomy, CiteULike and Delicious show that our proposed approach can outperform current state-of-the-art algorithms, such as Collaborative Filtering, FolkRank or Pairwise Interaction Tensor Factorization. We conclude that recommender systems can be improved by incorporating related principles of human cognition.
Social Media as Firm's Network and Its Influence on the Corporate Performance BIBAFull-Text 511-514
  Jeongwoo Oh
Social media, or social network services, have attracted a great deal of interest in applied studies, because they reveal, even at a glance, how people connect, behave, and interact with each other. Individual usage of social media can also be viewed at the corporate level in the market. This paper starts from that interest as well, trying to verify the applicability of social media to corporate finance research. The basic question is whether the social media interactions of a firm, or of a firm's executives, can affect the performance of the corresponding firm. In the study of economics and finance, firm-level networks have been studied in different contexts, mainly those involving economic benefit. However, the effect of online networks on corporate performance has not been sufficiently studied. In general, a firm's decision-making process has been regarded as exclusive and confidential rather than publicly observable, which has focused attention on closed networks, in person or between related firms. But we observe that many top executives are already active, or even stars, on social media. We therefore take a close look at whether networking on social media is a personal activity or corporate behavior. In other words, we are interested in whether an internet-based lifestyle built around social media can influence corporate performance in the market or the firm-level decision-making process. We investigate this question by using both social media and market data together with firm information. First, we identify the determinants of the social network behavior of firms' executives. Next, we estimate the value of the social media network for corporate performance, calculating abnormal returns and analyzing their dynamics in the long term. Finally, we also verify the value of the social media network in the context of executives' personal compensation.
   We expect that this research will provide new insight into the effect of social networks on corporate performance by adopting online social media as a network variable. It is also expected to broaden the applicability of online network data to academic questions in finance research.
A Hybrid Framework for Online Execution of Linked Data Queries BIBAFull-Text 515-519
  Mohamed M. Sabri
Linked Data has been widely adopted over the last few years, with the size of the Linked Data cloud almost doubling every year. However, there is still no well-defined, efficient mechanism for querying such a Web of Data. We propose a framework that incorporates a set of optimizations to tackle various limitations in the state-of-the-art. The framework aims at combining the centralized query optimization capabilities of the data warehouse-based approaches with the result freshness and explorative data source discovery capabilities of link-traversal approaches. This is achieved by augmenting base-line link-traversal query execution with a set of optimization techniques. The proposed optimizations fall under two categories: metadata-based optimizations and semantics-based optimizations.

AW4CITY 2015

Comparing Smart Cities with different modeling approaches BIBAFull-Text 525-528
  Leonidas G. Anthopoulos; Marijn Janssen; Vishanth Weerakkody
Smart cities have attracted extensive and increasing interest from both science and industry, with a growing number of international examples emerging across the world. However, despite the significant role that smart cities can play in dealing with recent urban challenges, the concept has been criticized for being influenced by vendor hype. There are various attempts to conceptualize smart cities, and various benchmarking methods have been developed to evaluate their impact. In this paper the modeling and benchmarking approaches are systematically compared. There are six common dimensions among the approaches, namely people, government, economy, mobility, environment and living. This paper utilizes existing smart city analysis models to review three representative smart city cases, and useful outcomes are extrapolated from this comparison.
Understanding Smart City Business Models: A Comparison BIBAFull-Text 529-534
  Leonidas G. Anthopoulos; Panos Fitsilis
Smart cities have attracted international scientific and business attention, and a niche market is evolving that engages almost all business sectors. In their attempt to empower and promote urban competitive advantages, local governments have approached the smart city context, targeting inhabitants, visitors and investments. However, engaging with the smart city context is not free of charge, and the corresponding investments are extensive and, without appropriate management, of high risk. Moreover, investing in the smart city domain does not by itself secure the corresponding mission's success, and both governments and vendors require more effective instruments. This paper, a work in progress, investigates smart city business models. Modeling can illustrate where the corresponding profit comes from and how it flows, while a substantial business model portfolio would be valuable for smart city stakeholders.
An Urban Fault Reporting and Management Platform for Smart Cities BIBAFull-Text 535-540
  Sergio Consoli; Diego Reforgiato Recupero; Misael Mongiovi; Valentina Presutti; Gianni Cataldi; Wladimiro Patatu
A good interaction between public administrations and citizens is imperative in modern smart cities. Semantic web technologies can aid in achieving such a goal. We present a smart urban fault reporting web platform to help citizens in reporting common urban problems, such as street faults, potholes or broken street lights, and to support the local public administration in responding and fixing those problems quickly. The tool is based on a semantic data model designed for the city, which integrates several distinct data sources, suitably re-engineered to meet the principles of the Semantic Web and linked open data. The platform supports the whole process of road maintenance, from the fault reporting to the management of maintenance activities. The integration of multiple data sources enables increasing interoperability and heterogeneous information retrieval, thus favoring the development of effective smart urban fault reporting services. Our platform was evaluated in a real case study: a complete urban reporting and road maintenance system has been developed for the municipality of Catania. Our approach is completely generalizable and can be adopted by and customized for other cities. The final goal is to stimulate smart maintenance services in the "cities of the future".
Supporting the Development of Smart Cities using a Use Case Methodology BIBAFull-Text 541-545
  Marion Gottschalk; Mathias Uslar
Urbanization grows steadily: more people live in cities, while rural areas become less populated. Urbanization poses challenges for city planning and development. Cities have to deal with large crowds, high energy consumption, large quantities of garbage, etc. Thus, smart cities have to meet many requirements from different areas. Realizing smart cities can hence be supported by linking different smart areas, such as smart grids and smart homes, into one large area. The linking is done by information and communication technologies, which are supported through a clear definition of functionalities and interfaces. Smart cities and further smart areas are still under development, so it is difficult to give an overview of their functionalities yet. Therefore, two approaches, the use case methodology and integration profiles, are introduced in this work and realized in a web-based application.
Innovative IoT-aware Services for a Smart Museum BIBAFull-Text 547-550
  Vincenzo Mighali; Giuseppe Del Fiore; Luigi Patrono; Luca Mainetti; Stefano Alletto; Giuseppe Serra; Rita Cucchiara
Smart cities are a trending topic in both the academic literature and the industrial world. The capability to provide users with added-value services through low-power, low-cost smart objects is very attractive in many fields. Among these, art and culture are particularly interesting examples, as tourism is one of the main driving engines of modern society. In this paper, we propose an IoT-aware architecture to improve the cultural experience of the user by drawing on the most important recent innovations in the ICT field. The main components of the proposed architecture are: (i) an indoor localization service based on the Bluetooth Low Energy technology, (ii) a wearable device able to capture and process images related to the user's point of view, (iii) the user's mobile device, used to display customized cultural contents and to share multimedia data in the Cloud, and (iv) a processing center that manages the core of the whole business logic. In particular, the processing center interacts with both the wearable and the mobile devices, and communicates with the outside world to retrieve contents from the Cloud and to provide services to external users as well. The proposal is currently under development and will be validated in the MUST museum in Lecce.
Design of Interactional End-to-End Web Applications for Smart Cities BIBAFull-Text 551-556
  Erich Ortner; Marco Mevius; Peter Wiedmann; Florian Kurz
Nowadays, the number of flexible and fast interactions between humans and application systems is dramatically increasing. For instance, citizens use the internet to organize surveys or meetings spontaneously and in real time. These interactions are supported by technologies and application systems such as free wireless networks and web or mobile apps. Smart Cities aim at enabling their citizens to use these digital services, e.g., by providing enhanced networks and application infrastructures maintained by the public administration. However, looking beyond technology, there is still a significant lack of interaction and support between "normal" citizens and the public administration. For instance, democratic decision processes (e.g., how to allocate public disposable budgets) are often discussed by the public administration without citizen involvement. This paper introduces an approach that describes the design of enhanced interactional web applications for Smart Cities based on dialogical logic process patterns. We demonstrate the approach with the help of a budgeting scenario, along with a summary and an outlook on further research.
Smart Cities Governance Informatability?: Let's First Understand the Atoms BIBAFull-Text 557-562
  Alois Paulin
In this paper we search for and analyze the atomic components of general governance systems and discuss whether or not they can be informated, i.e. tangibly represented within the digital realm of information systems. We draw a framework based on the theories of Downs, Jellinek, and Hohfeld and find that the therein identified atomic components cannot be informated directly, but only indirectly, due to the inherent complexity of governance. We outline pending research questions to be addressed in the future.
Towards Personalized Smart City Guide Services in Future Internet Environments BIBAFull-Text 563-568
  Robert Seeliger; Christopher Krauss; Annette Wilson; Miggi Zwicklbauer; Stefan Arbanowski
The FI-CONTENT project aims at establishing the foundation of a European infrastructure for developing and testing novel smart city services. The Smart City Services Platform will develop enabling technology for SMEs and developers to create services that offer residents and city visitors smart services that enhance their visit or daily life. We have made use of generic, specific and common enablers to develop a reference implementation, the Smart City Guide web app. The basic information is provided by the Open City Database, an open source specific enabler that can be used for any city in Europe. Recommendation as a Service is an enabler that can be applied to many use cases; here we describe how we integrated it into the Smart City Guide. The use cases will be iteratively improved and upgraded during regular iterative cycles based on feedback gained in lab and field trials at the experimentation sites. As the app is transferable to any city, it will be tested at a number of experimentation sites.
A Universal Design Infrastructure for Multimodal Presentation of Materials in STEM Programs: Universal Design BIBAFull-Text 569-574
  Leyla Zhuhadar; Bryan Carson; Jerry Daday; Olfa Nasraoui
We describe a proposed universal design infrastructure that aims at promoting better opportunities for students with disabilities in STEM programs to understand multimedia teaching material. The Accessible Educational STEM Videos Project aims to transform learning and teaching for students with disabilities by integrating synchronized captioned educational videos into undergraduate and graduate STEM disciplines. This Universal Video Captioning (UVC) platform will serve as a repository for uploading videos and scripts. The proposed infrastructure is a web-based platform that uses the latest WebDAV technology (Web-based Distributed Authoring and Versioning) to identify resources, users, and content. It consists of three layers: (i) an administrative management system; (ii) a faculty/staff user interface; and (iii) a transcriber user interface. We anticipate that, by enriching materials with captions and transcripts, this multimodal presentation will help students with disabilities in STEM programs master the subject better and increase retention.

BigScholar 2015

The Knowledge Web Meets Big Scholars BIBAFull-Text 577-578
  Kuansan Wang
Humans are the only species on earth that has mastered the technologies of writing and printing to capture ephemeral thoughts and scientific discoveries. The ability to pass along knowledge, not only geographically but also generationally, has formed the bedrock of our civilizations. We are in the midst of a silent revolution driven by technological advancements: no longer are computers just a fixture of our physical world; they have been so deeply woven into our daily routines that they now occupy the center of our lives. Nowhere is this phenomenon more prominent than in our reliance on the World Wide Web. More and more often, the web has become the primary source of fresh information and knowledge. In addition to general consumption, the availability of large amounts of content and behavioral data has also instigated new interdisciplinary research activities in the areas of information retrieval, natural language processing, machine learning, behavioral studies, social computing and data mining. This talk will use web search as an example to demonstrate how these new research activities and technologies have helped the web evolve from a collection of documents into the largest knowledge base in our history. During this evolution, the web has been transformed from merely reacting to our needs into a living entity that can anticipate and push timely information to wherever and whenever we need it. How scholarly activities and communications can be impacted will also be illustrated and elaborated, and some observations derived from a web-scale data set, newly released to the public, will also be shared.
AVER: Random Walk Based Academic Venue Recommendation BIBAFull-Text 579-584
  Zhen Chen; Feng Xia; Huizhen Jiang; Haifeng Liu; Jun Zhang
Academic venues act as the main platform of communities in academia and the bridge connecting researchers, and they have developed rapidly in recent years. However, information overload in big scholarly data creates tremendous challenges for mining useful and effective information in order to recommend high-quality, fruitful academic venues to researchers, thereby enabling them to participate in relevant academic conferences as well as to contribute to important and influential journals. In this work, we propose AVER, a novel random walk based Academic VEnue Recommendation model. AVER runs a random walk with restart model on a co-publication network which contains two kinds of associations, coauthor relations and author-venue relations. Moreover, we define a transfer matrix with bias to drive the random walk by exploiting three academic factors: co-publication frequency, weight of relations, and researchers' academic level. AVER is inspired by the fact that researchers are more likely to contact those who have high co-publication frequency and similar academic levels. Additionally, in AVER, we consider the difference in weights between the two kinds of associations. We conduct extensive experiments on the DBLP data set to evaluate the performance of AVER. The results demonstrate that, in comparison to relevant baseline approaches, AVER performs better in terms of precision, recall and F1.
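AVER's biased transfer matrix is built from the academic factors the abstract lists; the random walk with restart itself is a standard procedure. A minimal sketch of that core step (the matrix construction is omitted; the function name and restart probability are assumptions):

```python
import numpy as np

def rwr(P, seed, restart=0.15, iters=100):
    """Random walk with restart: P is a column-stochastic transition
    matrix over the co-publication network; `seed` is the researcher's
    node index. Returns stationary visiting probabilities; venue nodes
    with the highest mass would be recommended. AVER's bias terms
    (co-publication frequency, relation weights, academic level) would
    be folded into P before calling this."""
    n = P.shape[0]
    r = np.zeros(n)
    r[seed] = 1.0
    e = r.copy()  # restart distribution concentrated on the seed
    for _ in range(iters):
        r = (1 - restart) * P @ r + restart * e
    return r
```

Because P is column-stochastic, the probability mass sums to 1 at every iteration, and the iteration converges geometrically at rate (1 - restart).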
Discovering the Rise and Fall of Software Engineering Ideas from Scholarly Publication Data BIBAFull-Text 585-590
  Subhajit Datta; Santonu Sarkar; A. S. M. Sajeev; Nishant Kumar
For researchers and practitioners of a relatively young discipline like software engineering, an enduring concern is to identify the acorns that will grow into oaks -- ideas remaining most current in the long run. Additionally, it is interesting to know how the ideas have risen in importance, and fallen, perhaps to rise again. We analyzed a corpus of 19,000+ papers written by 21,000+ authors across 16 software engineering publication venues from 1975 to 2010, to empirically determine the half-life of software engineering research topics. We adapted existing measures of half-life as well as defined a specific measure based on publication and citation counts. The results from this empirical study are presented in this paper.
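The abstract does not give the authors' specific half-life measure; a common citation-based notion (the number of years until a topic accumulates half of its total citations) can be sketched as follows, purely for illustration:

```python
def half_life(citations_per_year):
    """Years until a topic has accumulated half of its total citations.
    An illustrative stand-in for the paper's adapted half-life measures;
    input is a list of yearly citation counts, oldest year first."""
    total = sum(citations_per_year)
    running = 0
    for year, c in enumerate(citations_per_year, start=1):
        running += c
        if running * 2 >= total:
            return year
    return len(citations_per_year)
```

A topic whose citations are front-loaded has a short half-life; one cited evenly over a decade has a half-life near five years.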
Science Navigation Map: an Interactive Data Mining Tool for Literature Analysis BIBAFull-Text 591-596
  Yu Liu; Zhen Huang; Yizhou Yan; Yufeng Chen
With the advances of all research fields and Web 2.0, scientific literature has become widely available in digital libraries, citation databases, and social media. Its new properties, such as large volume, wide exhibition, and the complicated citation relationships among papers, bring challenges to the management and analysis of scientific literature and to exploring the knowledge it contains. In addition, although data mining techniques have been applied to scientific literature analysis tasks, they typically require expert input and guidance and return static results to users after processing, which makes them inflexible and far from smart. Therefore, there is a need for a tool that highly reflects article-level metrics and combines human users and computer systems to analyze scientific literature, explore its knowledge, and discover and visualize underlying interesting research topics. We design an online tool for literature navigation, filtering, and interactive data mining, named Science Navigation Map (SNM), which integrates information from online paper repositories, citation databases, etc. SNM provides visualization of article-level metrics and interactive data mining that takes advantage of effective interaction between human users and computer systems to explore and extract knowledge from scientific literature and discover underlying interesting research topics. We also propose a multi-view non-negative matrix factorization and apply it to SNM as an interactive data mining tool, which can make better use of the complicated multi-wise relationships among papers. In our experiments, we visualize all the papers published in the journal PLOS Biology from 2003 to 2012 in the navigation map and explore six relationships among papers for data mining. From this map, one can easily filter, analyze, and explore the knowledge of the papers in an interactive way.
Big Scholarly Data in CiteSeerX: Information Extraction from the Web BIBAFull-Text 597-602
  Alexander G., II Ororbia; Jian Wu; Madian Khabsa; Kyle Williams; Clyde Lee Giles
We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the world wide web. From the perspective of automatic information extraction and modes of alternative search, we examine various functional aspects of this complex system with an eye towards ongoing and future research developments.
Using Reference Groups to Assess Academic Productivity in Computer Science BIBAFull-Text 603-608
  Sabir Ribas; Berthier Ribeiro-Neto; Edmundo de Souza e Silva; Alberto Hideki Ueda; Nivio Ziviani
In this paper we discuss the problem of how to assess academic productivity based on publication outputs. We are interested in knowing how well a research group in an area of knowledge is doing relative to a pre-selected set of reference groups, where each group is composed of academics or researchers. To assess academic productivity we adopt a new metric we propose, which we call P-score. We use P-score, citation counts and H-Index to obtain rankings of researchers in Brazil. Experimental results using data from the area of Computer Science show that P-score outperforms citation counts and H-Index when assessed against the official ranking produced by the Brazilian National Research Council (CNPq). This is of interest for two reasons. First, it suggests that citation-based metrics, despite wide adoption, can be improved upon. Second, contrary to citation-based metrics, the P-score metric does not require access to the content of publications to be computed.
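P-score is the paper's own metric and is not defined in the abstract, so it is not sketched here. Of the baselines it is compared against, the H-Index has a simple standard definition (the largest h such that h of a researcher's papers each have at least h citations):

```python
def h_index(citations):
    """H-Index: largest h such that the researcher has h papers with at
    least h citations each -- one of the citation-based baselines that
    P-score is compared against."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i  # the i-th most-cited paper still has >= i citations
        else:
            break
    return h
```

For example, a researcher with papers cited [10, 8, 5, 4, 3] times has an H-Index of 4.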
Modeling and Analysis of Scholar Mobility on Scientific Landscape BIBAFull-Text 609-614
  Qiu Fang Ying; Srinivasan Venkatramanan; Dah Ming Chiu
Scientific literature to date can be thought of as a partially revealed landscape, where scholars continue to unveil hidden knowledge by exploring novel research topics. How do scholars explore the scientific landscape, i.e., choose research topics to work on? We propose an agent-based model of topic mobility behavior where scholars migrate across research topics on the space of science following different strategies, seeking different utilities. We use this model to study whether strategies widely used in the current scientific community can provide a balance between individual scientific success and the efficiency and diversity of the whole academic society. Through extensive simulations, we provide insights into the roles of different strategies, such as choosing topics according to their research potential or popularity. Our model provides a conceptual framework and a computational approach to analyze scholars' behavior and its impact on scientific production. We also discuss how such an agent-based modeling approach can be integrated with big real-world scholarly data.

DAEN 2015

The Dynamics of Micro-Task Crowdsourcing: The Case of Amazon MTurk BIBAFull-Text 617
  Djellel Eddine Difallah; Michele Catasta; Gianluca Demartini; Panagiotis G. Ipeirotis; Philippe Cudré-Mauroux
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
   In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, tasks, and platform); (B) leverage the main findings of our five-year log analysis to propose features for a predictive model that estimates the expected performance of any batch at a specific point in time, showing that the number of tasks left in a batch and the recency of the batch are two key features of the prediction; and (C) conduct an analysis of demand (new tasks posted by requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
Co-evolutionary Dynamics of Information Diffusion and Network Structure BIBAFull-Text 619-620
  Mehrdad Farajtabar; Manuel Gomez-Rodriguez; Yichen Wang; Shuang Li; Hongyuan Zha; Le Song
Information diffusion in online social networks is obviously affected by the underlying network topology, but it also has the power to change that topology. Online users are constantly creating new links when exposed to new information sources, and in turn these links are altering the route of information spread. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics. In this work, we propose a probabilistic generative model, COEVOLVE, for the joint dynamics of these two processes, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate diffusion and network events from the co-evolutionary dynamics, and to generate traces obeying common diffusion and network patterns observed in real-world networks. Furthermore, we develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.
Microscopic Description and Prediction of Information Diffusion in Social Media: Quantifying the Impact of Topical Interests BIBAFull-Text 621-622
  Przemyslaw Grabowicz; Niloy Ganguly; Krishna Gummadi
A number of recent studies of information diffusion in social media, both empirical and theoretical, have been inspired by viral propagation models derived from epidemiology. These studies model propagation of memes, i.e., pieces of information, between users in a social network similarly to the way diseases spread in human society. Naturally, many of these studies emphasize social exposure, i.e., the number of friends or acquaintances of a user that have exposed a meme to her, as the primary metric for understanding, predicting, and controlling information diffusion. Intuitively, one would expect a meme to spread in a social network selectively, i.e., amongst the people who are interested in the meme. However, the importance of the alignment between the topicality of a meme and the topical interests of the potential adopters and influencers in the network has been less explored in the literature. In this paper, we quantify the impact of the topical alignment between memes and users on their adoption. Our analysis, using empirical data about two different types of memes, i.e., hashtags and URLs spreading through the Twitter social media platform, finds that topical alignment between memes and users is as crucial as the social exposure in understanding and predicting meme adoptions. Our results emphasize the need to look beyond social network-based viral propagation models and develop microscopic models of information diffusion that account for interests of users and topicality of information.
Scalable Methods for Adaptively Seeding a Social Network BIBAFull-Text 623-624
  Thibaut Horel; Yaron Singer
In many applications of influence maximization, one is restricted to selecting influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. To alleviate this issue, one can consider an adaptive method which selects users in a manner that targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential themselves, they often know someone who is. Despite the various complexities of such optimization problems, we show that scalable adaptive seeding is achievable. To show the effectiveness of our methods, we collected data from various verticals that social network users follow, and applied our methods to it. Our experiments show that adaptive seeding is scalable, and that it obtains dramatic improvements over standard approaches to information dissemination.
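The friendship-paradox intuition behind adaptive seeding can be sketched in a few lines. This is our own toy illustration, not the authors' scalable algorithm: instead of seeding the engaged users directly, the budget is spent on their highest-degree neighbors.

```python
def adaptive_seeds(engaged, graph, budget):
    """graph: {node: set(neighbors)}. Return up to `budget` neighbors of the
    engaged users, ranked by degree (a crude proxy for influence)."""
    candidates = {v for u in engaged for v in graph[u]} - set(engaged)
    return sorted(candidates, key=lambda v: len(graph[v]), reverse=True)[:budget]

# Toy network: "a" and "b" engaged with the topic; neither is influential,
# but both know "hub", who is.
graph = {
    "a": {"hub"}, "b": {"hub"}, "c": {"hub", "d"},
    "hub": {"a", "b", "c", "e"}, "d": {"c"}, "e": {"hub"},
}
seeds = adaptive_seeds(["a", "b"], graph, 1)
```

Here `seeds` is `["hub"]`: the budget is spent on the influential neighbor rather than on the low-degree engaged users themselves.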
Inferring Graphs from Cascades: A Sparse Recovery Framework BIBAFull-Text 625-626
  Jean Pouget-Abadie; Thibaut Horel
In the Graph Inference problem, one seeks to recover the edges of an unknown graph from the observations of cascades propagating over this graph. We approach this problem from the sparse recovery perspective. We introduce a general model of cascades, including the voter model and the independent cascade model, for which we provide the first algorithm which recovers the graph's edges with high probability and O(s log m) measurements where s is the maximum degree of the graph and m is the number of nodes. Furthermore, we show that our algorithm also recovers the edge weights (the parameters of the diffusion process) and is robust in the context of approximate sparsity. Finally we validate our approach empirically on synthetic graphs.
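As a rough intuition for what cascade-based graph inference recovers, consider the following naive frequency-count estimator. It is our own stand-in, not the paper's sparse-recovery algorithm: an edge j → i is guessed from how often i activates one step after j across many independent-cascade traces.

```python
import random

def estimate_edges(traces, n, threshold=0.3):
    """traces: list of cascades; each cascade is a list of sets of nodes
    newly active at step t. Guess edge j -> i from the empirical frequency
    with which i activates one step after j."""
    follow = [[0] * n for _ in range(n)]
    chances = [0] * n
    for steps in traces:
        for t in range(len(steps) - 1):
            for j in steps[t]:
                chances[j] += 1
                for i in steps[t + 1]:
                    follow[j][i] += 1
    return {(j, i) for j in range(n) for i in range(n)
            if chances[j] and follow[j][i] / chances[j] >= threshold}

# Simulate independent cascades on a small known graph, then recover its edges.
random.seed(0)
true_edges = {(0, 1), (1, 2)}

def simulate(p=0.9):
    steps, active = [{0}], {0}
    while steps[-1]:
        nxt = {i for j in steps[-1] for (jj, i) in true_edges
               if jj == j and i not in active and random.random() < p}
        active |= nxt
        steps.append(nxt)
    return steps

traces = [simulate() for _ in range(200)]
edges = estimate_edges(traces, 3)
```

With enough traces the estimator recovers `{(0, 1), (1, 2)}`; the paper's sparse-recovery framework achieves this with far fewer measurements and with provable guarantees.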

KET 2015

A Hitchhiker's Guide to Ontology BIBAFull-Text 629
  Fabian M. Suchanek
In this talk, I will present our recent work in the area of knowledge bases. It covers 4 areas of research around ontologies and knowledge bases: The first area is the construction of the YAGO knowledge base. YAGO is now multilingual, and has grown into a larger project at the Max Planck Institute for Informatics and Télécom ParisTech. The second area is the alignment of knowledge bases. This includes the alignment of classes, instances, and relations across knowledge bases. The third area is rule mining. Our project finds semantic correlations in the form of Horn rules in the knowledge base. I will also talk about watermarking approaches to trace the provenance of ontological data. Finally, I will show applications of the knowledge base for mining news corpora.
Isaac Bloomberg Meets Michael Bloomberg: Better Entity Disambiguation for the News BIBAFull-Text 631-635
  Luka Bradesko; Janez Starc; Stefano Pacifico
This paper shows the implementation and evaluation of the Entity Linking or Named Entity Disambiguation system used and developed at Bloomberg. In particular, we present and evaluate a methodology and a system that do not require the use of Wikipedia as a knowledge base or training corpus. We present how we built features for disambiguation algorithms from the Bloomberg News corpus, and how we employed them for both single-entity and joint-entity disambiguation into a Bloomberg proprietary knowledge base of people and companies. Experimental results show high quality in the disambiguation of the available annotated corpus.
A Two-Iteration Clustering Method to Reveal Unique and Hidden Characteristics of Items Based on Text Reviews BIBAFull-Text 637-642
  Alon Dayan; Osnat Mokryn; Tsvi Kuflik
This paper presents a new method for extracting unique features of items based on their textual reviews. The method consists of two similar iterations, each applying a weighting scheme and then clustering the resulting set of vectors. In the first iteration, restaurants of similar food genres are grouped together into clusters. The second iteration reduces the importance of common terms in each such cluster, and highlights those that are unique to each specific restaurant. Clustering the restaurants again, now according to their unique features, reveals very interesting connections between the restaurants.
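A minimal sketch of the two-iteration idea, under our own assumptions (TF-IDF weighting, cosine similarity, and a greedy threshold clustering; the paper's exact weighting and clustering choices are not specified here):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: {name: [tokens]} -> {name: {term: weight}}."""
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))
    n = len(docs)
    return {name: {t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for name, toks in docs.items()}

def cosine(u, v):
    num = sum(w * v[t] for t, w in u.items() if t in v)
    den = (math.sqrt(sum(w * w for w in u.values()))
           * math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0

def cluster(vecs, threshold=0.05):  # threshold is illustrative
    """Greedy single-pass clustering: join the first cluster whose seed
    member is similar enough, else start a new cluster."""
    clusters = []
    for name, vec in vecs.items():
        for c in clusters:
            if cosine(vec, vecs[c[0]]) >= threshold:
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def unique_features(docs, clusters):
    """Second iteration: recompute TF-IDF within each cluster, so terms shared
    by a genre (e.g. 'tuna' for sushi places) lose weight and unique terms stand out."""
    out = {}
    for members in clusters:
        out.update(tfidf({m: docs[m] for m in members}))
    return out

reviews = {  # toy review tokens, one document per restaurant
    "sushi_a": "fresh tuna rolls rice fresh fish".split(),
    "sushi_b": "tuna rice fish spicy ramen corner".split(),
    "pizza_a": "wood oven crust cheese basil".split(),
}
first = tfidf(reviews)                     # iteration 1: global weighting
groups = cluster(first)                    # genre-level clusters
second = unique_features(reviews, groups)  # iteration 2: unique features
```

In the second pass, "tuna" gets zero weight for both sushi restaurants (it is genre-common), while "spicy" survives as a feature unique to `sushi_b`.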
Knowledge Obtention Combining Information Extraction Techniques with Linked Data BIBAFull-Text 643-648
  Angel Luis Garrido; Pilar Blazquez; Maria G. Buey; Sergio Ilarri
Today, we can find a vast amount of textual information stored in proprietary data stores. The experience of searching for information in these systems could be remarkably improved by combining these private data stores with information supplied by the Internet, merging both data sources to obtain new knowledge. In this paper, we propose an architecture with the goal of automatically obtaining knowledge about entities (e.g., persons, places, organizations, etc.) from a set of natural text documents, building smart data from raw data. We have tested the system in the context of the news archive of a real Media Group.
Topic and Sentiment Unification Maximum Entropy Model for Online Review Analysis BIBAFull-Text 649-654
  Changlin Ma; Meng Wang; Xuewen Chen
Opinion mining is an important research topic in data mining. Many current methods are coarse-grained, which is problematic in practice due to insufficient feedback information and limited reference value. To address these problems, a novel topic and sentiment unification maximum entropy LDA model is proposed in this paper for fine-grained opinion mining of online reviews. In this model, a maximum entropy component is first added to the traditional LDA model to distinguish background words, aspect words and opinion words, and to realize both local and global extraction of these words. A sentiment layer is then inserted between the topic layer and the word layer, extending the proposed model to four layers. Sentiment polarity analysis is performed on the extracted aspect and opinion words to simultaneously acquire the sentiment polarity of the whole review and of each topic, which leads to a fine-grained topic-sentiment summary. Experimental results demonstrate the validity of the proposed model.
Tree Kernel-based Protein-Protein Interaction Extraction Considering both Modal Verb Phrases and Appositive Dependency Features BIBAFull-Text 655-660
  Changlin Ma; Yong Zhang; Maoyuan Zhang
Protein-protein interaction plays an important role in understanding biological processes. In order to resolve the parsing errors resulting from modal verb phrases and the noise introduced by appositive dependencies, an improved tree kernel-based PPI extraction method is proposed in this paper. Both modal verb and appositive dependency features are considered when defining processing rules that effectively optimize and expand the shortest dependency path between two proteins. On the basis of these rules, the optimized and expanded path is used to direct the pruning of the constituent parse tree, making the tree used for protein-protein interaction extraction more precise and concise. Experimental results show that the new method achieves better results on five commonly used corpora.
A Rule-Based Approach to Extracting Relations from Music Tidbits BIBAFull-Text 661-666
  Sergio Oramas; Mohamed Sordo; Luis Espinosa-Anke
This paper presents a rule-based approach to extracting relations from unstructured music text sources. The proposed approach identifies and disambiguates musical entities in text, such as songs, bands, persons, albums and music genres. Candidate relations are then obtained by traversing the dependency parse tree of each sentence in the text with at least two identified entities. A set of syntactic rules based on part-of-speech tags is defined to filter out spurious and irrelevant relations. The extracted entities and relations are finally represented as a knowledge graph. We test our method on texts from songfacts.com, a website that provides tidbits with facts and stories about songs. The extracted relations are evaluated intrinsically by assessing their linguistic quality, as well as extrinsically by assessing the extent to which they map to an existing music knowledge base. Our system produces a high percentage of linguistically correct relations between entities, and is able to replicate a significant part of the knowledge base.
An Architecture for Information Extraction from Figures in Digital Libraries BIBAFull-Text 667-672
  Sagnik Ray Choudhury; Clyde Lee Giles
Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data which is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. an extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. a search engine over the extracted figures and metadata; 3. an image processing module for automated data extraction from the figures; and 4. a natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, report an extractor algorithm to extract vector graphics from scholarly documents and a classification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10%, and the classification process is very scalable, yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures which is integrated with our search engine to improve user experience.
Semantic Construction Grammar: Bridging the NL / Logic Divide BIBAFull-Text 673-678
  Dave Schneider; Michael J. Witbrock
In this paper, we discuss Semantic Construction Grammar (SCG), a system developed over the past several years to facilitate translation between natural language and logical representations. Crucially, SCG is designed to support a variety of different methods of representation, ranging from those that are fairly close to the NL structure (e.g. so-called "logical forms"), to those that are quite different from the NL structure, with higher-order and high-arity relations. Semantic constraints and checks on representations are integral to the process of NL understanding with SCG, and are easily carried out due to the SCG's integration with Cyc's Knowledge Base and inference engine [1], [2].
Linking Stanford Typed Dependencies to Support Text Analytics BIBAFull-Text 679-684
  Fouad Zablith; Ibrahim H. Osman
With the daily increase of the amount of published information, research in the area of text analytics is gaining more visibility. Text processing for improving analytics is being studied from different angles. In the literature, text dependencies have been employed to perform various tasks. This includes for example the identification of semantic relations and sentiment analysis. We observe that while text dependencies can boost text analytics, managing and preserving such dependencies in text documents that spread across various corpora and contexts is a challenging task. We present in this paper our work on linking text dependencies using the Resource Description Framework (RDF) specification, following the Stanford typed dependencies representation. We contribute to the field by providing analysts the means to query, extract, and reuse text dependencies for analytical purposes. We highlight how this additional layer can be used in the context of feedback analysis by applying a selection of queries passed to a triple-store containing the generated text dependencies graphs.
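As a hedged illustration of the general idea (the URI scheme and predicate names below are ours, not the authors' published representation), Stanford-style typed dependencies can be serialized as RDF-style triples that a triple-store can then be queried over:

```python
def deps_to_triples(doc_id, sent_id, deps):
    """deps: list of (relation, (gov_word, gov_idx), (dep_word, dep_idx))
    typed dependencies for one sentence."""
    base = f"ex:{doc_id}/sent{sent_id}"
    triples = []
    for rel, (gov_word, gov_idx), (dep_word, dep_idx) in deps:
        gov = f"{base}/token{gov_idx}"
        dep = f"{base}/token{dep_idx}"
        triples.append((gov, f"sdep:{rel}", dep))      # the dependency itself
        triples.append((gov, "rdfs:label", gov_word))  # token surface forms
        triples.append((dep, "rdfs:label", dep_word))
    return triples

# "excellent service improves": nsubj(improves, service), amod(service, excellent)
deps = [("nsubj", ("improves", 3), ("service", 2)),
        ("amod", ("service", 2), ("excellent", 1))]
triples = deps_to_triples("doc1", 0, deps)
```

Once dependencies are addressable as triples, an analyst can query for, say, all adjectival modifiers of "service" across a whole feedback corpus rather than reparsing each document.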

LiLE 2015

LRMI, Learning Resource Metadata on the Web BIBAFull-Text 687
  Phil Barker; Lorna M. Campbell
The Learning Resource Metadata Initiative (LRMI) is a collaborative initiative that aims to make it easier for teachers and learners to find educational materials through major search engines and specialized resource discovery services. The approach taken by LRMI is to extend the schema.org ontology so that educationally significant characteristics and relationships can be expressed. This, of course, builds on a long history of developing metadata standards for learning resources. The context for LRMI, however, is different from these in several respects. LRMI builds on schema.org, and schema.org is designed as a means for marking up web pages to make them more intelligible to search engines; the aim is for it to be present in a significant proportion of pages on the web, that is, implemented at scale and not just by metadata professionals. LRMI may have applications that go beyond the core aims of schema.org: it is possible to create LRMI metadata that is independent of a web page, for example as JSON-LD records or as EPUB3 metadata.
   The approach of extending schema.org has several advantages, starting with the ability to focus on how best to describe the educational characteristics of resources while others focus on other specialist aspects of resource description. It also means that LRMI benefits from all the effort that goes into developing tools and community resources for schema.org. There are still some challenges for LRMI, one of which is particularly pertinent: describing the educational frameworks (e.g. common curricula or educational levels) to which learning resources align. LRMI has developed the means for expressing an alignment statement such as "this resource is useful for teaching subject X", but we need more work on how to refer to the subject in that statement. This is a challenge that conventional linked data for education could address.
TinCan2PROV: Exposing Interoperable Provenance of Learning Processes through Experience API Logs BIBAFull-Text 689-694
  Tom De Nies; Frank Salliau; Ruben Verborgh; Erik Mannens; Rik Van de Walle
A popular way to log learning processes is by using the Experience API (abbreviated as xAPI), also referred to as Tin Can. While Tin Can is great for developers who need to log learning experiences in their applications, it is more challenging for data processors to interconnect and analyze the resulting data. An interoperable data model is missing to raise Tin Can to its full potential. We argue that in essence, these learning process logs are provenance. Therefore, the W3C PROV model can provide the much-needed interoperability. In this paper, we introduce a method to expose PROV using Tin Can statements. To achieve this, we made the following contributions: (1) a formal ontology of the xAPI vocabulary, (2) a context document to interpret xAPI statements as JSON-LD, (3) a mapping to convert xAPI JSON-LD statements into PROV, and (4) a tool implementing this mapping.
   We preliminarily evaluate the approach by converting 20 xAPI statements taken from the public Tin Can Learning Record Store to valid PROV. Where the conversion succeeded, it did so without loss of valid information, therefore suggesting that the conversion process is reversible, as long as the original JSON is valid.
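A minimal sketch of such a mapping (illustrative only; the namespace and the exact PROV terms chosen here are our assumptions, not the authors' published ontology): the xAPI statement becomes a prov:Activity, its actor a prov:Agent, and its object a prov:Entity.

```python
def xapi_to_prov(stmt):
    """Map a minimal xAPI ("Tin Can") statement dict to PROV-style triples."""
    activity = f"ex:statement/{stmt['id']}"
    agent = f"ex:agent/{stmt['actor']['mbox']}"
    entity = stmt["object"]["id"]
    triples = [
        (activity, "rdf:type", "prov:Activity"),
        (agent, "rdf:type", "prov:Agent"),
        (entity, "rdf:type", "prov:Entity"),
        (activity, "prov:wasAssociatedWith", agent),   # actor did the activity
        (activity, "prov:used", entity),               # on this learning object
    ]
    if "timestamp" in stmt:
        triples.append((activity, "prov:endedAtTime", stmt["timestamp"]))
    return triples

stmt = {  # a minimal Tin Can statement with hypothetical values
    "id": "abc-123",
    "actor": {"mbox": "mailto:learner@example.org"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/completed"},
    "object": {"id": "http://example.org/course/intro"},
    "timestamp": "2015-05-18T10:00:00Z",
}
triples = xapi_to_prov(stmt)
```

Because every field of the statement maps to a distinct triple, such a conversion is reversible as long as the input JSON is valid, matching the observation above.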
ECOLE: Student Knowledge Assessment in the Education Process BIBAFull-Text 695-700
  Dmitry Mouromtsev; Fedor Kozlov; Liubov Kovriguina; Olga Parkhimovich
The paper concerns the estimation of students' knowledge based on their learning results in the ECOLE system. ECOLE is an online eLearning system whose functionality is based on several ontologies. The system interlinks terms from different courses and domains and calculates several educational rates: term knowledge rate, total knowledge rate, domain knowledge rate and term significance rate. All of these rates are used to give students recommendations about the activities they have to undertake to pass a course successfully.
Linking a Community Platform to the Linked Open Data Cloud BIBAFull-Text 701-703
  Enayat Rajabi; Ivana Marenzi
Linked Data promises access to a vast amount of resources for learners and teachers. Various research projects have focused on providing educational resources as Linked Data. In many of these projects the focus has been on interoperability of metadata and on linking them into the linked data cloud. In this paper we focus on the community aspect. We start from the observation that sharing data is most valuable within communities of practice with common interests and goals, and community members are interested in suitable resources to be used in specific learning scenarios. The community of practice we are focusing on is an English language teaching and learning community, which we have been supporting through the LearnWeb2.0 platform for the last two years. We analyse the requirements of this specific community as a basis to enrich the currently collected materials with open educational resources taken from the Linked Data Cloud. To this aim, we applied an interlinking approach to enrich the learning resources exposed as RDF (Resource Description Framework) in the LearnWeb2.0 platform with additional information taken from the Web.
Towards Analysing the Scope and Coverage of Educational Linked Data on the Web BIBAFull-Text 705-710
  Davide Taibi; Giovanni Fulantelli; Stefan Dietze; Besnik Fetahu
The diversity of datasets published according to Linked Data (LD) principles has increased in the last few years and also led to the emergence of a wide range of data suitable in educational settings. However, sufficient insights into the state, coverage and scope of available educational Linked Data seem to be missing, for instance, about represented resource types or domains and topics. In this work, we analyse the scope and coverage of educational linked data on the Web, identifying the most popular resource types and topics, apparent gaps and underlining the strong correlation of resource types and topics. Our results indicate a prevalent bias towards data in areas such as the life sciences as well as computing-related topics.
Interconnecting and Enriching Higher Education Programs using Linked Data BIBAFull-Text 711-716
  Fouad Zablith
Online environments are increasingly used as platforms to support and enhance learning experiences. In higher education, students enroll in programs that are usually formed of a set of courses and modules. Such courses are designed to cover a set of concepts and achieve specific learning objectives that count towards the related degree. However we observe that connections among courses and the way they conceptually interlink are hard to exploit. This is normal as courses are traditionally described using text in the form of documents such as syllabi and course catalogs. We believe that linked data can be used to create a conceptual layer around higher education programs to interlink courses in a granular and reusable manner. We present in this paper our work on creating a semantic linked data layer to conceptually connect courses taught in a higher education program. We highlight the linked data model we created to be collaboratively extended by course instructors and students using a semantic Mediawiki platform. We also present two applications that we built on top of the data to (1) showcase how learning material can now float around courses through their interlinked concepts in eLearning environments (we use moodle as a proof of concept); and (2) to support the process of higher education program reviews.

LIME 2015

From Script Idea to TV Rerun: The Idea of Linked Production Data in the Media Value Chain BIBAFull-Text 719-720
  Harald Sack
Within the production process of a film or TV program, a significant amount of metadata is created and, most of the time, lost again. As a consequence, most of this valuable information has to be recreated at great cost in subsequent steps of media production, distribution, and archival. Moreover, there is no commonly used metadata exchange format spanning all steps of the media value chain, and the technical systems and software applications used in the media production process often have proprietary interfaces for data exchange. In the course of the D-Werft project funded by the German government, metadata exchange throughout the media value chain is to be fostered by the application of Linked Data principles. Starting with the idea for a script, metadata from existing systems and applications will be mapped to ontologies to be reused in subsequent production steps. Also for distribution and archival, metadata collected during the production process is a valuable asset to be reused for semantic and exploratory search as well as for intelligent movie recommendation and customized advertising.
Enabling access to Linked Media with SPARQL-MM BIBAFull-Text 721-726
  Thomas Kurz; Kai Schlegel; Harald Kosch
The amount of audio, video and image data on the web is growing immensely, which leads to data management problems owing to the opaque character of multimedia. Interlinking semantic concepts and media data, with the aim of bridging the gap between the document web and the Web of Data, has therefore become a common practice and is known as Linked Media. However, the value of connecting media to its semantic metadata is limited by the lack of access methods specialized for media assets and fragments, as well as by the variety of description models in use. With SPARQL-MM we extend SPARQL, the standard query language for the Semantic Web, with media-specific concepts and functions to unify access to Linked Media. In this paper we describe the motivation for SPARQL-MM, present the state of the art in Linked Media description formats and multimedia query languages, and outline the specification and implementation of the SPARQL-MM function set.
Defining and Evaluating Video Hyperlinking for Navigating Multimedia Archives BIBAFull-Text 727-732
  Roeland J. F. Ordelman; Maria Eskevich; Robin Aly; Benoit Huet; Gareth Jones
Multimedia hyperlinking is an emerging research topic in the context of digital libraries and (cultural heritage) archives. We have been studying the concept of video-to-video hyperlinking from a video search perspective in the context of the MediaEval evaluation benchmark for several years. Our task considers a use case of exploring large quantities of video content via an automatically created hyperlink structure at the media fragment level. In this paper we report on our findings, examine the features of the definition of video hyperlinking based on results, and discuss lessons learned with respect to evaluation of hyperlinking in real-life use scenarios.
The TIB|AV Portal as a Future Linked Media Ecosystem BIBAFull-Text 733-734
  Paloma Marín Arraiza; Sven Strobel
Various techniques for video analysis, concept mapping, semantic search and metadata management are among the current features of the TIB|AV Portal, as described in this demo. Segment identification and ontology annotation make the portal a good platform to support Linked Data and Linked Media. Weaving in a machine-readable metadata format will complete this task.
MICO: Towards Contextual Media Analysis BIBAFull-Text 735-736
  Sergio Fernández; Sebastian Schaffert; Thomas Kurz
With the tremendous increase in multimedia content on the Web and in corporate intranets, discovering hidden meaning in raw multimedia is becoming one of the biggest challenges. Analysing multimedia content is still in its infancy, requires expert knowledge, and the few available products carry excessive price tags while still not delivering sufficient quality for many tasks. This makes it hard, especially for small and medium-sized enterprises, to make use of this technology. In addition, analysis components typically operate in isolation and do not consider the context (e.g. the embedding text) of a media resource. This paper presents how MICO tries to address these problems by providing an open source service platform that allows media to be analysed in context and includes various analysis engines for video, images, audio, text, link structure and metadata.
Automating Annotation of Media with Linked Data Workflows BIBAFull-Text 737-738
  Thomas Wilmering; Kevin Page; György Fazekas; Simon Dixon; Sean Bechhofer
Computational feature extraction provides one means of gathering structured analytic metadata for large media collections. We demonstrate a suite of tools we have developed that automate the process of feature extraction from audio in the Internet Archive. The system constructs an RDF description of the analysis workflow and results which is then reconciled and combined with Linked Data about the recorded performance. This Linked Data and provenance information provides the bridging information necessary to employ analytic output in the generation of structured metadata for the underlying media files, with all data published within the same description framework.

LocWeb 2015

Chatty, Happy, and Smelly Maps BIBAFull-Text 741
  Daniele Quercia
Mapping apps are the greatest game-changer for encouraging people to explore the city. You take your phone out and you know immediately where to go. However, the app also assumes there are only a handful of directions to the destination, and it has the power to make that handful of directions the definitive directions to that destination. A few years ago, my research started to focus on understanding how people psychologically experience the city. I used computer science tools to replicate social science experiments at scale, at web scale [4,5]. I became captivated by the beauty and genius of traditional social science experiments done by Jane Jacobs, Stanley Milgram, and Kevin Lynch [1,2,3]. The result of that research has been the creation of new maps, maps where one does not only find the shortest path but also the most enjoyable path [6,9].
   We did so by building a new city map weighted for human emotions. On this cartography, one is not only able to connect point A to point B via the shortest segments, but one is also able to see the happy path, the beautiful path, the quiet path. In tests, participants found the happy, the beautiful, and the quiet paths far more enjoyable than the shortest one, at the cost of adding only a few minutes to travel time. Participants also recalled how some paths smelled and sounded. So what if we had a mapping tool that would return the most enjoyable routes based not only on aesthetics but also on smell and sound? That is the research question this talk will start to address [7,8].
Verification of POI and Location Pairs via Weakly Labeled Web Data BIBAFull-Text 743-748
  Hsiu-Min Chuang; Chia-Hui Chang
With the increased popularity of mobile devices and smart phones, location-based services (LBS) have become a common need in our daily life. Therefore, maintaining the correctness of POI (Point of Interest) data has become an important issue for many location-based services such as Google Maps and Garmin navigation systems. The simplest form of POI contains a location (e.g., represented by an address) and an identifier (e.g., an organization name) that describes the location. As time goes by, the POI relationship of a location and organization pair may change due to the opening, moving, or closing of a business. Thus, effectively identifying outdated or emerging POI relations is an important issue for improving the quality of POI data. In this paper, we examine the possibility of using location-related pages on the Web to verify existing POI relations via weakly labeled data, e.g., the co-occurrence of an organization and an address in Web pages, the published date of such pages, and the pairing diversity of an address or an organization. The preliminary results show a promising direction for discovering emerging POIs and mandate more research on outdated POIs.
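The weak signals mentioned above can be sketched as simple features for a candidate (organization, address) pair. The feature names and the naive gazetteer matching below are our own illustration, not the authors' actual feature set:

```python
from datetime import date

def extract_addresses(text, gazetteer):
    """Naive placeholder: match against a fixed list of known address strings."""
    return [a for a in gazetteer if a in text]

def poi_features(org, addr, pages, gazetteer):
    """pages: list of (text, published_date). Returns weak-evidence features
    for the candidate (org, addr) POI relation."""
    support = [d for text, d in pages if org in text and addr in text]
    paired = {a for text, _ in pages if org in text
              for a in extract_addresses(text, gazetteer)}
    return {
        "cooccurrence_count": len(support),            # how often org and addr co-occur
        "latest_support": max(support, default=None),  # recency of the evidence
        "pairing_diversity": len(paired),              # distinct addresses seen with org
    }

gazetteer = ["1 Main St", "9 Oak Ave"]
pages = [
    ("Cafe Roma has moved to 9 Oak Ave", date(2015, 3, 1)),
    ("Cafe Roma, 1 Main St, open daily", date(2012, 6, 5)),
]
feats = poi_features("Cafe Roma", "1 Main St", pages, gazetteer)
```

A stale pair shows up exactly as in this toy example: the evidence for "1 Main St" is old, and the organization now pairs with a newer address, suggesting an outdated POI relation.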
Reconnecting Digital Publications to the Web using their Spatial Information BIBAFull-Text 749-754
  Ben De Meester; Tom De Nies; Ruben Verborgh; Erik Mannens; Rik Van de Walle
Digital publications can be packaged and viewed via the Open Web Platform using the EPUB 3 format. Meanwhile, the growing number of mobile clients and the advent of HTML5's Geolocation API have opened a whole range of possibilities for digital publications to interact with their readers. However, EPUB 3 files often remain closed silos of information, no longer linked with the rest of the Web. In this paper, we propose a solution to reconnect digital publications with the (Semantic) Web. We also show how that connection can be used to improve contextualization for a user, specifically via spatial information. We enrich digital publications by connecting the detected concepts to their URIs on, e.g., DBpedia, and by devising an algorithm to approximate the location of any detected concept, we can provide users with the spatial center of gravity of their reading position. The evaluation of the location approximation algorithm showed high recall, and the high correlation between estimation error and standard deviation can provide the user with a sense of the correctness (or spread) of an approximation. This means relevant locations (and their possible radius) can be shown to users, based on the content they are reading and on their location. This methodology can be used to reconnect digital publications with the online world, to entice readers, and ultimately, as a novel location-based recommendation technique.
The Role of Geographic Information in News Consumption BIBAFull-Text 755-760
  Gebrekirstos G. Gebremeskel; Arjen P. de Vries
We investigate the role of geographic proximity in news consumption. Using a month-long log of user interactions with news items of ten information portals, we study the relationship between users' geographic locations and the geographic foci of information portals and local news categories. We find that the location of news consumers correlates with the geographical information of the information portals at two levels: the portal and the local news category. At the portal level, traditional mainstream news portals have a more geographically focused readership than special interest portals, such as sports and technology. At a finer level, the mainstream news portals have local news sections that have even more geographically focused readerships.

MSM 2015

Are We Really Friends?: Link Assessment in Social Networks Using Multiple Associated Interaction Networks BIBAFull-Text 771-776
  Mohammed Abufouda; Katharina A. Zweig
Many complex network systems suffer from noise that disguises the structure of the network and hinders an accurate analysis of these systems. Link assessment is the process of identifying and eliminating the noise from network systems in order to better understand these systems. In this paper, we address the link assessment problem in social networks that may suffer from noisy relationships. We employed a machine learning classifier for assessing the links in the social network of interest using the data from the associated interaction networks around it. The method was tested with two different data sets, each containing the social network of interest (with ground truth) along with the associated interaction networks. The results showed that it is possible to effectively assess the links of a social network using only the structure of a single network of the associated interaction networks, and also using the structure of the whole set of the associated interaction networks. The experiments also revealed that the assessment performance using only the structure of the social network of interest is relatively less accurate than using the associated interaction networks. This indicates that link formation in the social network of interest is not only driven by the internal structure of the social network, but also influenced by the external factors provided in the associated interaction networks.
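As a rough illustration of the kind of structural features such a link-assessment classifier might consume, the sketch below computes common-neighbor, Jaccard, and Adamic-Adar scores for a candidate link from each associated interaction network. The feature names, toy networks, and feature choices here are illustrative assumptions, not the paper's actual feature set:

```python
import math

def link_features(u, v, interaction_networks):
    """Structural features for a candidate edge (u, v), computed from each
    associated interaction network, given as symmetric adjacency dicts
    (node -> set of neighbours).  Feature names are illustrative only."""
    feats = {}
    for name, adj in interaction_networks.items():
        nu, nv = adj.get(u, set()), adj.get(v, set())
        common = nu & nv
        union = nu | nv
        feats[f"{name}:common_neighbors"] = len(common)
        feats[f"{name}:jaccard"] = len(common) / len(union) if union else 0.0
        # Adamic-Adar: shared neighbours with few connections count more
        # than highly connected (and thus less informative) ones.
        feats[f"{name}:adamic_adar"] = sum(
            1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1
        )
    return feats

# Toy example: two interaction networks over the same user set.
nets = {
    "email": {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}},
    "chat":  {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}},
}
print(link_features("a", "b", nets))
```

Feature dictionaries like this, built per candidate edge across all interaction networks, would then be fed to any off-the-shelf binary classifier with ground-truth link labels.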
This is your Twitter on drugs: Any questions? BIBAFull-Text 777-782
  Cody Buntain; Jennifer Golbeck
Twitter can be a rich source of information when one wants to monitor trends related to a given topic. In this paper, we look at how tweets can augment a public health program that studies emerging patterns of illicit drug use. We describe the architecture necessary to collect vast numbers of tweets over time based on a large number of search terms and the challenges that come with finding relevant information in the collected tweets. We then show several examples of early analysis we have done on this data, examining temporal and geospatial trends.
Using Context to Get Novel Recommendation in Internet Message Streams BIBAFull-Text 783-786
  Doina Alexandra Dumitrescu; Simone Santini
Novelty detection algorithms usually employ similarity measures against previously seen, relevant documents to decide whether a document is of interest to the user. The problem that arises with this approach is that the system might recommend redundant documents. Thus, it has become extremely important to be able to distinguish between "redundant" and "novel" information. To address this limitation, we apply a contextual and semantic approach, building the user profile using self-organizing maps, which have the advantage of easily following changes in the user's interests.
Determining Influential Users with Supervised Random Walks BIBAFull-Text 787-792
  Georgios Katsimpras; Dimitrios Vogiatzis; Georgios Paliouras
The emergence of social media and the enormous growth of social networks have initiated a great amount of research in social influence analysis. In this regard, many approaches take into account only structural information while a few have also incorporated content. In this study we propose a new method to rank users according to their topic-sensitive influence which utilizes a priori information by employing supervised random walks. We explore the use of supervision in a PageRank-like random walk while also exploiting textual information from the available content. We perform a set of experiments on Twitter datasets and evaluate our findings.
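The core mechanic of a supervised random walk can be sketched as a PageRank-style iteration whose transition probabilities come from a learned edge-strength function over edge features. In the sketch below the logistic parameters are simply fixed rather than learned, and the graph and feature semantics are invented for illustration; the paper's actual training procedure and features are not specified in the abstract:

```python
import math

def edge_strength(features, theta):
    # Logistic edge strength: sigmoid(theta . psi_uv).  In a supervised
    # random walk, theta is learned from known influential users; here it
    # is simply supplied.
    s = sum(t * f for t, f in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-s))

def supervised_walk(adj_feats, theta, alpha=0.15, iters=100):
    """adj_feats: {u: {v: feature_vector}}.  Returns a stationary score
    per node from a PageRank-style walk with uniform restart prob alpha."""
    nodes = sorted(set(adj_feats) | {v for nb in adj_feats.values() for v in nb})
    n = len(nodes)
    w = {u: {v: edge_strength(f, theta) for v, f in nb.items()}
         for u, nb in adj_feats.items()}
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: alpha / n for v in nodes}
        for u in nodes:
            out = w.get(u, {})
            total = sum(out.values())
            if total == 0:                       # dangling node: restart
                for v in nodes:
                    nxt[v] += (1 - alpha) * r[u] / n
            else:
                for v, wt in out.items():
                    nxt[v] += (1 - alpha) * r[u] * wt / total
        r = nxt
    return r

# Toy graph: edges carry (content-similarity, retweet-count) features.
g = {"a": {"b": [0.9, 2.0], "c": [0.1, 0.0]},
     "b": {"c": [0.8, 1.0]},
     "c": {"a": [0.5, 1.0]}}
print(supervised_walk(g, theta=[1.0, 0.5]))
```

Replacing the fixed theta with parameters optimized so that known influencers rank above non-influencers is what makes the walk "supervised"; incorporating textual content would enter through the edge features.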
Community Change Detection in Dynamic Networks in Noisy Environment BIBAFull-Text 793-798
  Sadamori Koujaku; Mineichi Kudo; Ichigaku Takigawa; Hideyuki Imai
Detection of anomalous changes in social networks has been studied in various applications, such as detecting changes in social interests and virus infections. Among several kinds of network changes, we concentrate on structural changes of relatively small stationary communities. Such a change is important because it implies that some crucial change has happened in a special group, such as the dismissal of a board of directors. One difficulty is that we have to do this in a noisy environment. This paper therefore proposes an algorithm that finds stationary communities in a noisy environment. Experiments on two real networks showed the advantages of our proposed algorithm.
Locally Adaptive Density Ratio for Detecting Novelty in Twitter Streams BIBAFull-Text 799-804
  Yun-Qian Miao; Ahmed K. Farahat; Mohamed S. Kamel
With the massive growth of social data, much attention has been given to the task of detecting key topics in the Twitter stream. In this paper, we propose the use of novelty detection techniques for identifying both emerging and evolving topics in new tweets. Specifically, we propose a locally adaptive approach for density-ratio estimation in which the density ratio between new and reference data is used to capture evolving novelties, while a locally adaptive kernel is employed in the density-ratio objective function to capture emerging novelties based on the local neighborhood structure. In order to address the challenges associated with short text, we adopt an efficient approach for calculating semantic kernels with the proposed density-ratio method. A comparison to different methods shows the superiority of the proposed algorithm.
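The density-ratio machinery underneath can be sketched with a standard least-squares density-ratio fit (uLSIF-style): model the ratio of new-data density to reference-data density as a kernel expansion and solve a regularized linear system. This sketch substitutes a single global Gaussian bandwidth for the paper's locally adaptive kernel, and the toy "tweet features" are synthetic:

```python
import numpy as np

def ulsif_ratio(x_nu, x_de, centers, sigma=1.0, lam=1e-3):
    """Least-squares density-ratio fit: r(x) = sum_l alpha_l K(x, c_l)
    models p_nu(x) / p_de(x) with Gaussian kernels at the given centers."""
    def K(X):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2 * sigma ** 2))
    Phi_de, Phi_nu = K(x_de), K(x_nu)
    H = Phi_de.T @ Phi_de / len(x_de)     # second moment under denominator
    h = Phi_nu.mean(axis=0)               # first moment under numerator
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda X: np.clip(K(X) @ alpha, 0.0, None)

rng = np.random.default_rng(0)
x_ref = rng.normal(0.0, 1.0, size=(200, 2))   # reference tweets (features)
x_new = rng.normal(1.5, 1.0, size=(200, 2))   # new tweets, shifted topic
r = ulsif_ratio(x_new, x_ref, centers=x_new[:50])
# High ratio values flag regions dense in the new data but rare in the
# reference data, i.e. candidate novelties.
print(r(x_new).mean() > r(x_ref).mean())
```

A locally adaptive variant would let sigma vary per center according to the local neighborhood, which is the refinement the paper introduces.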
Short-Text Clustering using Statistical Semantics BIBAFull-Text 805-810
  Sepideh Seifzadeh; Ahmed K. Farahat; Mohamed S. Kamel; Fakhri Karray
Short documents are typically represented by very sparse vectors in the space of terms. In this case, traditional techniques for calculating text similarity result in measures that are very close to zero, since even very similar documents have few or no terms in common. In order to alleviate this limitation, the representation of short-text segments should be enriched by incorporating information about correlation between terms. In other words, if two short segments have no common words, but terms from the first segment frequently appear with terms from the second segment in other documents, the segments are semantically related and their similarity measure should be high. Towards achieving this goal, we employ a method for enhancing document clustering using statistical semantics. However, the problem of high computation time arises when calculating correlation between all terms. In this work, we propose selecting a few terms and using them with the Nyström method to approximate the term-term correlation matrix. The selection of terms for the Nyström method is performed by randomly sampling terms with probabilities proportional to the lengths of their vectors in the document space. This allows more important terms to have more influence on the approximation of the term-term correlation matrix and accordingly achieves better accuracy.
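The Nyström step described above can be sketched directly: sample landmark terms with probability proportional to their vector lengths, then approximate the full term-term correlation matrix from the landmark columns. The synthetic low-rank term matrix below is only for demonstration:

```python
import numpy as np

def nystrom_term_correlation(X, m, rng=None):
    """Approximate G = X @ X.T (rows of X are terms in document space)
    via the Nystrom method with m landmark terms, sampled with
    probability proportional to row vector lengths."""
    rng = rng or np.random.default_rng(0)
    lengths = np.sqrt((X ** 2).sum(axis=1))
    p = lengths / lengths.sum()
    idx = rng.choice(len(X), size=m, replace=False, p=p)
    C = X @ X[idx].T                  # n x m: correlations with landmarks
    W = C[idx]                        # m x m: landmark-landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(1)
# Rank-10 toy term matrix: 100 terms in a 40-document space.
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 40))
G = X @ X.T
G_hat = nystrom_term_correlation(X, m=15)
err = np.linalg.norm(G - G_hat) / np.linalg.norm(G)
print(f"relative error: {err:.2e}")
```

Because the toy matrix has rank 10, 15 well-chosen landmarks recover the correlation matrix almost exactly; on real term matrices the approximation quality depends on how fast the spectrum decays and on which terms are sampled, which is why the length-proportional sampling matters.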
A Novel Agent-Based Rumor Spreading Model in Twitter BIBAFull-Text 811-814
  Emilio Serrano; Carlos Ángel Iglesias; Mercedes Garijo
Viral marketing, i.e. marketing techniques that use pre-existing social networks, has grown significantly in recent years. In this scope, Twitter is the most studied social network in viral marketing, and rumor spreading is a widely researched problem. This paper contributes (1) a novel agent-based social simulation model for rumor spreading in Twitter. This model relies on the hypothesis that (2) when a user is recovered, this user will not influence his or her neighbors in the social network to recover. To support this hypothesis: (3) two Twitter rumor datasets are studied; (4) a baseline model which does not include the hypothesis is revised, reproduced, and implemented; (5) and a number of experiments are conducted comparing the real data with the two models' results.
Popularity and Quality in Social News Aggregators: A Study of Reddit and Hacker News BIBFull-Text 815-818
  Greg Stoddard
Modeling Information Diffusion in Social Media as Provenance with W3C PROV BIBAFull-Text 819-824
  Io Taxidou; Tom De Nies; Ruben Verborgh; Peter M. Fischer; Erik Mannens; Rik Van de Walle
In recent years, research in information diffusion in social media has attracted a lot of attention, since the produced data is fast, massive and viral. Additionally, the provenance of such data is equally important because it helps to judge the relevance and trustworthiness of the information enclosed in the data. However, social media currently provide insufficient mechanisms for provenance, while models of information diffusion use their own concepts and notations, targeted to specific use cases. In this paper, we propose a model for information diffusion and provenance, based on the W3C PROV Data Model. The advantage is that PROV is a Web-native and interoperable format that allows easy publication of provenance data, and minimizes the integration effort among different systems making use of PROV.
Study on the Relationship between Profile Images and User Behaviors on Twitter BIBAFull-Text 825-828
  Tomu Tominaga; Yoshinori Hijikata
In recent years, many researchers have studied the characteristics of Twitter, which is a microblogging service used by a large number of people worldwide. However, to the best of our knowledge, no study has yet been conducted to study the relationship between profile images and user behaviors on Twitter. We assume that the profile images and behaviors of users are influenced by their internal properties, because users consider their profile images as symbolic representations of themselves on Twitter. We empirically categorized profile images into 13 types, and investigated the relationships between each category of profile images and users' behaviors on Twitter.
PTHMM: Beyond Single Specific Behavior Prediction BIBAFull-Text 829-832
  Suncong Zheng; Hongyun Bao; Guanhua Tian; Yufang Wu; Bo Xu; Hongwei Hao
Existing work on user behavior analysis mainly focuses on modeling a single behavior and predicting whether a user will take an action or not. However, users' behaviors do not always happen in isolation; sometimes, different behaviors may happen simultaneously. Therefore, in this paper, we analyze combinations of basic behaviors, called behavioral states here, which can describe users' complex behaviors comprehensively. We propose a model, called the Personal Timed Hidden Markov Model (PTHMM), to address the problem by considering the time-interval information of users' behaviors and users' personalization. Experimental results on Sina Weibo demonstrate the effectiveness of the model. They also show that users' behavioral states are affected by their historical behaviors, and that the influence of historical behaviors declines over time.

MWA 2015

Multilingual Word Sense Induction to Improve Web Search Result Clustering BIBAFull-Text 835-839
  Lorenzo Albano; Domenico Beneventano; Sonia Bergamaschi
In [Marco2013] a novel approach to Web search result clustering based on Word Sense Induction, i.e. the automatic discovery of word senses from raw text, was presented; key to the proposed approach is the idea of, first, automatically inducing senses for the target query and, second, clustering the search results based on their semantic similarity to the induced word senses. In [1] we proposed an innovative Word Sense Induction method based on multilingual data; key to our approach was the idea that a multilingual context representation, where the context of a word is expanded by considering its translations in different languages, may improve the WSI results; the experiments showed a clear performance gain. In this paper we give some preliminary ideas on applying our multilingual Word Sense Induction method to Web search result clustering.
Document Categorization using Multilingual Associative Networks based on Wikipedia BIBAFull-Text 841-846
  Niels Bloom; Mariët Theune; Franciska De Jong
Associative networks are a connectionist language model with the ability to categorize large sets of documents. In this research we combine monolingual associative networks based on Wikipedia to create a larger, multilingual associative network, using the cross-lingual connections between Wikipedia articles. We prove that such multilingual associative networks perform better than monolingual associative networks in tasks related to document categorization by comparing the results of both types of associative network on a multilingual dataset.
Exceptional Texts On The Multilingual Web BIBAFull-Text 847-851
  Gavin Brelstaff; Francesca Chessa
Great writers help keep a language efficient for discourse of all kinds. In doing so they produce exceptional texts which may defy Statistical Machine Translation by employing uncommon idiom. Such "turns of phrase" can enter into a Nation's collective memory and form the basis from which compassion and conviction are conveyed during important national discourse. Communities that unite across language barriers have no such robust basis for discourse. Here we describe a Multilingual Web prototype application that promotes appreciation of exceptional texts by non-native readers. The application allows dual column original/translation texts (in Open Office format) to be imported into the translator's browser, to be manually aligned for semantic correspondence, to be aligned with an audio reading, and then saved as HTML5 for subsequent presentation to non-native readers. We hope to provide a new way of experiencing exceptional texts (poetry, here) that transmits their significance without incurring extraneous distraction. We motivate, outline and illustrate our application in action.
"Answer ka type kya he?": Learning to Classify Questions in Code-Mixed Language BIBAFull-Text 853-858
  Khyathi Chandu Raghavi; Manoj Kumar Chinnakotla; Manish Shrivastava
Code-Mixing (CM) is defined as the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language. CM is a natural phenomenon observed in many multilingual societies. It helps in speeding up communication and allows a wider variety of expression, due to which it has become a popular mode of communication in social media forums like Facebook and Twitter. However, current Question Answering (QA) research and systems only support expressing a question in a single language, which is an unrealistic and hard proposition especially for certain domains like health and technology. In this paper, we take the first step towards the development of a full-fledged QA system in CM language, which is building a Question Classification (QC) system. The QC system analyzes the user question and infers the expected Answer Type (AType). The AType helps in locating and verifying the answer as it imposes certain type-specific constraints. We learn a basic Support Vector Machine (SVM) based QC system for English-Hindi CM questions. Due to the inherent complexities involved in processing CM language and the unavailability of language processing resources such as POS taggers, chunkers, and parsers, we design our current system using only word-level resources such as language identification, transliteration, and lexical translation. To reduce data sparsity and leverage resources available in a resource-rich language, instead of extracting features directly from the original CM words, we first translate them into English and then perform featurization. We created an evaluation dataset for this task, and our system achieves an accuracy of 63% and 45% in the coarse-grained and fine-grained categories of the question taxonomy, respectively. The idea of translating features into English indeed helps in improving accuracy over the unigram baseline.
A Comparative Study of Online Translation Services for Cross Language Information Retrieval BIBAFull-Text 859-864
  Ali Hosseinzadeh Vahid; Piyush Arora; Qun Liu; Gareth J. F. Jones
Technical advances and increasing availability mean that Machine Translation (MT) is now widely used for the translation of search queries in multilingual search tasks. A number of free-to-use high-quality online MT systems are now available and, although imperfect in their translation behavior, are found to produce good performance in Cross-Language Information Retrieval (CLIR) applications. Users of these MT systems in CLIR tasks generally assume that they all behave similarly in CLIR applications, and the choice of MT system is often made on the basis of convenience. We present a set of experiments which compare the impact of applying two of the best known online systems, Google and Bing translation, for query translation across multiple language pairs and for two very different CLIR tasks. Our experiments show that the MT systems perform differently on average for different tasks and language pairs, but more significantly for different individual queries. We examine the differing translation behavior of these tools and seek to draw conclusions in terms of their suitability for use in different settings.
Understanding Multilingual Social Networks in Online Immigrant Communities BIBAFull-Text 865-870
  Evangelos Papalexakis; A. Seza Dogruöz
There are more multilingual speakers in the world than monolingual ones. Immigration is one of the key factors to bring speakers of different languages in contact with each other. In order to develop relevant policies and recommendations tailored according to the needs of immigrant communities, it is essential to understand the interactions between the users within and across sub-communities. Using a novel method (tensor analysis), we reveal the social network structure of an online multilingual discussion forum which hosts an immigrant community in the Netherlands. In addition to the network structure, we automatically discover and categorize monolingual and bilingual sub-communities and track their formation, evolution and dissolution over a long period of time.
Exploring Current Accessibility Challenges in the Multilingual Web for Visually-Impaired Users BIBAFull-Text 871-873
  Silvia Rodríguez Vázquez
The Web is an open network accessed by people across countries, languages and cultures, irrespective of their functional diversity. Over the last two decades, interest about web accessibility issues has significantly increased among web professionals, but people with disabilities still encounter significant difficulties when browsing the Internet. In the particular case of blind users, the use of assistive technologies such as screen readers is key to navigate and interact with web content. Although research efforts made until now have led to a better understanding of visually-impaired users' browsing behavior and, hence, the definition of web design best practices for an improved user experience by this population group, the particularities of websites with multiple language versions have been mostly overlooked. This communication paper seeks to shed light on the major challenges faced by visually impaired users when accessing the multilingual web, as well as on why and how the web localization community should contribute to a more accessible web for all.
Online Searching in English as a Foreign Language BIBAFull-Text 875-880
  Gyöngyi Rózsa; Anita Komlodi; Peng Chu
Online searching is a central element of internet users' information behaviors. Searching is usually executed in a user's native language, but searching in English as a foreign language is often necessitated by the lack of content in languages that are underrepresented in Web content. This paper reports results from a study of searching in English as a foreign language and aims at understanding this particular group of users' behaviors. Searchers whose native language is not English may have to resort to queries in English in support of their information needs due to the lack or low quality of the web content in their own language. However, when searching for information in a foreign language, users face a unique set of challenges that are not present for native language searching. We studied this problem through qualitative research methods and report results from focus groups in this paper. The results reported in this paper describe typical problems foreign language searchers face, the differences in information-seeking behavior in English and in the participants' native language, and advice and ideas shared by the focus group participants about how to search effectively and efficiently in English.

NewsWWW 2015

Supply and Demand: Propagation and Absorption of News BIBAFull-Text 883
  Anastassia Fedyk
The importance of the media for individual and market behavior cannot be overstated. For example, a front-page article in the New York Times that mostly reprints information from six months prior can cause a company's stock price to jump by over 300%. To better understand the channels through which the media affects markets and the resulting implications for news production, we study how individuals process information in news. Do readers display a preference for news with a positive slant? Are consumers of news segregated based on the media outlets they favor? Do individuals recognize which news is novel, and which simply reprints old information? While these questions are grounded in fundamental human psychology, they are also inextricably linked to the rapidly changing technology of news production. With over a million stories a day passing through the Bloomberg terminal alone, the volume of data -- both on the content of news and the behavior of readers -- has skyrocketed. As a result, analysis of media production and consumption requires ever more sophisticated techniques for identifying the informational value of news and the behavioral patterns of its modern readers.
Scalable Preference Learning from Data Streams BIBAFull-Text 885-890
  Fabon Dzogang; Thomas Lansdall-Welfare; Saatviga Sudhahar; Nello Cristianini
We study the task of learning the preferences of online readers of news, based on their past choices. Previous work has shown that it is possible to model this situation as a competition between articles, where the most appealing articles of the day are those selected by the most users. The appeal of an article can be computed from its textual content, and the evaluation function can be learned from training data. In this paper, we show how this task can benefit from an efficient algorithm, based on hashing representations, which enables it to be deployed on high intensity data streams. We demonstrate the effectiveness of this approach on four real world news streams, compare it with standard approaches, and describe a new online demonstration based on this technology.
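The hashing representation that makes such a model stream-friendly can be sketched in a few lines: tokens are mapped into a fixed number of buckets so no growing vocabulary has to be maintained. The bucket count, signing scheme, and scoring function below are generic hashing-trick choices, not necessarily the paper's:

```python
import hashlib

def hashed_features(text, n_buckets=2 ** 18):
    """Hashing-trick representation of an article: each token is hashed to
    a fixed-size sparse vector, so memory stays bounded on a data stream."""
    vec = {}
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_buckets
        sign = 1 if (h >> 20) % 2 == 0 else -1   # signed to reduce collision bias
        vec[idx] = vec.get(idx, 0) + sign
    return vec

def appeal(vec, weights):
    # Appeal of an article = sparse dot product with learned weights.
    # Learning the weights (e.g. by pairwise updates from which articles
    # users actually selected) is omitted here.
    return sum(v * weights.get(i, 0.0) for i, v in vec.items())

v = hashed_features("breaking news on the web")
print(len(v), appeal(v, {i: 0.1 for i in v}))
```

Because the feature space has a fixed size regardless of how many distinct words the stream contains, both the model and each update stay constant-size, which is the property that enables deployment on high-intensity streams.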
Interpreting News Recommendation Models BIBAFull-Text 891-892
  Blaz Fortuna; Pat Moore; Marko Grobelnik
This paper presents an approach for recommending news articles on a large news portal. Focus is given to interpretability of the developed models, analysis of their performance, and deriving understanding of short and long-term user behavior on a news portal.
Measuring Gender Bias in News Images BIBAFull-Text 893-898
  Sen Jia; Thomas Lansdall-Welfare; Nello Cristianini
Analysing the representation of gender in news media has a long history within the fields of journalism, media and communication. Typically this can be performed by measuring how often people of each gender are mentioned within the textual content of news articles. In this paper, we adopt a different approach, classifying the faces in images of news articles into their respective gender. We present a study on 885,573 news articles gathered from the web, covering a period of four months between 19th October 2014 and 19th January 2015 from 882 news outlets. Findings show that gender bias differs by topic, with Fashion and the Arts showing the least bias. Comparisons of gender bias by outlet suggest that tabloid-style news outlets may be less gender-biased than broadsheet-style ones, supporting previous results from textual content analysis of news articles.
Towards a Complete Event Type Taxonomy BIBAFull-Text 899-902
  Aljaz Košmerlj; Evgenia Belyaeva; Gregor Leban; Marko Grobelnik; Blaz Fortuna
We present initial results of our effort to build an extensive and complete taxonomy of events described in news articles. By crawling Wikipedia's current events portal we identified nine top-level event types. Using articles referenced by the portal we built an event type classification model for news articles using lexical and semantic features, and present a small-scale manual evaluation of its results. Results show that our model can accurately distinguish between event types, but its coverage could still be significantly improved.
The Computable News project: Research in the Newsroom BIBAFull-Text 903-908
  Will Radford; Daniel Tse; Joel Nothman; Ben Hachey; George Wright; James R. Curran; Will Cannings; Tim O'Keefe; Matthew Honnibal; David Vadas; Candice Loxley
We report on a four year academic research project to build a natural language processing platform in support of a large media company. The Computable News platform processes news stories, producing a layer of structured data that can be used to build rich applications. We describe the underlying platform and the research tasks that we explored building it. The platform supports a wide range of prototype applications designed to support different newsroom functions. We hope that this qualitative review provides some insight into the challenges involved in this type of project.

OOEW 2015

Adaptive Sequential Experimentation Techniques for A/B Testing and Model Tuning BIBAFull-Text 911
  Scott Clark
We introduce Bayesian Global Optimization as an efficient way to optimize a system's parameters, when evaluating parameters is time-consuming or expensive. The adaptive sequential experimentation techniques described can be used to help tackle a myriad of problems including optimizing a system's click-through or conversion rate via online A/B testing, tuning parameters of a machine learning prediction method or expensive batch job, designing an engineering system or finding the optimal parameters of a real-world physical experiment. We explore different tools available for performing these tasks, including Yelp's MOE and SigOpt. We will present the motivation, implementation, and background of these tools. Applications and examples from industry and best practices for using the techniques will be provided.
Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments BIBAFull-Text 913
  Alex Deng
As A/B testing gains wider adoption in the industry, more people begin to realize the limitations of the traditional frequentist null hypothesis statistical testing (NHST). The large number of search results for the query "Bayesian A/B testing" shows just how much the interest in the Bayesian perspective is growing. In recent years there are also voices arguing that Bayesian A/B testing should replace frequentist NHST and is strictly superior in all aspects. Our goal here is to clarify the myth by looking at both advantages and issues of Bayesian methods. In particular, we propose an objective Bayesian A/B testing framework for which we hope to bring the best from Bayesian and frequentist methods together. Unlike traditional methods, this method requires the existence of historical A/B test data to objectively learn a prior. We have successfully applied this method to Bing, using thousands of experiments to establish the priors.
Can I Take a Peek?: Continuous Monitoring of Online A/B Tests BIBAFull-Text 915
  Ramesh Johari
A/B testing is a hallmark of Internet services: from e-commerce sites to social networks to marketplaces, nearly all online services use randomized experiments as a mechanism to make better business decisions. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals. Despite their ubiquity, these reported values are computed under the assumption that the experimenter will not continuously monitor their test -- in other words, there should be no repeated "peeking" at the results that affects the decision of whether to continue the test. On the other hand, one of the greatest benefits of advances in information technology, computational power, and visualization is precisely the fact that experimenters can watch experiments in progress, with greater granularity and insight over time than ever before.
   We ask the question: if users will continuously monitor experiments, then what statistical methodology is appropriate for hypothesis testing, significance, and confidence intervals? We present recent work addressing this question. In particular, building from results in sequential hypothesis testing, we present analogues of classical frequentist statistical measures that are valid even though users are continuously monitoring the results.
Online Search Evaluation with Interleaving BIBAFull-Text 917
  Filip Radlinski
Online evaluation allows information retrieval systems to be assessed based on how real users respond to search results presented. Compared with traditional offline evaluation based on manual relevance assessments, online evaluation is particularly attractive in settings where reliable assessments are difficult or too expensive to obtain. However, the successful use of online evaluation requires the right metrics to be used, as real user behaviour is often difficult to interpret. I will present interleaving, a sensitive online evaluation approach that creates paired comparisons for every user query, and compare it with alternative A/B online evaluation approaches. I will also show how interleaving can be parameterized to create a family of evaluation metrics that can be chosen to best match the goals of an evaluation.
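A common concrete instance of interleaving is team-draft interleaving, sketched below: the two rankers alternately "pick" their best not-yet-shown result (with a coin flip per round for who picks first), and clicks are credited to whichever ranker's team the clicked result belongs to. The toy rankings are invented, and this is one member of the family of interleaving schemes the talk describes, not necessarily the parameterization presented:

```python
import random

def team_draft_interleave(a, b, k, rng=random.Random(0)):
    """Team-draft interleaving of rankings a and b.  Returns the shown list
    and a mapping of each shown result to team 'A' or 'B'."""
    shown, team, used = [], {}, set()

    def take(ranking):
        for doc in ranking:
            if doc not in used:
                return doc
        return None

    while len(shown) < k:
        order = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        for t in order:
            doc = take(a if t == "A" else b)
            if doc is None:
                continue
            used.add(doc)
            shown.append(doc)
            team[doc] = t
            if len(shown) == k:
                break
        if take(a) is None and take(b) is None:
            break                       # both rankings exhausted
    return shown, team

def credit(clicked, team):
    # Per-query paired comparison: which ranker's picks got the clicks?
    wins_a = sum(1 for d in clicked if team.get(d) == "A")
    wins_b = sum(1 for d in clicked if team.get(d) == "B")
    return wins_a, wins_b

a = ["d1", "d2", "d3", "d4"]
b = ["d3", "d1", "d5", "d2"]
shown, team = team_draft_interleave(a, b, k=4)
print(shown, credit(["d3"], team))
```

Aggregating these per-query credits over many queries yields the paired comparison between rankers, which is what gives interleaving its sensitivity relative to splitting users into separate A/B populations.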
Offline Evaluation of Response Prediction in Online Advertising Auctions BIBAFull-Text 919-922
  Olivier Chapelle
Click-through rates and conversion rates are two core machine learning problems in online advertising. The evaluation of such systems is often based on traditional supervised learning metrics that ignore how the predictions are used. These predictions are in fact part of bidding systems in online advertising auctions. We present here an empirical evaluation of a metric that is specifically tailored for auctions in online advertising and show that it correlates better than standard metrics with A/B test results.
Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments BIBAFull-Text 923-928
  Alex Deng
As A/B testing gains wider adoption in the industry, more people are beginning to realize the limitations of traditional frequentist null hypothesis statistical testing (NHST). The large number of search results for the query "Bayesian A/B testing" shows just how much interest in the Bayesian perspective is growing. In recent years there have also been voices arguing that Bayesian A/B testing should replace frequentist NHST and is strictly superior in all respects. Our goal here is to examine this claim by looking at both the advantages and the issues of Bayesian methods. In particular, we propose an objective Bayesian A/B testing framework through which we hope to bring together the best of the Bayesian and frequentist methods. Unlike traditional methods, this method requires historical A/B test data in order to objectively learn a prior. We have successfully applied this method at Bing, using thousands of experiments to establish the priors.
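A minimal sketch of the Bayesian two-sample comparison for click-through rates, assuming independent Beta-Binomial models; the Beta pseudo-counts stand in for the objective prior that the paper proposes to learn from historical experiments (the function name and its defaults are illustrative):

```python
import random

def prob_b_beats_a(clicks_a, n_a, clicks_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0, draws=20000, seed=0):
    """Monte Carlo estimate of the posterior probability that variant B's
    success rate exceeds A's, under independent Beta-Binomial models.
    prior_alpha/prior_beta are Beta prior pseudo-counts; the paper argues
    for learning these from historical experiments rather than fixing
    them subjectively."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        pa = rng.betavariate(prior_alpha + clicks_a, prior_beta + n_a - clicks_a)
        pb = rng.betavariate(prior_alpha + clicks_b, prior_beta + n_b - clicks_b)
        wins += pb > pa
    return wins / draws
```

With clearly separated data the posterior probability approaches 1, while for identical data it hovers around 0.5, the Bayesian analogue of an inconclusive test.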
Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study BIBAFull-Text 929-934
  Lihong Li; Shunbao Chen; Jim Kleban; Ankur Gupta
Optimizing an interactive system against a predefined online metric is particularly challenging, especially when the metric is computed from user feedback such as clicks and payments. The key challenge is the counterfactual nature of the problem: in the case of Web search, any change to a component of the search engine may result in a different search result page for the same query, but we normally cannot infer reliably from search logs how users would react to the new result page. Consequently, it appears impossible to accurately estimate online metrics that depend on user feedback, unless the new engine is actually run to serve live users and compared with a baseline in a controlled experiment. This approach, while valid and successful, is unfortunately expensive and time-consuming. In this paper, we propose to address this problem using causal inference techniques, under the contextual-bandit framework. This approach effectively allows one to run potentially many online experiments offline from search logs, making it possible to estimate and optimize online metrics quickly and inexpensively. Focusing on an important component of a commercial search engine, we show how these ideas can be instantiated and applied, and obtain very promising results that suggest the wide applicability of these techniques.
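The core offline-estimation idea under the contextual-bandit framework can be sketched with an inverse propensity score (IPS) estimator over exploration logs. This is a simplified sketch; the log format and function name are assumptions, not the paper's exact estimator:

```python
def ips_estimate(logs, new_policy):
    """Inverse-propensity-scored (IPS) estimate of the online metric a new
    policy would achieve, computed offline from exploration logs.
    Each log entry is (context, action_taken, observed_reward, propensity),
    where propensity is the logging policy's probability of that action."""
    total = 0.0
    for context, action, reward, propensity in logs:
        if new_policy(context) == action:  # only matching actions contribute
            total += reward / propensity   # reweight to correct the bias
    return total / len(logs)
```

Dividing each matching reward by the probability with which the logging policy took that action makes the estimate unbiased for the new policy's expected reward, so many candidate policies can be compared against the same log.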
Unbiased Ranking Evaluation on a Budget BIBAFull-Text 935-937
  Tobias Schnabel; Adith Swaminathan; Thorsten Joachims
We address the problem of assessing the quality of a ranking system (e.g., search engine, recommender system, review ranker) given a fixed budget for collecting expert judgments. In particular, we propose a method that selects which items to judge in order to optimize the accuracy of the quality estimate. Our method is not only efficient, but also provides estimates that are unbiased -- unlike common approaches that tend to underestimate performance or that have a bias against new systems that are evaluated re-using previous relevance scores.
Counterfactual Risk Minimization BIBAFull-Text 939-941
  Adith Swaminathan; Thorsten Joachims
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we derive generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. Using the CRM principle, we derive a new learning algorithm -- Policy Optimizer for Exponential Models (POEM) -- for structured output prediction. We evaluate POEM on several multi-label classification problems and verify that its empirical performance supports the theory.
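Schematically, the CRM principle selects the hypothesis minimizing the propensity-weighted empirical risk plus a variance penalty. The rendering below is a sketch reconstructed from the abstract's description, with λ the regularization trade-off, δ_i the logged loss, π_h the candidate policy, and π_0 the logging policy:

```latex
% CRM objective (schematic): empirical IPS risk plus a variance regularizer
\hat{h}^{\mathrm{CRM}} = \operatorname*{argmin}_{h \in \mathcal{H}}
  \left[ \frac{1}{n}\sum_{i=1}^{n} u_i^h
       + \lambda \sqrt{\frac{\widehat{\mathrm{Var}}\!\left(u_1^h,\dots,u_n^h\right)}{n}} \right],
\qquad
u_i^h = \delta_i \,\frac{\pi_h(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}
```

The variance term penalizes policies whose importance weights fluctuate wildly, which is exactly where plain IPS estimates become unreliable.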
What Size Should A Mobile Ad Be? BIBAFull-Text 943-944
  Pengyuan Wang; Wei Sun; Dawei Yin
We present a causal inference framework for evaluating the impact of advertising treatments. Our framework is computationally efficient, employing a tree structure that specifies the relationship between user characteristics and the corresponding ad treatment. We illustrate the applicability of our proposal on a novel advertising effectiveness study: finding the best ad size on different mobile devices in order to maximize success rates. The study reveals a surprising phenomenon: a larger mobile device does not call for a larger ad. In particular, the 300×250 ad size is universally good across mobile devices, regardless of device size.

RDSM 2015

Discriminative Models for Predicting Deception Strategies BIBAFull-Text 947-952
  Darren Scott Appling; Erica J. Briscoe; Clayton J. Hutto
Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting the kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty of choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among four deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single-strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy, and we also find that some kinds of deception are more difficult to predict than others.
Assessment of Tweet Credibility with LDA Features BIBAFull-Text 953-958
  Jun Ito; Jing Song; Hiroyuki Toda; Yoshimasa Koike; Satoshi Oyama
With the fast development of Social Networking Services (SNS) such as Twitter, which enable users to exchange short messages online, people can get information not only from the traditional news media but also from the masses of SNS users. However, SNS users sometimes propagate spurious or misleading information, so an effective way to automatically assess the credibility of information is required. In this paper, we propose methods to assess information credibility on Twitter, methods that utilize the "tweet topic" and "user topic" features derived from the Latent Dirichlet Allocation (LDA) model. We collected two thousand tweets labeled by seven annotators each, and designed effective features for our classifier on the basis of data analysis results. An experiment we conducted showed a 3% improvement in Area Under Curve (AUC) scores compared with existing methods, leading us to conclude that using topical features is an effective way to assess tweet credibility.
Visualization of Trustworthiness Graphs BIBAFull-Text 959-964
  Stephen Mayhew; Dan Roth
Trustworthiness is a field of research that seeks to estimate the credibility of information by using knowledge of the source of the information. The most interesting form of this problem is when different pieces of information share sources, and when there is conflicting information from different sources. This model can be naturally represented as a bipartite graph. In order to understand this data well, it is important to have several methods of exploring it. A good visualization can help to understand the problem in a way that no simple statistics can. This paper defines several desiderata for a "good" visualization and presents three different visualization methods for trustworthiness graphs.
   The first visualization method is simply a naive bipartite layout, which is infeasible in nearly all cases. The second method is a physics-based graph layout that reveals some interesting and important structure of the graph. The third method is an orthogonal approach based on the adjacency matrix representation of a graph, but with many improvements that give valuable insights into the structure of the trustworthiness graph.
   We present interactive web-based software for the third form of visualization.
Crowdsourced Rumour Identification During Emergencies BIBAFull-Text 965-970
  Richard McCreadie; Craig Macdonald; Iadh Ounis
When a significant event occurs, many social media users leverage platforms such as Twitter to track that event. Moreover, emergency response agencies are increasingly looking to social media as a source of real-time information about such events. However, false information and rumours are often spread during such events, which can influence public opinion and limit the usefulness of social media for emergency management. In this paper, we present an initial study into rumour identification during emergencies using crowdsourcing. In particular, through an analysis of three tweet datasets relating to emergency events from 2014, we propose a taxonomy of tweets relating to rumours. We then perform a crowdsourced labeling experiment to determine whether crowd assessors can identify rumour-related tweets and where such labeling can fail. Our results show that overall agreement over the produced tweet labels was high (0.7634 Fleiss' kappa), indicating that crowd-based rumour labeling is possible. However, not all tweets are equally easy to assess. Indeed, we show that tweets containing disputed/controversial information tend to be among the most difficult to identify.
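Fleiss' kappa, the agreement statistic reported above, can be computed as follows (a standard textbook implementation, not code from the paper):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-assessor agreement.
    `ratings` is a matrix where ratings[i][j] is the number of assessors
    who put item i into category j; every item must have the same number
    of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: pairwise agreement per item, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, so the reported 0.7634 indicates substantial agreement among the crowd assessors.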
Detecting Singleton Review Spammers Using Semantic Similarity BIBAFull-Text 971-976
  Vlad Sandulescu; Martin Ester
Online reviews have increasingly become a very important resource for consumers making purchase decisions. However, it is becoming more and more difficult for people to make well-informed buying decisions without being deceived by fake reviews. Prior work on the opinion spam problem mostly considered classifying fake reviews using behavioral user patterns, focusing on prolific users who write more than a couple of reviews and discarding one-time reviewers. The number of singleton reviewers, however, is expected to be high for many review websites. While behavioral patterns are effective when dealing with elite users, for one-time reviewers the review text itself needs to be exploited. In this paper we tackle the problem of detecting fake reviews written by the same person under multiple names, with each review posted under a different name. We propose two methods to detect similar reviews and show that their results generally outperform the vectorial similarity measures used in prior work. The first method extends semantic similarity between words to the review level. The second method is based on topic modeling and exploits the similarity of the reviews' topic distributions using two models: bag-of-words and bag-of-opinion-phrases. The experiments were conducted on reviews from three datasets: Yelp (57K reviews), Trustpilot (9K reviews) and the Ott dataset (800 reviews).
Fact-checking Effect on Viral Hoaxes: A Model of Misinformation Spread in Social Networks BIBAFull-Text 977-982
  Marcella Tambuscio; Giancarlo Ruffo; Alessandro Flammini; Filippo Menczer
Online social networks have facilitated the spread of misinformation, rumors and hoaxes. The goal of this work is to introduce a simple modeling framework to study the diffusion of hoaxes and, in particular, how the availability of debunking information may contain their diffusion. As traditionally done in the mathematical modeling of information diffusion processes, we regard hoaxes as viruses: users can become infected if they are exposed to them, and turn into spreaders as a consequence. Upon verification, users can also turn into non-believers and spread that attitude with a mechanism analogous to that of the hoax spreaders. Both believers and non-believers, as time passes, can return to a susceptible state. Our model is characterized by four parameters: the spreading rate, the gullibility, the probability of verifying a hoax, and the probability of forgetting one's current belief. Simulations on homogeneous, heterogeneous, and real networks for a wide range of parameter values reveal a threshold for the fact-checking probability that guarantees the complete removal of the hoax from the network. Via a mean field approximation, we establish that the threshold value does not depend on the spreading rate but only on the gullibility and the forgetting probability. Our approach allows us to quantitatively gauge the minimal reaction necessary to eradicate a hoax.
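A simplified mean-field sketch of such believer/fact-checker dynamics follows; the compartments and update rules below are an illustrative reduction (exposure only through believers, with gullibility deciding which belief an exposed user adopts), not the paper's exact model:

```python
def hoax_mean_field(beta, alpha, p_verify, p_forget, steps=2000):
    """Mean-field iteration of a simplified hoax-spreading model with
    susceptible (S), believer (B) and fact-checker (F) fractions.
    Susceptibles exposed by believers (rate beta) become believers with
    probability alpha (gullibility) and fact-checkers otherwise; believers
    verify with p_verify; both belief states are forgotten with p_forget."""
    S, B, F = 0.9, 0.1, 0.0
    for _ in range(steps):
        exposed = beta * S * B
        dB = alpha * exposed - (p_verify + p_forget) * B
        dF = (1 - alpha) * exposed + p_verify * B - p_forget * F
        B, F = B + dB, F + dF
        S = 1.0 - B - F
    return S, B, F
```

Even in this reduced form the threshold behavior appears: near extinction the believer fraction grows only while `beta * alpha` exceeds `p_verify + p_forget`, so a sufficiently high verification probability drives the hoax out regardless of its initial reach.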
Real-Time News Certification System on Sina Weibo BIBAFull-Text 983-988
  Xing Zhou; Juan Cao; Zhiwei Jin; Fei Xie; Yu Su; Dafeng Chu; Xuehui Cao; Junqiang Zhang
In this paper, we propose a novel framework for real-time news certification. Traditional methods detect rumors at the message level and analyze the credibility of a single tweet. In most cases, however, we remember only the keywords of an event, and it is hard to completely describe an event in one tweet. Based on the keywords of an event, we gather related microblogs through a distributed data acquisition system that meets real-time processing needs. We then build an ensemble model that combines user-based, propagation-based and content-based models. Experiments show that our system responds in 35 seconds on average per query, which is critical for a real-time system. Most importantly, the ensemble model boosts performance. We also provide important information, such as key users, key microblogs and event timelines, for further investigation of an event. Our system has been deployed at the Xinhua News Agency for half a year. To the best of our knowledge, this is the first real-time news certification system for verifying social media content.

SAVE-SD 2015

Increasing the Productivity of Scholarship: The Case for Knowledge Graphs BIBAFull-Text 993
  Paul Groth
Over the past several years, we have seen an explosion in the number of tools and services that enable scholars to improve their personal productivity, whether through socially enabled reference managers or cloud-hosted experimental environments. However, we have yet to see a step-change in the productivity of the system of scholarship as a whole. While there are certainly broader social reasons for this, in this talk I argue that we are only now in a technical position to create radical change in how scholarship is performed. Specifically, I will discuss how recent advances in machine reading, together with developments in open data and explicit social networks, can be used to create scholarly knowledge graphs. These graphs can connect the underlying intellectual corpus with ongoing discourse, allowing the development of algorithms that hypothesize, filter and reflect alongside humans.
It is not What but Who you Know: A Time-Sensitive Collaboration Impact Measure of Researchers in Surrounding Communities BIBAFull-Text 995-1000
  Luigi Di Caro; Mario Cataldi; Myriam Lamolle; Claudio Schifanella
In recent decades, many measures and metrics have been proposed with the goal of automatically providing quantitative rather than qualitative indications of researchers' academic production. However, when evaluating a researcher, most commonly applied measures do not consider one of the key aspects of every research work: the collaborations among researchers and, more specifically, the impact that each co-author has on the scientific production of another. In an evaluation process, co-authored works can unconditionally favor researchers working in competitive research environments, surrounded by experts able to lead high-quality research projects; state-of-the-art measures usually fail to distinguish such co-authors on the basis of their publication history alone. In light of this, instead of focusing on a purely quantitative/qualitative evaluation of curricula, we propose a novel temporal model for formalizing and estimating the dependence of a researcher on individual collaborations, over time, in surrounding communities. We then implemented and evaluated our model with a set of experiments on real-world scenarios and through an extensive user study.
Exploring Bibliographies for Research-related Tasks BIBAFull-Text 1001-1006
  Angelo Di Iorio; Raffaele Giannella; Francesco Poggi; Fabio Vitali
Bibliographies are fundamental tools for research communities. Besides their obvious use as connections to previous research, citations are also widely used for evaluation purposes: the productivity of researchers, departments and universities is increasingly measured by counting their citations. Unfortunately, citation counters are just rough indicators: deeper knowledge of individual citations -- where, when, by whom and why -- improves research evaluation tasks and supports researchers in their daily activity. Yet such information is mostly hidden within repositories of scholarly papers and is still difficult to find, navigate and make use of.
   In this paper, we present a novel tool for exploring scientific articles through their citations. The environment is built on top of a rich citation network, encoded as Linked Open Data, and includes a user-friendly interface to access, filter and highlight information about bibliographic data.
Conference Live: Accessible and Sociable Conference Semantic Data BIBAFull-Text 1007-1012
  Anna Lisa Gentile; Maribel Acosta; Luca Costabello; Andrea Giovanni Nuzzolese; Valentina Presutti; Diego Reforgiato Recupero
In this paper we describe Conference Live, a semantic Web application for browsing conference data. Conference Live is a Web and mobile application based on conference data from the Semantic Web Dog Food server, and provides facilities for browsing the papers and authors at a specific conference. The available data for each conference is enriched with social features (e.g. the integrated Twitter accounts of paper authors), scheduling features (calendar information attached to paper presentations and social events), and the possibility to read and add feedback on each paper and to vote for papers, where the conference includes sessions with voting, as is popular e.g. for poster sessions. As a use case, we report on the usage of the application at the Extended Semantic Web Conference (ESWC) in May 2014.
Period Assertion as Nanopublication: The PeriodO Period Gazetteer BIBAFull-Text 1013-1018
  Patrick Golden; Ryan Shaw
The PeriodO period gazetteer collects definitions of time periods made by archaeologists and other historical scholars. In constructing the gazetteer, we sought to make period definitions parsable and comparable by computers while also retaining the broader scholarly context in which they were conceived. Our approach resulted in a dataset of period definitions and their provenances that resemble what data scientists working in the e-science domain have dubbed "nanopublications." In this paper we describe the origin and goals of nanopublications, provide an overview of the design and implementation of a database of period definitions, and highlight the similarities and differences between the two.
The Paper or the Video: Why Choose? BIBAFull-Text 1019-1022
  Hugo Mougard; Matthieu Riou; Colin de la Higuera; Solen Quiniou; Olivier Aubert
This paper investigates the possibilities offered by the increasingly common availability of scientific video material. In particular, it investigates how best to study research results by combining recorded talks and their corresponding scientific articles. To do so, it outlines the desired properties of such an e-research system based on cognitive considerations, and discusses the related issues. This design work is complemented by the introduction of two prototypes.
What's in this paper?: Combining Rhetorical Entities with Linked Open Data for Semantic Literature Querying BIBAFull-Text 1023-1028
  Bahar Sateli; René Witte
Finding research literature pertaining to the task at hand is one of the essential activities that scientists face on a daily basis. Standard information retrieval techniques make it possible to quickly obtain a vast number of potentially relevant documents. Unfortunately, the search results then require significant effort for manual inspection, where we would rather select relevant publications based on more fine-grained, semantically rich queries involving a publication's contributions, methods, or application domains. We argue that a novel combination of three distinct methods can significantly advance this vision: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic generation of RDF triples for both NEs and REs, using semantic web ontologies to interconnect them. Combined in a single workflow, these techniques allow us to automatically construct a knowledge base that facilitates numerous advanced use cases for managing scientific documents.
Using Linked Data Traversal to Label Academic Communities BIBAFull-Text 1029-1034
  Ilaria Tiddi; Mathieu d'Aquin; Enrico Motta
In this paper we exploit knowledge from Linked Data to ease the process of analysing scholarly data. In recent years, many techniques have been presented with the aim of analysing such data and revealing new, previously hidden knowledge, generally presented in the form of "patterns". However, the discovered patterns often still require human interpretation to be further exploited, which can be a time- and energy-consuming process. Our idea is that the knowledge shared within Linked Data can actually help and ease the process of interpreting these patterns. In practice, we show how research communities obtained through standard network analytics techniques can be made more understandable by exploiting the knowledge contained in Linked Data. To this end, we apply our system Dedalo which, by performing a simple Linked Data traversal, is able to automatically label clusters of words corresponding to the topics of the different communities.
A Model for Copyright and Licensing: Elsevier's Copyright Model BIBAFull-Text 1035-1038
  Anna Tordai
With the rise of digital publishing and open access it has become increasingly important for publishers to store information about ownership and licensing of published works in a robust way. In a fast moving environment Elsevier, a leading science publisher, recognizes the importance of sound models underlying its data. In this paper, we describe a data model for copyright and licensing used by Elsevier for capturing elements of copyright. We explain some of the rationale behind the model and provide examples of frequently occurring cases in terms of the model.
Mapping The Evolution of Scientific Community Structures in Time BIBAFull-Text 1039-1044
  Theresa Velden; Shiyan Yan; Kan Yu; Carl Lagoze
The increasing online availability of scholarly corpora promises unprecedented opportunities for visualizing and studying scholarly communities. We seek to leverage this with a mixed-method approach that integrates network analysis of features of the online corpora with ethnographic studies of the communities that produce them. In developing tools and visualizations, we aim to support moving back and forth between views of community structures and the perceptions and research trajectories of individual researchers and research groups. Here we present results from tracking the temporal evolution of community structures within a research specialty. We explore how the temporal evolution of these maps can be used to provide insights into the historical evolution of a field, as well as to extract more accurate snapshots of the community structures at a given point in time. We are currently conducting qualitative interviews with experts in this research specialty to assess the validity of the maps.
Research Collaboration and Topic Trends in Computer Science: An Analysis Based on UCP Authors BIBAFull-Text 1045-1050
  Yan Wu; Srinivasan Venkatramanan; Dah Ming Chiu
Academic publication metadata can be used to analyze the collaboration, productivity and hot-topic trends of a research community. Recently, it has been shown that authors with uninterrupted and continuous presence (UCP) over a time window, though small in number (about 1%), amass the majority of significant and high-influence academic output. We adopt the UCP metric to retrieve the most active authors in the Computer Science (CS) community over different time windows in the past 50 years, and use them to analyze collaboration, productivity and topic trends. We show that the UCP authors are representative of the overall population; that the community is increasingly moving in the direction of Team Research (as opposed to Soloist or Mentor-mentee research), with an increased level and degree of collaboration; and that research topics are becoming increasingly inter-related. By focusing on the UCP authors, we can more easily visualize these trends.
Enhanced Publication Management Systems: A Systemic Approach Towards Modern Scientific Communication BIBAFull-Text 1051-1052
  Alessia Bardi; Paolo Manghi
Enhanced Publication Information Systems (EPISs) are information systems devised for the management of enhanced publications (EP), i.e. digital publications enriched with (links to) other research outcomes such as data, processing workflows, software. Today, EPISs are typically realised with a "from scratch" approach that entails non-negligible implementation and maintenance costs.
   This work argues for a more systemic approach to narrow those costs and presents the notion of Enhanced Publication Management Systems, software frameworks that support the realisation of EPISs by providing developers with EP-oriented tools and functionalities.
Visualizing Collaborations and Online Social Interactions at Scientific Conferences for Scholarly Networking BIBAFull-Text 1053-1054
  Laurens De Vocht; Selver Softic; Anastasia Dimou; Ruben Verborgh; Erik Mannens; Martin Ebner; Rik Van de Walle
The various ways of interacting with social media, web collaboration tools, and co-authorship and citation networks for scientific and research purposes remain distinct. In this paper, we propose a solution to align such information. In particular, we developed an exploratory visualization of research networks. The result is a scholar-centered, multi-perspective view of conferences and people based on their collaborations and online interactions. We measured the relevance and user acceptance of this type of interactive visualization. Preliminary results indicate high precision both for recognized people and conferences, and the majority of a group of test users responded positively to a set of statements about its acceptance.
Collaborative Exchange of Systematic Literature Review Results: The Case of Empirical Software Engineering BIBAFull-Text 1055-1056
  Fajar J. Ekaputra; Marta Sabou; Estefanía Serral; Stefan Biffl
Complementary to managing bibliographic information as digital libraries do, the management of concrete research objects (e.g., experimental workflows, design patterns) is a prerequisite to foster collaboration and the reuse of research results. In this paper we describe the case of the Empirical Software Engineering domain, where researchers use systematic literature reviews (SLRs) to conduct and report on literature studies. Given their structured nature, the outputs of such SLR processes are a special and complex type of research object. Since performing SLRs is a time-consuming process, it is highly desirable to enable the sharing and reuse of the complex knowledge structures they produce. This would enable, for example, conducting new studies that build on the findings of previous ones. To support the collaborative features necessary for multiple research groups to share and reuse each other's work, we propose a solution approach that is inspired by software engineering best practices and implemented using Semantic Web technologies.
LDP4ROs: Managing Research Objects with the W3C Linked Data Platform BIBAFull-Text 1057-1058
  Daniel Garijo; Nandana Mihindukulasooriya; Oscar Corcho
In this demo we present LDP4ROs, a prototype implementation that allows creating, browsing and updating Research Objects (ROs) and their contents using typical HTTP operations. This is achieved by aligning the RO model with the W3C Linked Data Platform (LDP).
Visual-Based Classification of Figures from Scientific Literature BIBAFull-Text 1059-1060
  Theodoros Giannakopoulos; Ioannis Foufoulas; Eleftherios Stamatogiannakis; Harry Dimitropoulos; Natalia Manola; Yannis Ioannidis
Authors of scientific publications and books use images to present a wide spectrum of information. Despite the richness of the visual content of scientific publications, figures are usually not taken into consideration by text mining methodologies for the automatic indexing and retrieval of scientific corpora. In this work, we present a system for the automatic categorization of figures from the scientific literature into a set of predefined classes. We employ a wide range of visual features that achieve high discrimination ability between the adopted classes. A real-world dataset has been compiled and annotated in order to train and evaluate the proposed method using three different classification schemata.
Science Bots: A Model for the Future of Scientific Computation? BIBAFull-Text 1061-1062
  Tobias Kuhn
As a response to the trends of the increasing importance of computational approaches and the accelerating pace in science, I propose in this position paper to establish the concept of "science bots" that autonomously perform programmed tasks on input data they encounter and immediately publish the results. We can let such bots participate in a reputation system together with human users, meaning that bots and humans get positive or negative feedback by other participants. Positive reputation given to these bots would also shine on their owners, motivating them to contribute to this system, while negative reputation will allow us to filter out low-quality data, which is inevitable in an open and decentralized system.


Predicting Pinterest: Organising the World's Images with Human-machine Collaboration BIBAFull-Text 1065
  Nishanth Sastry
The user generated content revolution has created a glut of multimedia content online -- from Flickr to Facebook, new images are being made available for public consumption every day. In this talk, we will first explore how, on sites such as Pinterest, users are bringing order to this burgeoning collection by manually curating collections of images in ways that are highly personalised and relevant to their own use. We will then discuss the phenomenon of social bootstrapping, whereby existing mature social networks such as Facebook help bootstrap engaged communities of content curators on external sites such as Pinterest. Finally, we will demonstrate how the manual effort involved in curation can be amplified through a unique human-machine collaboration: by treating the curation efforts of a subset of Pinterest users as a distributed human computation over a low-dimensional approximation of the content corpus, we derive simple yet powerful signals which, when combined with image-related features drawn from state-of-the-art deep learning techniques, allow us to automatically and accurately populate the personalised curated collections of all other users.
Challenges of Forecasting and Measuring a Complex Networked World BIBAFull-Text 1067
  Bruno Ribeiro
A new era of data analytics of online social networks promises tremendous high-impact societal, business, and healthcare applications. As more users join online social networks, the data available for analysis and forecast of human social and collective behavior grows at an incredible pace. The first part of this talk introduces an apparent paradox, where larger online social networks entail more user data but also less analytic and forecasting capabilities [7]. More specifically, the paradox applies to forecasting properties of network processes such as network cascades, showing that in some scenarios unbiased long term forecasting becomes increasingly inaccurate as the network grows but, paradoxically, short term forecasting -- such as the predictions in Cheng et al. [2] and Ribeiro et al. [7] -- improves with network size. We discuss the theoretic foundations of this paradox and its connections with known information theoretic measures such as Shannon capacity. We also discuss the implications of this paradox on the scalability of big data applications and show how information theory tools -- such as Fisher information [3,8] -- can be used to design more accurate and scalable methods for network analytics [6,8,10]. The second part of the talk focuses on how these results impact our ability to perform network analytics when network data is only available through crawlers and the complete network topology is unknown [1,4,5,9].
Understanding Complex Networks Using Graph Spectrum BIBAFull-Text 1069-1072
  Yanhua Li; Zhi-Li Zhang
Complex networks are becoming indispensable parts of our lives. The Internet, wireless (cellular) networks, online social networks, and transportation networks are examples of some well-known complex networks around us. These networks generate an immense range of big data -- weblogs, social media, Internet traffic -- which has increasingly drawn attention from the computer science research community to explore and investigate the fundamental properties of, and improve the user experiences on, these complex networks. This work focuses on understanding complex networks based on the graph spectrum, namely, developing and applying spectral graph theories and models for understanding and employing versatile and oblivious network information -- asymmetrical characteristics of wireless transmission channels, multiplex social relations (e.g., trust and distrust relations), etc. -- in solving various application problems, such as estimating transmission cost in wireless networks, Internet traffic engineering, and social influence analysis in social networks.
Pivotality of Nodes in Reachability Problems Using Avoidance and Transit Hitting Time Metrics BIBAFull-Text 1073-1078
  Golshan Golnari; Yanhua Li; Zhi-Li Zhang
Reachability is crucial to many network operations in various complex networks. More often than not, however, it is not sufficient simply to know whether a source node s can reach a target node t in the network; additional information associated with reachability is often needed, such as how long it may take, or in how many possible ways, node s may reach node t. In this paper we analyze another piece of important information associated with reachability -- which we call pivotality. Pivotality captures how pivotal a role a node k or a subset of nodes S may play in the reachability from node s to node t in a given network. We propose two important metrics, the avoidance and transit hitting times, which extend and generalize the classical notion of hitting times. We show these metrics can be computed from the fundamental matrices associated with the appropriately defined random walk transition probability matrices, and prove that the classical hitting time from a source to a target can be decomposed into the avoidance and transit hitting times with respect to any third node. Through simulated and real-world network examples, we demonstrate that these metrics provide a powerful tool for ranking nodes based on their pivotality in reachability.
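The classical hitting time that these metrics generalize can be computed from a fundamental matrix, as the abstract notes. A minimal sketch of that classical computation (not the authors' avoidance/transit variants, which require their specific constructions) might look like:

```python
import numpy as np

def hitting_times(P, t):
    """Expected number of steps for a random walk with transition
    matrix P to first reach target t, from every other node.
    Computed via the fundamental matrix N = (I - Q)^-1, where Q is
    P restricted to the non-target nodes."""
    n = P.shape[0]
    idx = [i for i in range(n) if i != t]
    Q = P[np.ix_(idx, idx)]                # walk among non-target nodes
    N = np.linalg.inv(np.eye(n - 1) - Q)   # fundamental matrix
    h = np.zeros(n)
    h[idx] = N @ np.ones(n - 1)            # row sums of N
    return h
```

For a simple random walk on the three-node path 0-1-2, this gives a hitting time of 4 steps from node 0 to node 2.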
Rise and Fall of Online Game Groups: Common Findings on Two Different Games BIBAFull-Text 1079-1084
  Ah Reum Kang; Juyong Park; Jina Lee; Huy Kang Kim
Among many types of online games, Massively Multiplayer Online Role Playing Games (MMORPGs) provide players with the most realistic gaming experience inspired by the real, offline world. In particular, much stress is put upon socializing and collaborating with others as a condition for one's success, just as in real life. An advantage of studying MMORPGs is that, since all actions are recorded, we can observe phenomena that are hard to observe in real life. For instance, we can observe from the data how the all-important collaborations between people come into being, evolve, and eventually die out, gaining valuable insights into group dynamics. In this paper, we analyze the successes and failures of online game groups in two different MMORPGs, ArcheAge of XLGames, Inc. and Aion of NCsoft, Inc. We find that there exist factors influencing the dynamics of group growth that are common to both games, regardless of the games' maturity.
Finding Relevant Indian Judgments using Dispersion of Citation Network BIBAFull-Text 1085-1088
  Akshay Minocha; Navjyoti Singh; Arjit Srivastava
We construct a complex citation network of a subset of Indian Constitutional Articles and the legal judgments that invoke them. We describe how this dataset is constructed, and introduce dispersion -- a measure from network science originally applied to social networks -- in the context of legal relevance. Our research shows that dispersion is a decisive structural feature for identifying relevant legal judgments and landmark decisions. Our method provides similarity information about the document in question which otherwise remains undetected by standard citation metrics.
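For context, dispersion (from Backstrom and Kleinberg's work on social ties) measures how disconnected the common neighbors of an edge's endpoints are from one another. A simplified sketch of the basic count, which the full metric refines with further exclusions, could be:

```python
from itertools import combinations

def dispersion(adj, u, v):
    """Simplified dispersion of the edge (u, v): the number of pairs
    of common neighbors of u and v that are not directly connected
    to each other. adj maps each node to its set of neighbors."""
    common = adj[u] & adj[v]
    return sum(1 for s, t in combinations(common, 2) if t not in adj[s])
```

A high dispersion means the endpoints' shared contacts come from otherwise unrelated parts of the network, which is the structural signal the paper exploits for legal relevance.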
On Skewed Distributions and Straight Lines: A Case Study on the Wiki Collaboration Network BIBAFull-Text 1089-1094
  Osnat Mokryn; Alexey Reznik
In this paper, we present the hypothesis that power laws are found only in datasets sampled from static data, in which each and every item has already gained its maximal importance and is not in the process of change during the sampling period. We motivate our hypothesis by examining languages and the word-ranking distribution as it appears in books and in the Bible. To demonstrate the validity of our hypothesis, we experiment with the Wikipedia edit collaboration network. We find that the dataset fits a skewed distribution. Next, we identify its dynamic part. We then show that when the modified part is removed from the dataset, the remaining static part exhibits a good fit to a power law distribution.
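A standard way to quantify such a fit is the continuous maximum-likelihood estimator of the power-law exponent (Clauset, Shalizi and Newman), rather than least-squares on a log-log plot; the abstract does not specify the authors' fitting procedure, so this is purely illustrative:

```python
import math

def powerlaw_alpha(xs, xmin):
    """Continuous MLE for the exponent alpha of p(x) ~ x^(-alpha),
    using only the samples with x >= xmin."""
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)
```

Applied to the degree or edit-count distribution of the static part of the graph, this yields the exponent whose goodness of fit can then be tested.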
Distributed Community Detection with the WCC Metric BIBAFull-Text 1095-1100
  Matthew Saltz; Arnau Prat-Pérez; David Dominguez-Sal
Community detection has become an extremely active area of research in recent years, with researchers proposing various new metrics and algorithms to address the problem. Recently, the Weighted Community Clustering (WCC) metric was proposed as a novel way to judge the quality of a community partitioning based on the distribution of triangles in the graph, and was demonstrated to yield superior results over other commonly used metrics like modularity. The same authors later presented a parallel algorithm for optimizing WCC on large graphs. In this paper, we propose a new distributed, vertex-centric algorithm for community detection using the WCC metric. Results are presented that demonstrate the algorithm's performance and scalability on up to 32 worker machines and real graphs of up to 1.8 billion edges. The algorithm scales best with the largest graphs, finishing in just over an hour for the largest graph, and to our knowledge, it is the first distributed algorithm for optimizing the WCC metric.
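WCC scores a partition by how each vertex's triangles fall inside versus outside its community; the basic building block is a per-vertex triangle count. A sketch of that count only (the metric itself, and the distributed vertex-centric optimization, are considerably more involved):

```python
def triangles_per_node(edges):
    """Count, for each node, the triangles it participates in,
    given an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # each triangle at u is a pair of mutually adjacent neighbors,
    # found twice (once per ordering), hence the division by 2
    return {u: sum(len(adj[w] & nbrs) for w in nbrs) // 2
            for u, nbrs in adj.items()}
```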

SocialNLP 2015

Mining Social and Urban Big Data BIBAFull-Text 1103
  Nicholas Jing Yuan
In recent years, with the rapid development of positioning technologies, online social networks, sensors and smart devices, large-scale human behavioral data have become readily available. The growing availability of such behavioral data provides us with unprecedented opportunities to gain a more in-depth understanding of users in both the physical world and the cyber world, especially in online social networks. In this talk, I will introduce our recent research efforts in social and urban mining based on large-scale human behavioral datasets, showcased by two projects: 1) LifeSpec: modeling the spectrum of urban lifestyles based on heterogeneous online social network data; and 2) L2P: inferring demographic attributes from location check-ins.
Expert-Guided Contrastive Opinion Summarization for Controversial Issues BIBAFull-Text 1105-1110
  Jinlong Guo; Yujie Lu; Tatsunori Mori; Catherine Blake
This paper presents a new model for the task of contrastive opinion summarization (COS), particularly for controversial issues. Traditional COS methods, which mainly rely on sentence similarity measures, are not sufficient for a complex controversial issue. We therefore propose an Expert-Guided Contrastive Opinion Summarization (ECOS) model. Compared to previous methods, our model can (1) integrate expert opinions with ordinary opinions from social media and (2) better align the contrastive arguments under the guidance of expert prior opinion. We create a new dataset about a complex social issue with "sufficient" controversy, and experimental results on this data show that the proposed model is effective for (1) producing a better argument summary for understanding a controversial issue and (2) generating contrastive sentence pairs.
ResToRinG CaPitaLiZaTion in #TweeTs BIBAFull-Text 1111-1115
  Kamel Nebhi; Kalina Bontcheva; Genevieve Gorrell
The rapid proliferation of microblogs such as Twitter has resulted in a vast quantity of written text becoming available that contains interesting information for NLP tasks. However, the noise level in tweets is so high that standard NLP tools perform poorly. In this paper, we present a statistical truecaser for tweets using a 3-gram language model built with truecased newswire texts and tweets. Our truecasing method shows an improvement in named entity recognition and part-of-speech tagging tasks.
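The paper builds a 3-gram language model over truecased newswire text and tweets; as an illustration of the task only, a unigram baseline that restores each token's most frequent observed casing might look like this (function names are hypothetical):

```python
from collections import Counter, defaultdict

def build_casing_model(cased_tokens):
    """Map each lowercased token to its most frequent surface casing
    observed in a truecased training corpus."""
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def truecase(tokens, model):
    """Restore casing token by token; unseen tokens pass through."""
    return [model.get(tok.lower(), tok) for tok in tokens]
```

A 3-gram model improves on this baseline by disambiguating tokens whose correct casing depends on context (e.g. sentence-initial words).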
Supervised Prediction of Social Network Links Using Implicit Sources of Information BIBAFull-Text 1117-1122
  Ervin Tasnádi; Gábor Berend
In this paper, we introduce a supervised machine learning framework for the link prediction problem. The social network we conducted our empirical evaluation on originates from the restaurant review portal yelp.com. The proposed framework not only uses the structure of the social network to predict non-existing edges in it, but also makes use of further graphs that were constructed based on implicit information provided in the dataset. The implicit information we relied on includes the language use of the members of the social network and their ratings with respect to the businesses they reviewed. Here, we also investigate the possibility of building supervised learning models to predict social links without relying on features derived from the structure of the social network itself, but based on such implicit information alone. Our empirical results not only revealed that the features derived from different sources of implicit information can be useful on their own, but also that incorporating them in a unified framework has the potential to improve classification results, as the different sources of implicit information can provide independent and useful views about the connectedness of users.
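The abstract does not list the exact features used, but structural signals such as common-neighbor counts and Jaccard overlap are typical inputs to such a supervised link predictor, and can be computed on the social graph or on any of the implicit-information graphs; a hypothetical sketch:

```python
def link_features(adj, u, v):
    """Structural features for a candidate edge (u, v), given a dict
    mapping each node to its set of neighbors. The same function can
    be applied to each auxiliary graph to widen the feature vector."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common, union = nu & nv, nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
    }
```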

SOCM 2015

An Explorative Approach for Crowdsourcing Tasks Design BIBAFull-Text 1125-1130
  Marco Brambilla; Stefano Ceri; Andrea Mauri; Riccardo Volonterio
Crowdsourcing applications are becoming widespread; they cover very different scenarios, including opinion mining, multimedia data annotation, localised information gathering, marketing campaigns, expert response gathering, and so on. The quality of the outcome of these applications depends on different design parameters and constraints, and it is very hard to judge their combined effects without running experiments; on the other hand, there are no established guidelines on how to conduct such experiments, and thus they are often conducted in an ad-hoc manner, typically through adjustments of an initial strategy that may converge to a parameter setting quite different from the best possible one. In this paper we propose a comparative, explorative approach for designing crowdsourcing tasks. The method consists of defining a representative set of execution strategies, executing them on a small dataset, collecting quality measures for each candidate strategy, and finally deciding on the strategy to be used with the complete dataset.
Towards Government as a Social Machine BIBAFull-Text 1131-1136
  Vanilson Burégio; Kellyton Brito; Nelson Rosa; Misael Neto; Vinícius Garcia; Silvio Meira
Government initiatives to open data to the public are becoming increasingly popular every day. The vast amount of data made available by government organizations yields interesting opportunities and challenges -- both socially and technically. In this paper, we propose a social machine-oriented architecture as a way to extend the power of open data and create the basis for deriving government as a social machine (Gov-SM). The proposed Gov-SM combines principles from existing architectural patterns and provides a platform of specialized APIs to enable the creation of several other socio-technical systems on top of it. Based on some implementation experiences, we believe that deriving government as a social machine can, in more than one sense, help to fully integrate users, developers, and the crowd in participating in and solving a multitude of governmental and policy issues.
When Resources Collide: Towards a Theory of Coincidence in Information Spaces BIBAFull-Text 1137-1142
  Markus Luczak-Roesch; Ramine Tinati; Nigel Shadbolt
This paper is an attempt to lay out the foundations for a general theory of coincidence in information spaces such as the World Wide Web, expanding on existing work on bursty structures in document streams and information cascades. We elaborate on the hypothesis that every resource published in an information space enters a temporary interaction with another resource once a unique explicit or implicit reference between the two is found. This thought is motivated by Erwin Schrödinger's notion of entanglement between quantum systems. We present a generic information cascade model that exploits only the temporal order of information sharing activities, combined with inherent properties of the shared information resources. The approach was applied to data from the world's largest online citizen science platform, Zooniverse, and we report on the findings of this case study.
On Wayfaring in Social Machines BIBAFull-Text 1143-1148
  Dave Murray-Rust; Segolene Tarte; Mark Hartswood; Owen Green
In this paper, we concern ourselves with the ways in which humans inhabit social machines: the structures and techniques which allow the enmeshing of multiple life traces within the flow of online interaction. In particular, we explore the distinction between transport and journeying, between networks and meshworks, and the different attitudes and modes of being appropriate to each. By doing this, we hope to capture a part of the sociality of social machines, to build an understanding of the ways in which lived lives relate to digital structures, and the emergence of the communality of shared work. In order to illustrate these ideas, we look at several aspects of existing social machines, and tease apart the qualities which relate to the different modes of being. The distinctions and concepts outlined here provide another element in both the analysis and development of social machines, understanding how people may joyfully and directedly engage with collective activities on the web.
A Streaming Real-Time Web Observatory Architecture for Monitoring the Health of Social Machines BIBAFull-Text 1149-1154
  Ramine Tinati; Xin Wang; Ian Brown; Thanassis Tiropanis; Wendy Hall
Over the past few years, streaming Web services have become popular, with many of the top Web platforms now offering near real-time streams of user and machine activity. In light of this, Web Observatories are now faced with the challenge of processing and republishing real-time, big-data Web streams whilst maintaining access control and data consistency. In this paper we describe the architecture used in the Southampton Web Observatory to harvest, process, and serve real-time Web streams.
Social Personal Data Stores: the Nuclei of Decentralised Social Machines BIBAFull-Text 1155-1160
  Max Van Kleek; Daniel A. Smith; Dave Murray-Rust; Amy Guy; Kieron O'Hara; Laura Dragan; Nigel R. Shadbolt
Personal Data Stores are among the many efforts currently underway to re-decentralise the Web and to bring data management and storage capability back under the control of the user. Few of these architectures, however, have considered the needs of supporting decentralised social software from the user's perspective. In this short paper, we present the results of our design exercise, focusing on two key design needs for building decentralised social machines: supporting heterogeneous social apps and multiple, separable user identities. We then present the technical design of a prototype social machine platform, INDX, which realises both of these requirements, and a prototype heterogeneous microblogging application which demonstrates its capabilities.
Revisiting the Three Rs of Social Machines: Reflexivity, Recognition and Responsivity BIBAFull-Text 1161-1166
  Jeff Vass; Jo E. Munson
This paper sets out an approach to Social Machines (SMs), their description and analysis, based on a development of social constructionist theoretical principles adapted for Web Science. We argue that currently the search for the primitives of SMs, or appropriate units of analysis to describe them, tends to favour either the technology or sociality. We suggest an approach that favours distributed agency whether it is machinic or human or both. We argue that current thinking (e.g. Actor Network Theory) is unsuited to SMs. Instead we describe an alternative which prioritizes a view of socio-technical activity as forming 'reflexive project structures'. We show that reflexivity in social systems can be further usefully divided into more fundamental elements (Recognition and Responsivity). This process enables us to capture more of the variation in SMs and to distinguish them from non-Web based socio-technical systems. We illustrate the approach by looking at different kinds of SMs showing how they relate to contemporary social theory.

SWDM 2015

Towards Next-Generation Software Infrastructure for Crisis Informatics Research BIBAFull-Text 1169
  Kenneth M. Anderson
Crisis Informatics is a multidisciplinary research area that examines the socio-technical relationships among people, information, and technology during mass emergency events. One area of crisis informatics examines the on-line behaviors of members of the public making use of social media during a crisis event to make sense of it, to report on it, and, in some cases, to coordinate a response to it either locally or from afar. In order to study those behaviors, this social media data has to be systematically captured and stored in a scalable and reliable way for later analysis. Project EPIC is a large U.S. National Science Foundation funded project that has been performing crisis informatics research since Fall 2009 and has been designing and developing a reliable and robust software infrastructure for the storage and analysis of large crisis informatics data sets. Prof. Ken Anderson has led the research and development in this software engineering effort and will discuss the challenges (both technical and social) that Project EPIC faced in developing its software infrastructure, known as EPIC Collect and EPIC Analyze. EPIC Collect has been in 24/7 operation in various forms since Spring 2010 and has collected terabytes of social media data across hundreds of mass emergency events since that time. EPIC Analyze is a data analysis platform for large social media data sets that provides efficient browsing, filtering, and collaborative annotation services. Prof. Anderson will discuss these systems and also present the challenges of collecting and analyzing social media data (with an emphasis on Twitter data) at scale. Project EPIC has designed and evaluated software architectural st