HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 0708091011-111-212-112-213-113-214-114-215-115-2

Proceedings of the 2014 International Conference on the World Wide Web

Fullname:Proceedings of the 23rd International Conference on World Wide Web
Editors:Chin-Wan Chung; Andrei Broder; Kyuseok Shim; Torsten Suel
Location:Seoul, Korea
Dates:2014-Apr-07 to 2014-Apr-11
Standard No:ISBN: 978-1-4503-2744-2; ACM DL: Table of Contents; hcibib: WWW14-1
Links:Conference Website
  1. WWW 2014-04-07 Volume 1
    1. Keynote talks
    2. Internet economics & monetization 1
    3. Security 1
    4. Collaborative recommendation systems
    5. Web search 1
    6. Crowdsourcing 1
    7. Security 2
    8. Social networks 1
    9. Matching
    10. User interaction
    11. Recommendation and predictions
    12. Web mining 1
    13. Content analysis 1 -- entities
    14. User behavior
    15. Semantic web 1
    16. Web mining 2
    17. Content analysis 2 -- topics
    18. Software infrastructure
    19. Online experiments & advertising
    20. Social networks 2
    21. The future
    22. Internet economics & monetization 2
    23. Semantic web 2
    24. Web mining 3
    25. Social networks 3 -- modeling influences in graphs
    26. Content quality & popularity
    27. Conflicts & games
    28. Social networks 4 -- diffusion
    29. Web search 2

WWW 2014-04-07 Volume 1

Keynote talks

Large graph mining: patterns, cascades, fraud detection, and algorithms BIBAFull-Text 1-2
  Christos Faloutsos
Given a large graph, like who-calls-whom, or who-likes-whom, what behavior is normal and what should be surprising, possibly due to fraudulent activity? How do graphs evolve over time? How does influence/news/viruses propagate, over time? We focus on three topics: (a) anomaly detection in large static graphs (b) patterns and anomalies in large time-evolving graphs and (c) cascades and immunization. For the first, we present a list of static and temporal laws, including advances patterns like 'eigenspokes'; we show how to use them to spot suspicious activities, in on-line buyer-and-seller settings, in FaceBook, in twitter-like networks. For the second, we show how to handle time-evolving graphs as tensors, how to handle large tensors in map-reduce environments, as well as some discoveries such settings. For the third, we show that for virus propagation, a single number is enough to characterize the connectivity of graph, and thus we show how to do efficient immunization for almost any type of virus (SIS -- no immunity; SIR -- lifetime immunity; etc) We conclude with some open research questions for graph mining.
Taming the web BIBAFull-Text 3-4
  Jong-Deok Choi
The World Wide Web (WWW) has become an indispensable part of the modern life, providing many benefits in diverse ways. For instance, the huge amount of information from the web offers people unprecedented levels of opportunities for education, entertainments, social activities, productivity improvements, and business. The web, however, has also become perilous with many dangers, such as privacy violation and security breaches, and therein exist many villains who would like to turn into victims scrupulous as well as casual users. In this regard, WWW has become almost like Wild Wild West, where wonderful opportunities and great perils co-existed. Tizen (www.tizen.org) is a web-centric open-source, standards-based software platform for smart devices, such as smartphones, smart TVs, IVI (In-Vehicle Infotainment) and other consumer devices like cameras, printers, and more. Tizen is web-centric in that it directly supports web apps -- applications (apps) written in HTML5 and Javascript -- even outside the web-browsers and provides seamless supports for the web. As such, Tizen not only shares the benefits and perils of the web with other platforms, but also has the additional burden to meet the performance of non-web platforms: platforms that directly support only conventional programming languages. In this talk, we present Tizen's approaches to taming the web to maximize its benefits while minimizing the risks of its perils. We also describe various optimizations of Tizen that enable delivering web-app performance on par with that of non-web platforms.
Organizing the digital world to empower people to do more, know more, and be more BIBAFull-Text 5-6
  Qi Lu
The web is rapidly evolving into the web of the world where people, places, things and their relationships are all digitally represented. This evolution opens up unparalleled opportunities to organize this vast digital universe for even greater human purpose. In this talk, Dr. Lu will share an outline of Microsoft's quest and aspiration to organize the digital universe with a pervasive computational fabric of digital information, digital services, and digital experiences that empower every human being on the planet to accomplish more and enrich their life. Dr. Lu will discuss high level computational structures and present specific examples across Bing, Windows and other products and services to illustrate Microsoft's approach to delivering end user value and accelerating the pace of innovation for the industry as a whole.

Internet economics & monetization 1

Machine learning in an auction environment BIBAFull-Text 7-18
  Patrick Hummel; Preston McAfee
We consider a model of repeated online auctions in which an ad with an uncertain click-through rate faces a random distribution of competing bids in each auction and there is discounting of payoffs. We formulate the optimal solution to this explore/exploit problem as a dynamic programming problem and show that efficiency is maximized by making a bid for each advertiser equal to the advertiser's expected value for the advertising opportunity plus a term proportional to the variance in this value divided by the number of impressions the advertiser has received thus far. We then use this result to illustrate that the value of incorporating active exploration into a machine learning system in an auction environment is exceedingly small.
Optimal revenue-sharing double auctions with applications to ad exchanges BIBAFull-Text 19-28
  Renato Gomes; Vahab Mirrokni
E-commerce web-sites such as Ebay as well as advertising exchanges (AdX) such as DoubleClick's, RightMedia, or AdECN work as intermediaries who sell items (e.g. page-views) on behalf of a seller (e.g. a publisher) to buyers on the opposite side of the market (e.g., advertisers). These platforms often use fixed-percentage sharing schemes, according to which (i) the platform runs an auction amongst buyers, and (ii) gives the seller a constant-fraction (e.g., 80%) of the auction proceeds. In these settings, the platform faces asymmetric information regarding both the valuations of buyers for the item (as in a standard auction environment) as well as about the seller's opportunity cost of selling the item. Moreover, platforms often face intense competition from similar market places, and such competition is likely to favor auction rules that secure high payoffs to sellers. In such an environment, what selling mechanism should platforms employ? Our goal in this paper is to study optimal mechanism design in settings plagued by competition and two-sided asymmetric information, and identify conditions under which the current practice of employing constant cuts is indeed optimal.
   In particular, we first show that for a large class of competition games, platforms behave in equilibrium as if they maximize a convex combination of seller's payoffs and platform's revenue, with weight α on the seller's payoffs (which is proxy for the intensity of competition in the market). We generalize the analysis of Myerson and Satterthwaite (1983), and derive the optimal direct-revelation mechanism for each α. As expected, the optimal mechanism applies a reserve price which is decreasing in α. Next, we present an indirect implementation based on "sharing schemes". We show that constant cuts are optimal if and only if the opportunity cost of the seller has a power-form distribution, and derive a simple formula for computing the optimal constant cut as a function of the sellers' distribution of opportunity costs, and the market competition proxy α. Finally, for completeness, we study the case of a seller's optimal auction with a fixed profit for the platform, and derive the optimal direct and indirect implementations in this setting.
Advertising in a stream BIBAFull-Text 29-38
  Samuel Ieong; Mohammad Mahdian; Sergei Vassilvitskii
One of the most important innovations of social networking websites is the notion of a "feed", a sequence of news items presented to the user as a stream that expands as the user scrolls down. The common method for monetizing such streams is to insert ads in between news items. In this paper, we model this setting, and observe that allocation and pricing of ad insertions in a stream poses interesting algorithmic and mechanism design challenges. In particular, we formulate an optimization problem that captures a typical stream ad placement setting. We give an approximation algorithm for this problem that provably achieves a value close to the optimal, and show how this algorithm can be turned into an incentive compatible mechanism. Finally, we conclude with a simple practical algorithm that makes the allocation decisions in an online fashion. We prove this algorithm to be approximately welfare-maximizing and show that it also has good incentive properties.

Security 1

The company you keep: mobile malware infection rates and inexpensive risk indicators BIBAFull-Text 39-50
  Hien Thi Thu Truong; Eemil Lagerspetz; Petteri Nurmi; Adam J. Oliner; Sasu Tarkoma; N. Asokan; Sourav Bhattacharya
There is little information from independent sources in the public domain about mobile malware infection rates. The only previous independent estimate (0.0009%) [11], was based on indirect measurements obtained from domain-name resolution traces. In this paper, we present the first independent study of malware infection rates and associated risk factors using data collected directly from over 55,000 Android devices. We find that the malware infection rates in Android devices estimated using two malware datasets (0.28% and 0.26%), though small, are significantly higher than the previous independent estimate. Based on the hypothesis that some application stores have a greater density of malicious applications and that advertising within applications and cross-promotional deals may act as infection vectors, we investigate whether the set of applications used on a device can serve as an indicator for infection of that device. Our analysis indicates that, while not an accurate indicator of infection by itself, the application set does serve as an inexpensive method for identifying the pool of devices on which more expensive monitoring and analysis mechanisms should be deployed. Using our two malware datasets we show that this indicator performs up to about five times better at identifying infected devices than the baseline of random checks. Such indicators can be used, for example, in the search for new or previously undetected malware. It is therefore a technique that can complement standard malware scanning. Our analysis also demonstrates a marginally significant difference in battery use between infected and clean devices.
Stranger danger: exploring the ecosystem of ad-based URL shortening services BIBAFull-Text 51-62
  Nick Nikiforakis; Federico Maggi; Gianluca Stringhini; M. Zubair Rafique; Wouter Joosen; Christopher Kruegel; Frank Piessens; Giovanni Vigna; Stefano Zanero
URL shortening services facilitate the need of exchanging long URLs using limited space, by creating compact URL aliases that redirect users to the original URLs when followed. Some of these services show advertisements (ads) to link-clicking users and pay a commission of their advertising earnings to link-shortening users.
   In this paper, we investigate the ecosystem of these increasingly popular ad-based URL shortening services. Even though traditional URL shortening services have been thoroughly investigated in previous research, we argue that, due to the monetary incentives and the presence of third-party advertising networks, ad-based URL shortening services and their users are exposed to more hazards than traditional shortening services. By analyzing the services themselves, the advertisers involved, and their users, we uncover a series of issues that are actively exploited by malicious advertisers and endanger the users. Moreover, next to documenting the ongoing abuse, we suggest a series of defense mechanisms that services and users can adopt to protect themselves.
Automatic detection and correction of web application vulnerabilities using data mining to predict false positives BIBAFull-Text 63-74
  Ibéria Medeiros; Nuno F. Neves; Miguel Correia
Web application security is an important problem in today's internet. A major cause of this status is that many programmers do not have adequate knowledge about secure coding, so they leave applications with vulnerabilities. An approach to solve this problem is to use source code static analysis to find these bugs, but these tools are known to report many false positives that make hard the task of correcting the application. This paper explores the use of a hybrid of methods to detect vulnerabilities with less false positives. After an initial step that uses taint analysis to flag candidate vulnerabilities, our approach uses data mining to predict the existence of false positives. This approach reaches a trade-off between two apparently opposite approaches: humans coding the knowledge about vulnerabilities (for taint analysis) versus automatically obtaining that knowledge (with machine learning, for data mining). Given this more precise form of detection, we do automatic code correction by inserting fixes in the source code. The approach was implemented in the WAP tool and an experimental evaluation was performed with a large set of open source PHP applications.

Collaborative recommendation systems

Personalized collaborative clustering BIBAFull-Text 75-84
  Yisong Yue; Chong Wang; Khalid El-Arini; Carlos Guestrin
We study the problem of learning personalized user models from rich user interactions. In particular, we focus on learning from clustering feedback (i.e., grouping recommended items into clusters), which enables users to express similarity or redundancy between different items. We propose and study a new machine learning problem for personalization, which we call collaborative clustering. Analogous to collaborative filtering, in collaborative clustering the goal is to leverage how existing users cluster or group items in order to predict similarity models for other users' clustering tasks. We propose a simple yet effective latent factor model to learn the variability of similarity functions across a user population. We empirically evaluate our approach using data collected from a clustering interface we developed for a goal-oriented data exploration (or sensemaking) task: asking users to explore and organize attractions in Paris. We evaluate using several realistic use cases, and show that our approach learns more effective user models than conventional clustering and metric learning approaches.
Local collaborative ranking BIBAFull-Text 85-96
  Joonseok Lee; Samy Bengio; Seungyeon Kim; Guy Lebanon; Yoram Singer
Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Collaborative filtering approaches to recommendation systems typically assume that the rating matrix (e.g., movie ratings by viewers) is low-rank. In this paper, we examine an alternative approach in which the rating matrix is locally low-rank. Concretely, we assume that the rating matrix is low-rank within certain neighborhoods of the metric space defined by (user, item) pairs. We combine a recent approach for local low-rank approximation based on the Frobenius norm with a general empirical risk minimization for ranking losses. Our experiments indicate that the combination of a mixture of local low-rank matrices each of which was trained to minimize a ranking loss outperforms many of the currently used state-of-the-art recommendation systems. Moreover, our method is easy to parallelize, making it a viable approach for large scale real-world rank-based recommendation systems.
CoBaFi: collaborative Bayesian filtering BIBAFull-Text 97-108
  Alex Beutel; Kenton Murray; Christos Faloutsos; Alexander J. Smola
Given a large dataset of users' ratings of movies, what is the best model to accurately predict which movies a person will like? And how can we prevent spammers from tricking our algorithms into suggesting a bad movie? Is it possible to infer structure between movies simultaneously? In this paper we describe a unified Bayesian approach to Collaborative Filtering that accomplishes all of these goals. It models the discrete structure of ratings and is flexible to the often non-Gaussian shape of the distribution. Additionally, our method finds a co-clustering of the users and items, which improves the model's accuracy and makes the model robust to fraud. We offer three main contributions: (1) We provide a novel model and Gibbs sampling algorithm that accurately models the quirks of real world ratings, such as convex ratings distributions. (2) We provide proof of our model's robustness to spam and anomalous behavior. (3) We use several real world datasets to demonstrate the model's effectiveness in accurately predicting user's ratings, avoiding prediction skew in the face of injected spam, and finding interesting patterns in real world ratings data.

Web search 1

Efficient estimation for high similarities using odd sketches BIBAFull-Text 109-118
  Michael Mitzenmacher; Rasmus Pagh; Ninh Pham
Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
Composite retrieval of heterogeneous web search BIBAFull-Text 119-130
  Horatiu Bota; Ke Zhou; Joemon M. Jose; Mounia Lalmas
Traditional search systems generally present a ranked list of documents as answers to user queries. In aggregated search systems, results from different and increasingly diverse verticals (image, video, news, etc.) are returned to users. For instance, many such search engines return to users both images and web documents as answers to the query "flower". Aggregated search has become a very popular paradigm. In this paper, we go one step further and study a different search paradigm: composite retrieval. Rather than returning and merging results from different verticals, as is the case with aggregated search, we propose to return to users a set of "bundles", where a bundle is composed of "cohesive" results from several verticals. For example, for the query "London Olympic", one bundle per sport could be returned, each containing results extracted from news, videos, images, or Wikipedia. Composite retrieval can promote exploratory search in a way that helps users understand the diversity of results available for a specific query and decide what to explore in more detail. In this paper, we propose and evaluate a variety of approaches to construct bundles that are relevant, cohesive and diverse. Compared with three baselines (traditional "general web only" ranking, federated search ranking and aggregated search), our evaluation results demonstrate significant performance improvement for a highly heterogeneous web collection.
Contextual and dimensional relevance judgments for reusable SERP-level evaluation BIBAFull-Text 131-142
  Peter B. Golbus; Imed Zitouni; Jin Young Kim; Ahmed Hassan; Fernando Diaz
Document-level relevance judgments are a major component in the calculation of effectiveness metrics. Collecting high-quality judgments is therefore a critical step in information retrieval evaluation. However, the nature of and the assumptions underlying relevance judgment collection have not received much attention. In particular, relevance judgments are typically collected for each document in isolation, although users read each document in the context of other documents. In this work, we aim to investigate the nature of relevance judgment collection. We collect relevance labels in both isolated and conditional setting, and ask for judgments in various dimensions of relevance as well as overall relevance. Then we compare the relevance metrics based on various types of judgments with other metrics of quality such as user preference. Our analyses illuminate how these settings for judgment collection affect the quality and the characteristics of the judgments. We also find that the metrics based on conditional judgments show higher correlation with user preference than isolated judgments.

Crowdsourcing 1

Quizz: targeted crowdsourcing with a billion (potential) users BIBAFull-Text 143-154
  Panagiotis G. Ipeirotis; Evgeniy Gabrilovich
We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user's competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement.
   Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.
Community-based Bayesian aggregation models for crowdsourcing BIBAFull-Text 155-164
  Matteo Venanzi; John Guiver; Gabriella Kazai; Pushmeet Kohli; Milad Shokouhi
This paper addresses the problem of extracting accurate labels from crowdsourced datasets, a key challenge in crowdsourcing. Prior work has focused on modeling the reliability of individual workers, for instance, by way of confusion matrices, and using these latent traits to estimate the true labels more accurately. However, this strategy becomes ineffective when there are too few labels per worker to reliably estimate their quality. To mitigate this issue, we propose a novel community-based Bayesian label aggregation model, CommunityBCC, which assumes that crowd workers conform to a few different types, where each type represents a group of workers with similar confusion matrices. We assume that each worker belongs to a certain community, where the worker's confusion matrix is similar to (a perturbation of) the community's confusion matrix. Our model can then learn a set of key latent features: (i) the confusion matrix of each community, (ii) the community membership of each user, and (iii) the aggregated label of each item. We compare the performance of our model against established aggregation methods on a number of large-scale, real-world crowdsourcing datasets. Our experimental results show that our CommunityBCC model consistently outperforms state-of-the-art label aggregation methods, requiring, on average, 50% less data to pass the 90% accuracy mark.
The wisdom of minority: discovering and targeting the right group of workers for crowdsourcing BIBAFull-Text 165-176
  Hongwei Li; Bo Zhao; Ariel Fuxman
Worker reliability is a longstanding issue in crowdsourcing, and the automatic discovery of high quality workers is an important practical problem. Most previous work on this problem mainly focuses on estimating the quality of each individual worker jointly with the true answer of each task. However, in practice, for some tasks, worker quality could be associated with some explicit characteristics of the worker, such as education level, major and age. So the following question arises: how do we automatically discover related worker attributes for a given task, and further utilize the findings to improve data quality? In this paper, we propose a general crowd targeting framework that can automatically discover, for a given task, if any group of workers based on their attributes have higher quality on average; and target such groups, if they exist, for future work on the same task. Our crowd targeting framework is complementary to traditional worker quality estimation approaches. Furthermore, an advantage of our framework is that it is more budget efficient because we are able to target potentially good workers before they actually do the task. Experiments on real datasets show that the accuracy of final prediction can be improved significantly for the same budget (or even less budget in some cases). Our framework can be applied to many real word tasks and can be easily integrated in current crowdsourcing platforms.

Security 2

Monitoring web browsing behavior with differential privacy BIBAFull-Text 177-188
  Liyue Fan; Luca Bonomi; Li Xiong; Vaidy Sunderam
Monitoring web browsing behavior has benefited many data mining applications, such as top-K discovery and anomaly detection. However, releasing private user data to the greater public would concern web users about their privacy, especially after the incident of AOL search log release where anonymization was not correctly done. In this paper, we adopt differential privacy, a strong, provable privacy definition, and show that differentially private aggregates of web browsing activities can be released in real-time while preserving the utility of shared data. Our proposed algorithms utilize the rich correlation of the time series of aggregated data and adopt a state-space approach to estimate the underlying, true aggregates from the perturbed values by the differential privacy mechanism. We evaluate our algorithms with real-world web browsing data. Utility evaluations with three metrics demonstrate that the quality of the private, released data by our solutions closely resembles that of the original, unperturbed aggregates.
Quite a mess in my cookie jar!: leveraging machine learning to protect web authentication BIBAFull-Text 189-200
  Stefano Calzavara; Gabriele Tolomei; Michele Bugliesi; Salvatore Orlando
Browser-based defenses have recently been advocated as an effective mechanism to protect web applications against the threats of session hijacking, fixation, and related attacks. In existing approaches, all such defenses ultimately rely on client-side heuristics to automatically detect cookies containing session information, to then protect them against theft or otherwise unintended use. While clearly crucial to the effectiveness of the resulting defense mechanisms, these heuristics have not, as yet, undergone any rigorous assessment of their adequacy. In this paper, we conduct the first such formal assessment, based on a gold set of cookies we collect from 70 popular websites of the Alexa ranking. To obtain the gold set, we devise a semi-automatic procedure that draws on a novel notion of authentication token, which we introduce to capture multiple web authentication schemes. We test existing browser-based defenses in the literature against our gold set, unveiling several pitfalls both in the heuristics adopted and in the methods used to assess them. We then propose a new detection method based on supervised learning, where our gold set is used to train a binary classifier, and report on experimental evidence that our method outperforms existing proposals. Interestingly, the resulting classification, together with our hands-on experience in the construction of the gold set, provides new insight on how web authentication is implemented in practice.
Reconciling mobile app privacy and usability on smartphones: could user privacy profiles help? BIBAFull-Text 201-212
  Bin Liu; Jialiu Lin; Norman Sadeh
As they compete for developers, mobile app ecosystems have been exposing a growing number of APIs through their software development kits. Many of these APIs involve accessing sensitive functionality and/or user data and require approval by users. Android for instance allows developers to select from over 130 possible permissions. Expecting users to review and possibly adjust settings related to these permissions has proven unrealistic. In this paper, we report on the results of a study analyzing people's privacy preferences when it comes to granting permissions to different mobile apps. Our results suggest that, while people's mobile app privacy preferences are diverse, a relatively small number of profiles can be identified that offer the promise of significantly simplifying the decisions mobile users have to make. Specifically, our results are based on the analysis of settings of 4.8 million smartphone users of a mobile security and privacy platform. The platform relies on a rooted version of Android where users are allowed to choose between "granting", "denying" or "requesting to be dynamically prompted" when it comes to granting 12 different Android permissions to mobile apps they have downloaded.

Social networks 1

Random walks based modularity: application to semi-supervised learning BIBAFull-Text 213-224
  Robin Devooght; Amin Mantrach; Ilkka Kivimäki; Hugues Bersini; Alejandro Jaimes; Marco Saerens
Although criticized for some of its limitations, modularity remains a standard measure for analyzing social networks. Quantifying the statistical surprise in the arrangement of the edges of the network has led to simple and powerful algorithms. However, relying solely on the distribution of edges instead of more complex structures such as paths limits the extent of modularity. Indeed, recent studies have shown restrictions of optimizing modularity, for instance its resolution limit. We introduce here a novel, formal and well-defined modularity measure based on random walks. We show how this modularity can be computed from paths induced by the graph instead of the traditionally used edges. We argue that by computing modularity on paths instead of edges, more informative features can be extracted from the network. We verify this hypothesis on a semi-supervised classification procedure of the nodes in the network, where we show that, under the same settings, the features of the random walk modularity help to classify better than the features of the usual modularity. Additionally, the proposed approach outperforms the classical label propagation procedure on two data sets of labeled social networks.
High quality, scalable and parallel community detection for large real graphs BIBAFull-Text 225-236
  Arnau Prat-Pérez; David Dominguez-Sal; Josep-Lluis Larriba-Pey
Community detection has arisen as one of the most relevant topics in the field of graph mining, principally for its applications in domains such as social or biological networks analysis. Different community detection algorithms have been proposed during the last decade, approaching the problem from different perspectives. However, existing algorithms are, in general, based on complex and expensive computations, making them unsuitable for large graphs with millions of vertices and edges such as those usually found in the real world.
   In this paper, we propose a novel disjoint community detection algorithm called Scalable Community Detection (SCD). By combining different strategies, SCD partitions the graph by maximizing the Weighted Community Clustering (WCC), a recently proposed community detection metric based on triangle analysis. Using real graphs with ground truth overlapped communities, we show that SCD outperforms the current state of the art proposals (even those aimed at finding overlapping communities) in terms of quality and performance. SCD provides the speed of the fastest algorithms and the quality in terms of NMI and F1Score of the most accurate state of the art proposals. We show that SCD is able to run up to two orders of magnitude faster than practical existing solutions by exploiting the parallelism of current multi-core processors, enabling us to process graphs of unprecedented size in short execution times.
Dynamic and historical shortest-path distance queries on large evolving networks by pruned landmark labeling BIBAFull-Text 237-248
  Takuya Akiba; Yoichi Iwata; Yuichi Yoshida
We propose two dynamic indexing schemes for shortest-path and distance queries on large time-evolving graphs, which are useful in a wide range of important applications such as real-time network-aware search and network evolution analysis. To the best of our knowledge, these methods are the first practical exact indexing methods to efficiently process distance queries and dynamic graph updates. We first propose a dynamic indexing scheme for queries on the last snapshot. The scalability and efficiency of its offline indexing algorithm and query algorithm are competitive even with previous static methods. Meanwhile, the method is dynamic, that is, it can incrementally update indices as the graph changes over time. Then, we further design another dynamic indexing scheme that can also answer two kinds of historical queries with regard to not only the latest snapshot but also previous snapshots.
   Through extensive experiments on real and synthetic evolving networks, we show the scalability and efficiency of our methods. Specifically, they can construct indices from large graphs with millions of vertices, answer queries in microseconds, and update indices in milliseconds.


To gather together for a better world: understanding and leveraging communities in micro-lending recommendation BIBAFull-Text 249-260
  Jaegul Choo; Daniel Lee; Bistra Dilkina; Hongyuan Zha; Haesun Park
Micro-finance organizations provide non-profit lending opportunities to mitigate poverty by financially supporting impoverished, yet skilled entrepreneurs who are in desperate need of an institution that lends to them. In Kiva.org, a widely-used crowd-funded micro-financial service, a vast amount of micro-financial activities are done by lending teams, and thus, understanding their diverse characteristics is crucial in maintaining a healthy micro-finance ecosystem. As the first step for this goal, we model different lending teams by using a maximum-entropy distribution approach based on a wealthy set of heterogeneous information regarding micro-financial transactions available at Kiva. Based on this approach, we achieved a competitive performance in predicting the lending activities for the top 200 teams. Furthermore, we provide deep insight about the characteristics of lending teams by analyzing the resulting team-specific lending models. We found that lending teams are generally more careful in selecting loans by a loan's geo-location, a borrower's gender, a field partner's reliability, etc., when compared to lenders without team affiliations. In addition, we identified interesting lending behaviors of different lending teams based on lenders' background and interest such as their ethnic, religious, linguistic, educational, regional, and occupational aspects. Finally, using our proposed model, we tackled a novel problem of lending team recommendation and showed its promising performance results.
Recommending investors for crowdfunding projects BIBAFull-Text 261-270
  Jisun An; Daniele Quercia; Jon Crowcroft
To bring their innovative ideas to market, those embarking in new ventures have to raise money, and, to do so, they have often resorted to banks and venture capitalists. Nowadays, they have an additional option: that of crowdfunding. The name refers to the idea that funds come from a network of people on the Internet who are passionate about supporting others' projects. One of the most popular crowdfunding sites is Kickstarter. In it, creators post descriptions of their projects and advertise them on social media sites (mainly Twitter), while investors look for projects to support. The most common reason for project failure is the inability of founders to connect with a sufficient number of investors, and that is mainly because hitherto there has not been any automatic way of matching creators and investors. We thus set out to propose different ways of recommending investors found on Twitter for specific Kickstarter projects. We do so by conducting hypothesis-driven analyses of pledging behavior and translate the corresponding findings into different recommendation strategies. The best strategy achieves, on average, 84% of accuracy in predicting a list of potential investors' Twitter accounts for any given project. Our findings also produced key insights about the whys and wherefores of investors deciding to support innovative efforts.
Understanding spatial homophily: the case of peer influence and social selection BIBAFull-Text 271-282
  Ke Zhang; Konstantinos Pelechrinis
Homophily is a phenomenon observed very frequently in social networks and is related with the inclination of people to be involved with others that exhibit similar characteristics. The roots of homophily can be subtle and are mainly traced back to two mechanisms: (i) social selection and (ii) peer influence. Decomposing the effects of each of these mechanisms requires analysis of longitudinal data. This has been a burden to similar studies in traditional social sciences due to the hardness of collecting such information. However, the proliferation of online social media has enabled the collection of massive amounts of information related with human activities. In this work, we are interested in examining the forces of the above mechanisms in the context of the locations visited by people. For our study, we use a longitudinal dataset collected from Gowalla, a location-based social network (LBSN). LBSNs, unlike other online social media, bond users' online interactions with their activities in real-world, physical locations. Prior work on LBSNs has focused on the influence of geographical constraints on the formation of social ties. On the contrary, in this paper, we perform a microscopic study of the peer influence and social selection mechanisms in LBSNs. Our analysis indicates that while the similarity of friends' spatial trails at a geographically global scale cannot be attributed to peer influence, the latter can explain up to 40% of the geographically localized similarity between friends. Moreover, this percentage depends on the type of locations we examine, and it can be even higher for specific categories (e.g., nightlife spots). Finally, we find that the social selection mechanism, is only triggered by places that exhibit specific network characteristics. We believe that our work can have significant implications on obtaining a deeper understanding of the way that people create friendships, act and move in real space, which can further facilitate and enhance applications such as recommender systems, trip planning and marketing.

User interaction

Designing and deploying online field experiments BIBAFull-Text 283-292
  Eytan Bakshy; Dean Eckles; Michael S. Bernstein
Online experiments are widely used to compare specific design alternatives, but they can also be used to produce generalizable knowledge and inform strategic decision making. Doing so often requires sophisticated experimental designs, iterative refinement, and careful logging and analysis. Few tools exist that support these needs. We thus introduce a language for online field experiments called PlanOut. PlanOut separates experimental design from application code, allowing the experimenter to concisely describe experimental designs, whether common "A/B tests" and factorial designs, or more complex designs involving conditional logic or multiple experimental units. These latter designs are often useful for understanding causal mechanisms involved in user behaviors. We demonstrate how experiments from the literature can be implemented in PlanOut, and describe two large field experiments conducted on Facebook with PlanOut. For common scenarios in which experiments are run iteratively and in parallel, we introduce a namespaced management system that encourages sound experimental practice.
Local business ambience characterization through mobile audio sensing BIBAFull-Text 293-304
  He Wang; Dimitrios Lymberopoulos; Jie Liu
Local search users today decide what business to visit solely based on distance information, and business ratings that can be sparse or stale. We believe that when users search for local businesses, such as bars or restaurants, they need to know more about the ambience of each business, such as how crowded it is, how loud and of what type the music it plays is, as well as how loud the human chatter in the business is. Unfortunately, this information doesn't exist today. In this paper, we propose to automatically crowdsource such rich, local business ambience metadata through real user check-in events. Every time a user checks into a business, the phone is in user's hands, and the phone's sensors can sense the business environment. We leverage the phone's microphone during this time to infer the occupancy and human chatter levels, the music type, as well as the music and noise levels in the business. As people check-in to businesses throughout the day, business metadata can be automatically updated over time, enabling a new generation of local search experience. Using approximately 150 audio traces collected from real businesses of various types over a period of 3 months, we show that by properly extracting the temporal and frequency signatures of the audio signal, it is feasible to train models that can simultaneously infer occupancy, human chatter, music, and noise levels in a business, with higher than 79% accuracy.
Social bootstrapping: how pinterest and Last.fm social communities benefit by borrowing links from Facebook BIBAFull-Text 305-314
  Changtao Zhong; Mostafa Salehi; Sunil Shah; Marius Cobzarenco; Nishanth Sastry; Meeyoung Cha
How does one develop a new online community that is highly engaging to each user and promotes social interaction? A number of websites offer friend-finding features that help users bootstrap social networks on the website by copying links from an established network like Facebook or Twitter. This paper quantifies the extent to which such social bootstrapping is effective in enhancing a social experience of the website. First, we develop a stylised analytical model that suggests that copying tends to produce a giant connected component (i.e., a connected community) quickly and preserves properties such as reciprocity and clustering, up to a linear multiplicative factor. Second, we use data from two websites, Pinterest and Last.fm, to empirically compare the subgraph of links copied from Facebook to links created natively. We find that the copied subgraph has a giant component, higher reciprocity and clustering, and confirm that the copied connections see higher social interactions. However, the need for copying diminishes as users become more active and influential. Such users tend to create links natively on the website, to users who are more similar to them than their Facebook friends. Our findings give new insights into understanding how bootstrapping from established social networks can help engage new users by enhancing social interactivity.

Recommendation and predictions

Modeling contextual agreement in preferences BIBAFull-Text 315-326
  Loc Do; Hady W. Lauw
Personalization, or customizing the experience of each individual user, is seen as a useful way to navigate the huge variety of choices on the Web today. A key tenet of personalization is the capacity to model user preferences. The paradigm has shifted from that of individual preferences, whereby we look at a user's past activities alone, to that of shared preferences, whereby we model the similarities in preferences between pairs of users (e.g., friends, people with similar interests). However, shared preferences are still too granular, because it assumes that a pair of users would share preferences across all items. We therefore postulate the need to pay attention to "context", which refers to the specific item on which the preferences between two users are to be estimated. In this paper, we propose a generative model for contextual agreement in preferences. For every triplet consisting of two users and an item, the model estimates both the prior probability of agreement between the two users, as well as the posterior probability of agreement with respect to the item at hand. The model parameters are estimated from ratings data. To extend the model to unseen ratings, we further propose several matrix factorization techniques focused on predicting agreement, rather than ratings. Experiments on real-life data show that our model yields context-specific similarity values that perform better on a prediction task than models relying on shared preferences.
A Monte Carlo algorithm for cold start recommendation BIBAFull-Text 327-336
  Yu Rong; Xiao Wen; Hong Cheng
Recommendation systems have been widely used in E-commerce sites, social networks, etc. One of the core tasks in recommendation systems is to predict the users' ratings on items. Although many models and algorithms have been proposed, how to make accurate prediction for new users with extremely few rating records still remains a big challenge, which is called the cold start problem. Many existing methods utilize additional information, such as social graphs, to cope with the cold start problem. However, the side information may not always be available. In contrast to such methods, we propose a more general solution to address the cold start problem based on the observed user rating records only. Specifically we define a random walk on a bipartite graph of users and items to simulate the preference propagation among users, in order to alleviate the data sparsity problem for cold start users. Then we propose a Monte Carlo algorithm to estimate the similarity between different users. This algorithm takes a precomputation approach, and thus can efficiently compute the user similarity given any new user for rating prediction. In addition, our algorithm can easily handle dynamic updates and can be parallelized naturally, which are crucial for large recommendation systems. Theoretical analysis is presented to demonstrate the efficiency and effectiveness of our algorithm, and extensive experiments also confirm our theoretical findings.
Was this review helpful to you?: it depends! context and voting patterns in online content BIBAFull-Text 337-348
  Ruben Sipos; Arpita Ghosh; Thorsten Joachims
When a website hosting user-generated content asks users a straightforward question -- "Was this content helpful?" with one "Yes" and one "No" button as the two possible answers -- one might expect to get a straightforward answer. In this paper, we explore how users respond to this question and find that their responses are not quite straightforward after all. Using data from Amazon product reviews, we present evidence that users do not make absolute, independent voting decisions based on individual review quality alone. Rather, whether users vote at all, as well as the polarity of their vote for any given review, depends on the context in which they view it -- reviews receive a larger overall number of votes when they are 'misranked', and the polarity of votes becomes more positive/negative when the review is ranked lower/higher than it deserves. We distill these empirical findings into a new probabilistic model of rating behavior that includes the dependence of rating decisions on context. Understanding and formally modeling voting behavior is crucial for designing learning mechanisms and algorithms for review ranking, and we conjecture that many of our findings also apply to user behavior in other online content-rating settings.

Web mining 1

Reduce and aggregate: similarity ranking in multi-categorical bipartite graphs BIBAFull-Text 349-360
  Alessandro Epasto; Jon Feldman; Silvio Lattanzi; Stefano Leonardi; Vahab Mirrokni
We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g. the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including # common neighbors, and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
Robust multivariate autoregression for anomaly detection in dynamic product ratings BIBAFull-Text 361-372
  Nikou Günnemann; Stephan Günnemann; Christos Faloutsos
User provided rating data about products and services is one key feature of websites such as Amazon, TripAdvisor, or Yelp. Since these ratings are rather static but might change over time, a temporal analysis of rating distributions provides deeper insights into the evolution of a products' quality. Given a time-series of rating distributions, in this work, we answer the following questions: (1) How to detect the base behavior of users regarding a product's evaluation over time? (2) How to detect points in time where the rating distribution differs from this base behavior, e.g., due to attacks or spontaneous changes in the product's quality? To achieve these goals, we model the base behavior of users regarding a product as a latent multivariate autoregressive process. This latent behavior is mixed with a sparse anomaly signal finally leading to the observed data. We propose an efficient algorithm solving our objective and we present interesting findings on various real world datasets.
Mining novelty-seeking trait across heterogeneous domains BIBAFull-Text 373-384
  Fuzheng Zhang; Nicholas Jing Yuan; Defu Lian; Xing Xie
An incisive understanding of personal psychological traits is not only essential to many scientific disciplines, but also has a profound business impact on online recommendation. Recent studies in psychology suggest that novelty-seeking trait is highly related to consumer behavior. In this paper, we focus on understanding individual novelty-seeking trait embodied at different levels and across heterogeneous domains. Unlike the questionnaire-based methods widely adopted in the past, we first present a computational framework, Novel Seeking Model (NSM), for exploring the novelty-seeking trait implied by observable activities. Then, we explore the novelty-seeking trait in two heterogeneous domains: check-in behavior in location based social networks, which reflects mobility patterns in the physical world, and online shopping behavior on e-commerce sites, which reflects consumption concepts in economic activities. To demonstrate the effectiveness of NSM, we conducted extensive experiments, with a large dataset covering the two-domain activities for hundreds of thousands of individuals. Our results suggest that NSM offers a powerful paradigm for 1) presenting an effective measurement of a personality trait that can explicitly explain the deviation of individuals from the habits of individuals and crowds; 2) uncovering the correlation of novelty-seeking trait at different levels and across heterogeneous domains. The proposed method provides emerging implications for personalized cross-domain recommendation and targeted advertising.

Content analysis 1 -- entities

Discovering emerging entities with ambiguous names BIBAFull-Text 385-396
  Johannes Hoffart; Yasemin Altun; Gerhard Weikum
Knowledge bases (KB's) contain data about a large number of people, organizations, and other entities. However, this knowledge can never be complete due to the dynamics of the ever-changing world: new companies are formed every day, new songs are composed every minute and become of interest for addition to a KB. To keep up with the real world's entities, the KB maintenance process needs to continuously discover newly emerging entities in news and other Web streams. In this paper we focus on the most difficult case where the names of new entities are ambiguous. This raises the technical problem to decide whether an observed name refers to a known entity or represents a new entity. This paper presents a method to solve this problem with high accuracy. It is based on a new model of measuring the confidence of mapping an ambiguous mention to an existing entity, and a new model of representing a new entity with the same ambiguous name as a set of weighted keyphrases. The method can handle both Wikipedia-derived entities that typically constitute the bulk of large KB's as well as entities that exist only in other Web sources such as online communities about music or movies. Experiments show that our entity discovery method outperforms previous methods for coping with out-of-KB entities (called unlinkable in entity linking).
Effective named entity recognition for idiosyncratic web collections BIBAFull-Text 397-408
  Roman Prokofyev; Gianluca Demartini; Philippe Cudré-Mauroux
Named Entity Recognition (NER) plays an important role in a variety of online information management tasks including text categorization, document clustering, and faceted search. While recent NER systems can achieve near-human performance on certain documents like news articles, they still remain highly domain-specific and thus cannot effectively identify entities such as original technical concepts in scientific documents. In this work, we propose novel approaches for NER on distinctive document collections (such as scientific articles) based on n-grams inspection and classification. We design and evaluate several entity recognition features -- ranging from well-known part-of-speech tags to n-gram co-location statistics and decision trees -- to classify candidates. In addition, we show how the use of external knowledge bases (either specific like DBLP or generic like DBPedia) can be leveraged to improve the effectiveness of NER for idiosyncratic collections. We evaluate our system on two test collections created from a set of Computer Science and Physics papers and compare it against state-of-the-art supervised methods. Experimental results show that a careful combination of the features we propose yield up to 85% NER accuracy over scientific collections and substantially outperforms state-of-the-art approaches such as those based on maximum entropy.
Deduplicating a places database BIBAFull-Text 409-418
  Nilesh Dalvi; Marian Olteanu; Manish Raghavan; Philip Bohannon
We consider the problem of resolving duplicates in a database of places, where a place is defined as any entity that has a name and a physical location. When other auxiliary attributes like phone and full address are not available, deduplication based solely on names and approximate location becomes an exceptionally challenging problem that requires both domain knowledge as well an local geographical knowledge. For example, the pairs "Newpark Mall Gap Outlet" and "Newpark Mall Sears Outlet" have a high string similarity, but determining that they are different requires the domain knowledge that they represent two different store names in the same mall. Similarly, in most parts of the world, a local business called "Central Park Cafe" might simply be referred to by "Central Park", except in New York, where the keyword "Cafe" in the name becomes important to differentiate it from the famous park in the city.
   In this paper, we present a language model that can encapsulate both domain knowledge as well as local geographical knowledge. We also present unsupervised techniques that can learn such a model from a database of places. Finally, we present deduplication techniques based on such a model, and we demonstrate, using real datasets, that our techniques are much more effective than simple TF-IDF based models in resolving duplicates. Our techniques are used in production at Facebook for deduplicating the Places database.

User behavior

The dynamics of repeat consumption BIBAFull-Text 419-430
  Ashton Anderson; Ravi Kumar; Andrew Tomkins; Sergei Vassilvitskii
We study the patterns by which a user consumes the same item repeatedly over time, in a wide variety domains ranging from check-ins at the same business location to re-watches of the same video. We find that recency of consumption is the strongest predictor of repeat consumption. Based on this, we develop a model by which the item from $t$ timesteps ago is reconsumed with a probability proportional to a function of t. We study theoretical properties of this model, develop algorithms to learn reconsumption likelihood as a function of t, and show a strong fit of the resulting inferred function via a power law with exponential cutoff. We then introduce a notion of item quality, show that it alone underperforms our recency-based model, and develop a hybrid model that predicts user choice based on a combination of recency and quality. We show how the parameters of this model may be jointly estimated, and show that the resulting scheme outperforms other alternatives.
From devices to people: attribution of search activity in multi-user settings BIBAFull-Text 431-442
  Ryen W. White; Ahmed Hassan; Adish Singla; Eric Horvitz
Online services rely on unique identifiers of machines to tailor offerings to their users. An implicit assumption is made that each machine identifier maps to an individual. However, shared machines are common, leading to interwoven search histories and noisy signals for applications such as personalized search and advertising. We present methods for attributing search activity to individual searchers. Using ground truth data for a sample of almost four million U.S. Web searchers -- containing both machine identifiers and person identifiers -- we show that over half of the machine identifiers comprise the queries of multiple people. We characterize variations in features of topic, time, and other aspects such as the complexity of the information sought per the number of searchers on a machine, and show significant differences in all measures. Based on these insights, we develop models to accurately estimate when multiple people contribute to the logs ascribed to a single machine identifier. We also develop models to cluster search behavior on a machine, allowing us to attribute historical data accurately and automatically assign new search activity to the correct searcher. The findings have implications for the design of applications such as personalized search and advertising that rely heavily on machine identifiers to custom-tailor their services.
Demographics, weather and online reviews: a study of restaurant recommendations BIBAFull-Text 443-454
  Saeideh Bakhshi; Partha Kanuparthy; Eric Gilbert
Online recommendation sites are valuable information sources that people contribute to, and often use to choose restaurants. However, little is known about the dynamics behind participation in these online communities and how the recommendations in these communities are formed. In this work, we take a first look at online restaurant recommendation communities to study what endogenous (i.e., related to entities being reviewed) and exogenous factors influence people's participation in the communities, and to what extent. We analyze an online community corpus of 840K restaurants and their 1.1M associated reviews from 2002 to 2011, spread across every U.S. state. We construct models for number of reviews and ratings by community members, based on several dimensions of endogenous and exogenous factors. We find that while endogenous factors such as restaurant attributes (e.g., meal, price, service) affect recommendations, surprisingly, exogenous factors such as demographics (e.g., neighborhood diversity, education) and weather (e.g., temperature, rain, snow, season) also exert a significant effect on reviews. We find that many of the effects in online communities can be explained using offline theories from experimental psychology. Our study is the first to look at exogenous factors and how it related to online online restaurant reviews. It has implications for designing online recommendation sites, and in general, social media and online communities.

Semantic web 1

TripleProv: efficient processing of lineage queries in a native RDF store BIBAFull-Text 455-466
  Marcin Wylot; Philippe Cudre-Mauroux; Paul Groth
Given the heterogeneity of the data one can find on the Linked Data cloud, being able to trace back the provenance of query results is rapidly becoming a must-have feature of RDF systems. While provenance models have been extensively discussed in recent years, little attention has been given to the efficient implementation of provenance-enabled queries inside data stores. This paper introduces TripleProv: a new system extending a native RDF store to efficiently handle such queries. TripleProv implements two different storage models to physically co-locate lineage and instance data, and for each of them implements algorithms for tracing provenance at two granularity levels. In the following, we present the overall architecture of our system, its different lineage storage models, and the various query execution strategies we have implemented to efficiently answer provenance-enabled queries. In addition, we present the results of a comprehensive empirical evaluation of our system over two different datasets and workloads.
RDF analytics: lenses over semantic graphs BIBAFull-Text 467-478
  Dario Colazzo; François Goasdoué; Ioana Manolescu; Alexandra Roatis
The development of Semantic Web (RDF) brings new requirements for data analytics tools and methods, going beyond querying to semantics-rich analytics through warehouse-style tools. In this work, we fully redesign, from the bottom up, core data analytics concepts and tools in the context of RDF data, leading to the first complete formal framework for warehouse-style RDF analytics. Notably, we define i) analytical schemas tailored to heterogeneous, semantics-rich RDF graph, ii) analytical queries which (beyond relational cubes) allow flexible querying of the data and the schema as well as powerful aggregation and iii) OLAP-style operations. Experiments on a fully-implemented platform demonstrate the practical interest of our approach.
Formalisation and experiences of R2RML-based SPARQL to SQL query translation using morph BIBAFull-Text 479-490
  Freddy Priyatna; Oscar Corcho; Juan Sequeda
R2RML is used to specify transformations of data available in relational databases into materialised or virtual RDF datasets. SPARQL queries evaluated against virtual datasets are translated into SQL queries according to the R2RML mappings, so that they can be evaluated over the underlying relational database engines. In this paper we describe an extension of a well-known algorithm for SPARQL to SQL translation, originally formalised for RDBMS-backed triple stores, that takes into account R2RML mappings. We present the result of our implementation using queries from a synthetic benchmark and from three real use cases, and show that SPARQL queries can be in general evaluated as fast as the SQL queries that would have been generated by SQL experts if no R2RML mappings had been used.

Web mining 2

Codewebs: scalable homework search for massive open online programming courses BIBAFull-Text 491-502
  Andy Nguyen; Christopher Piech; Jonathan Huang; Leonidas Guibas
Massive open online courses (MOOCs), one of the latest internet revolutions have engendered hope that constant iterative improvement and economies of scale may cure the "cost disease" of higher education. While scalable in many ways, providing feedback for homework submissions (particularly open-ended ones) remains a challenge in the online classroom. In courses where the student-teacher ratio can be ten thousand to one or worse, it is impossible for instructors to personally give feedback to students or to understand the multitude of student approaches and pitfalls. Organizing and making sense of massive collections of homework solutions is thus a critical web problem. Despite the challenges, the dense solution space sampling in highly structured homeworks for some MOOCs suggests an elegant solution to providing quality feedback to students on a massive scale.
   We outline a method for decomposing online homework submissions into a vocabulary of "code phrases", and based on this vocabulary, we architect a queryable index that allows for fast searches into the massive dataset of student homework submissions. To demonstrate the utility of our homework search engine we index over a million code submissions from users worldwide in Stanford's Machine Learning MOOC and (a) semi-automatically learn shared structure amongst homework submissions and (b) generate specific feedback for student mistakes.
   Codewebs is a tool that leverages the redundancy of densely sampled, highly structured homeworks in order to force-multiply teacher effort. Giving articulate, instant feedback is a crucial component of the online learning process and thus by building a homework search engine we hope to take a step towards higher quality free education.
Joint question clustering and relevance prediction for open domain non-factoid question answering BIBAFull-Text 503-514
  Snigdha Chaturvedi; Vittorio Castelli; Radu Florian; Ramesh M. Nallapati; Hema Raghavan
Web searches are increasingly formulated as natural language questions, rather than keyword queries. Retrieving answers to such questions requires a degree of understanding of user expectations. An important step in this direction is to automatically infer the type of answer implied by the question, e.g., factoids, statements on a topic, instructions, reviews, etc. Answer Type taxonomies currently exist for factoid-style questions, but not for open-domain questions. Building taxonomies for non-factoid questions is a harder problem since these questions can come from a very broad semantic space. A few attempts have been made to develop taxonomies for non-factoid questions, but these tend to be too narrow or domain specific. In this paper, we address this problem by modeling the Answer Type as a latent variable that is learned in a data-driven fashion, allowing the model to be more adaptive to new domains and data sets. We propose approaches that detect the relevance of candidate answers to a user question by jointly 'clustering' questions according to the hidden variable, and modeling relevance conditioned on this hidden variable.
   In this paper we propose 3 new models: (a) Logistic Regression Mixture (LRM), (b) Glocal Logistic Regression Mixture (G-LRM) and (c) Mixture Glocal Logistic Regression Mixture (MG-LRM) that automatically learn question-clusters and cluster-specific relevance models. All three models perform better than a baseline relevance model that uses explicit Answer Type categories predicted by a supervised Answer-Type classifier, on a newsgroups dataset. Our models also perform better than a baseline relevance model that does not use any answer-type information on a blogs dataset.
Knowledge base completion via search-based question answering BIBAFull-Text 515-526
  Robert West; Evgeniy Gabrilovich; Kevin Murphy; Shaohua Sun; Rahul Gupta; Dekang Lin
Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search-based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa's mother, we could ask the query 'who is the mother of Frank Zappa'. However, this is likely to return 'The Mothers of Invention', which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa's place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.

Content analysis 2 -- topics

A time-based collective factorization for topic discovery and monitoring in news BIBAFull-Text 527-538
  Carmen K. Vaca; Amin Mantrach; Alejandro Jaimes; Marco Saerens
Discovering and tracking topic shifts in news constitutes a new challenge for applications nowadays. Topics evolve, emerge and fade, making it more difficult for the journalist -- or the press consumer -- to decrypt the news. For instance, the current Syrian chemical crisis has been the starting point of the UN Russian initiative and also the revival of the US France alliance. A topical mapping representing how the topics evolve in time would be helpful to contextualize information. As far as we know, few topic tracking systems can provide such temporal topic connections. In this paper, we introduce a novel framework inspired from Collective Factorization for online topic discovery able to connect topics between different time-slots. The framework learns jointly the topics evolution and their time dependencies. It offers the user the ability to control, through one unique hyper-parameter, the tradeoff between the past accumulated knowledge and the current observed data. We show, on semi-synthetic datasets and on Yahoo News articles, that our method is competitive with state-of-the-art techniques while providing a simple way to monitor topics evolution (including emerging and disappearing topics).
The dual-sparse topic model: mining focused topics and focused terms in short text BIBAFull-Text 539-550
  Tianyi Lin; Wentao Tian; Qiaozhu Mei; Hong Cheng
Topic modeling has been proved to be an effective method for exploratory text mining. It is a common assumption of most topic models that a document is generated from a mixture of topics. In real-world scenarios, individual documents usually concentrate on several salient topics instead of covering a wide variety of topics. A real topic also adopts a narrow range of terms instead of a wide coverage of the vocabulary. Understanding this sparsity of information is especially important for analyzing user-generated Web content and social media, which are featured as extremely short posts and condensed discussions. In this paper, we propose a dual-sparse topic model that addresses the sparsity in both the topic mixtures and the word usage. By applying a "Spike and Slab" prior to decouple the sparsity and smoothness of the document-topic and topic-word distributions, we allow individual documents to select a few focused topics and a topic to select focused terms, respectively. Experiments on different genres of large corpora demonstrate that the dual-sparse topic model outperforms both classical topic models and existing sparsity-enhanced topic models. This improvement is especially notable on collections of short documents.
Acquisition of open-domain classes via intersective semantics BIBAFull-Text 551-562
  Marius Pasca
A weakly-supervised method acquires fine-grained class labels that do not occur verbatim in the input data or underlying text collection. The method generates more specific class labels (gold mining companies listed on the Toronto stock exchange) that capture the semantics of the underlying classes, out of pairs of input class labels (companies listed on the Toronto stock exchange, gold mining companies) available for an instance (Golden Star Resources). When applied to Wikipedia articles and their categories, the method generates new categories for existing articles, and expands existing categories with additional articles.

Software infrastructure

Automated runtime recovery for QoS-based service composition BIBAFull-Text 563-574
  Tian Huat Tan; Manman Chen; Étienne André; Jun Sun; Yang Liu; Jin Song Dong
Service composition uses existing service-based applications as components to achieve a business goal. The composite service operates in a highly dynamic environment; hence, it can fail at any time due to the failure of component services. Service composition languages such as BPEL provide a compensation mechanism to rollback the error. But such a compensation mechanism has several issues. For instance, it cannot guarantee the functional properties of the composite service after compensation. In this work, we propose an automated approach based on a genetic algorithm to calculate the recovery plan that could guarantee the satisfaction of functional properties of the composite service after recovery. Given a composite service with large state space, the proposed method does not require exploring the full state space of the composite service; therefore, it allows efficient selection of recovery plan. In addition, the selection of recovery plans is based on their quality of service (QoS). A QoS-optimal recovery plan allows effective recovery from the state of failure. Our approach has been evaluated on real-world case studies, and has shown promising results.
Similarity-based web browser optimization BIBAFull-Text 575-584
  Haoyu Wang; Mengxin Liu; Yao Guo; Xiangqun Chen
The performance of web browsers has become a major bottleneck when dealing with complex webpages. Many calculation redundancies exist when processing similar webpages, thus it is possible to cache and reuse previously calculated intermediate results to improve web browser performance significantly. In this paper, we propose a similarity-based optimization approach to improve webpage processing performance of web browsers. Through caching and reusing of style properties calculated previously, we are able to eliminate the redundancies caused by processing similar webpages from the same website. We propose a tree-structured architecture to store style properties to facilitate efficient caching and reuse. Experiments on webpages of various websites show that the proposed technique can speed up the webpage loading process by up to 68% and reduce the redundant style calculations by up to 77% for the first visit to a webpage with almost negligible overhead.
Temporal QoS-aware web service recommendation via non-negative tensor factorization BIBAFull-Text 585-596
  Wancai Zhang; Hailong Sun; Xudong Liu; Xiaohui Guo
With the rapid growth of Web Service in the past decade, the issue of QoS-aware Web service recommendation is becoming more and more critical. Since the Web service QoS information collection work requires much time and effort, and is sometimes even impractical, the service QoS value is usually missing. There are some work to predict the missing QoS value using traditional collaborative filtering methods based on user-service static model. However, the QoS value is highly related to the invocation context (e.g., QoS value are various at different time). By considering the third dynamic context information, a Temporal QoS-aware Web Service Recommendation Framework is presented to predict missing QoS value under various temporal context. Further, we formalize this problem as a generalized tensor factorization model and propose a Non-negative Tensor Factorization (NTF) algorithm which is able to deal with the triadic relations of user-service-time model. Extensive experiments are conducted based on our real-world Web service QoS dataset collected on Planet-Lab, which is comprised of service invocation response-time and throughput value from 343 users on 5817 Web services at 32 time periods. The comprehensive experimental analysis shows that our approach achieves better prediction accuracy than other approaches.

Online experiments & advertising

Adscape: harvesting and analyzing online display ads BIBAFull-Text 597-608
  Paul Barford; Igor Canadi; Darja Krushevskaja; Qiang Ma; S. Muthukrishnan
Over the past decade, advertising has emerged as the primary source of revenue for many web sites and apps. In this paper we report a first-of-its-kind study that seeks to broadly understand the features, mechanisms and dynamics of display advertising on the web -- i.e., the Adscape. Our study takes the perspective of users who are the targets of display ads shown on web sites. We develop a scalable crawling capability that enables us to gather the details of display ads including creatives and landing pages. Our crawling strategy is focused on maximizing the number of unique ads harvested. Of critical importance to our study is the recognition that a user's profile (i.e., browser profile and cookies) can have a significant impact on which ads are shown. We deploy our crawler over a variety of websites and profiles and this yields over 175K distinct display ads. We find that while targeting is widely used, there remain many instances in which delivered ads do not depend on user profile; further, ads vary more over user profiles than over websites. We also assess the population of advertisers seen and identify over 3.7K distinct entities from a variety of business segments. Finally, we find that when targeting is used, the specific types of ads delivered generally correspond with the details of user profiles, and also on users' patterns of visit.
Statistical inference in two-stage online controlled experiments with treatment selection and validation BIBAFull-Text 609-618
  Alex Deng; Tianxi Li; Yu Guo
Online controlled experiments, also called A/B testing, have been established as the mantra for data-driven decision making in many web-facing companies. A/B Testing support decision making by directly comparing two variants at a time. It can be used for comparison between (1) two candidate treatments and (2) a candidate treatment and an established control. In practice, one typically runs an experiment with multiple treatments together with a control to make decision for both purposes simultaneously. This is known to have two issues. First, having multiple treatments increases false positives due to multiple comparison. Second, the selection process causes an upward bias in estimated effect size of the best observed treatment. To overcome these two issues, a two stage process is recommended, in which we select the best treatment from the first screening stage and then run the same experiment with only the selected best treatment and the control in the validation stage. Traditional application of this two-stage design often focus only on results from the second stage. In this paper, we propose a general methodology for combining the first screening stage data together with validation stage data for more sensitive hypothesis testing and more accurate point estimation of the treatment effect. Our method is widely applicable to existing online controlled experimentation systems.
An experimental evaluation of bidders' behavior in ad auctions BIBAFull-Text 619-630
  Gali Noti; Noam Nisan; Ilan Yaniv
We performed controlled experiments of human participants in a continuous sequence of ad auctions, similar to those used by Internet companies. The goal of the research was to understand users' strategies in making bids. We studied the behavior under two auction types: (1) the Generalized Second-Price (GSP) auction and (2) the Vickrey -- Clarke -- Groves (VCG) payment rule, and manipulated also the participants' knowledge conditions: (1) explicitly given valuations and (2) payoff information from which valuations could be deduced. We found several interesting behaviors, among them are: No convergence to equilibrium was detected; moreover the frequency with which participants modified their bids increased with time. We can detect explicit "better-response" behavior rather than just mixed bidding. While bidders in GSP auctions do strategically shade their bids, they tend to bid higher than theoretically predicted by the standard VCG-like equilibrium of GSP. Bidders who are not explicitly given their valuations but can only deduce them from their gains behave a little less "precisely" than those with such explicit knowledge, but mostly during an initial learning phase. VCG and GSP yield approximately the same (high) social welfare, but GSP tends to give higher revenue.

Social networks 2

Chaff from the wheat: characterization and modeling of deleted questions on stack overflow BIBAFull-Text 631-642
  Denzil Correa; Ashish Sureka
Stack Overflow is the most popular Community based Question Answering (CQA) website for programmers on the web with 2.05M users, 5.1M questions and 9.4M answers. Stack Overflow has explicit, detailed guidelines on how to post questions and an ebullient moderation community. Despite these precise communications and safeguards, questions posted on Stack Overflow can be extremely off topic or very poor in quality. Such questions can be deleted from Stack Overflow at the discretion of experienced community members and moderators. We present the first study of deleted questions on Stack Overflow. We divide our study into two parts -- (i) Characterization of deleted questions over 5 years (2008-2013) of data, (ii) Prediction of deletion at the time of question creation. Our characterization study reveals multiple insights on question deletion phenomena. We find that it takes substantial time to vote a question to be deleted but once voted, the community takes swift action. We also see that question authors delete their questions to salvage reputation points. We notice some instances of accidental deletion of good quality questions but such questions are voted back to be undeleted quickly. We discover a pyramidal structure of question quality on Stack Overflow and find that deleted questions lie at the bottom (lowest quality) of the pyramid. We also build a predictive model to detect the deletion of question at the creation time. We experiment with 47 features -- based on User Profile, Community Generated, Question Content and Syntactic style -- and report an accuracy of 66%. Our findings reveal important suggestions for content quality maintenance on community based question answering websites. To the best of our knowledge, this is the first large scale study on poor quality (deleted) questions on Stack Overflow.
Timeline generation: tracking individuals on Twitter BIBAFull-Text 643-652
  Jiwei Li; Claire Cardie
In this paper, we preliminarily learn the problem of reconstructing users' life history based on the their Twitter stream and proposed an unsupervised framework that create a chronological list for personal important events (PIE) of individuals. By analyzing individual tweet collections, we find that what are suitable for inclusion in the personal timeline should be tweets talking about personal (as opposed to public) and time-specific (as opposed to time-general) topics. To further extract these types of topics, we introduce a non-parametric multi-level Dirichlet Process model to recognize four types of tweets: personal time-specific (PersonTS), personal time-general (PersonTG), public time-specific (PublicTS) and public time-general (PublicTG) topics, which, in turn, are used for further personal event extraction and timeline generation. To the best of our knowledge, this is the first work focused on the generation of timeline for individuals from Twitter data. For evaluation, we have built gold standard timelines that contain PIE related events from 20 ordinary twitter users and 20 celebrities. Experimental results demonstrate that it is feasible to automatically extract chronological timelines for Twitter users from their tweet collection.
Modeling and predicting the growth and death of membership-based websites BIBAFull-Text 653-664
  Bruno Ribeiro
Driven by outstanding success stories of Internet startups such as Facebook and The Huffington Post, recent studies have thoroughly described their growth. These highly visible online success stories, however, overshadow an untold number of similar ventures that fail. The study of website popularity is ultimately incomplete without general mechanisms that can describe both successes and failures. In this work we present six years of the daily number of users (DAU) of twenty-two membership-based websites -- encompassing online social networks, grassroots movements, online forums, and membership-only Internet stores -- well balanced between successes and failures. We then propose a combination of reaction-diffusion-decay processes whose resulting equations seem not only to describe well the observed DAU time series but also provide means to roughly predict their evolution. This model allows an approximate automatic DAU-based classification of websites into self-sustainable v.s. unsustainable and whether the startup growth is mostly driven by marketing & media campaigns or word-of-mouth adoptions.

The future

Word storms: multiples of word clouds for visual comparison of documents BIBAFull-Text 665-676
  Quim Castellà; Charles Sutton
Word clouds are popular for visualizing documents, but are not as useful for comparing documents, because identical words are not presented consistently across different clouds. We introduce the concept of word storms, a visualization tool for analyzing corpora of documents. A word storm is a group of word clouds, in which each cloud represents a single document, juxtaposed to allow the viewer to compare and contrast the documents. We present a novel algorithm that creates a coordinated word storm, in which words that appear in multiple documents are placed in the same location, using the same color and orientation, across clouds. This ensures that similar documents are represented by similar-looking word clouds, making them easier to compare and contrast visually. We evaluate the algorithm using an automatic evaluation based on document classification, and a user study. The results confirm that a coordinated word storm allows for better visual comparison of documents.
Exploring the filter bubble: the effect of using recommender systems on content diversity BIBAFull-Text 677-686
  Tien T. Nguyen; Pik-Mai Hui; F. Maxwell Harper; Loren Terveen; Joseph A. Konstan
Eli Pariser coined the term 'filter bubble' to describe the potential for online personalization to effectively isolate people from a diversity of viewpoints or content. Online recommender systems -- built on algorithms that attempt to predict which items users will most enjoy consuming -- are one family of technologies that potentially suffers from this effect. Because recommender systems have become so prevalent, it is important to investigate their impact on users in these terms. This paper examines the longitudinal impacts of a collaborative filtering-based recommender system on users. To the best of our knowledge, it is the first paper to measure the filter bubble effect in terms of content diversity at the individual level. We contribute a novel metric to measure content diversity based on information encoded in user-generated tags, and we present a new set of methods to examine the temporal effect of recommender systems on the user experience. We do find that recommender systems expose users to a slightly narrowing set of items over time. However, we also see evidence that users who actually consume the items recommended to them experience lessened narrowing effects and rate items more positively.
Engaging with massive online courses BIBAFull-Text 687-698
  Ashton Anderson; Daniel Huttenlocher; Jon Kleinberg; Jure Leskovec
The Web has enabled one of the most visible recent developments in education -- the deployment of massive open online courses. With their global reach and often staggering enrollments, MOOCs have the potential to become a major new mechanism for learning. Despite this early promise, however, MOOCs are still relatively unexplored and poorly understood.
   In a MOOC, each student's complete interaction with the course materials takes place on the Web, thus providing a record of learner activity of unprecedented scale and resolution. In this work, we use such trace data to develop a conceptual framework for understanding how users currently engage with MOOCs. We develop a taxonomy of individual behavior, examine the different behavioral patterns of high- and low-achieving students, and investigate how forum participation relates to other parts of the course.
   We also report on a large-scale deployment of badges as incentives for engagement in a MOOC, including randomized experiments in which the presentation of badges was varied across sub-populations. We find that making badges more salient produced increases in forum engagement.

Internet economics & monetization 2

User satisfaction in competitive sponsored search BIBAFull-Text 699-710
  David Kempe; Brendan Lucier
We present a model of competition between web search algorithms, and study the impact of such competition on user welfare. In our model, search providers compete for customers by strategically selecting which search results to display in response to user queries. Customers, in turn, have private preferences over search results and will tend to use search engines that are more likely to display pages satisfying their demands. Our main question is whether competition between search engines increases the overall welfare of the users (i.e., the likelihood that a user finds a page of interest). When search engines derive utility only from customers to whom they show relevant results, we show that they differentiate their results, and every equilibrium of the resulting game achieves at least half of the welfare that could be obtained by a social planner. This bound also applies whenever the likelihood of selecting a given engine is a convex function of the probability that a user's demand will be satisfied, which includes natural Markovian models of user behavior.
   On the other hand, when search engines derive utility from all customers (independent of search result relevance) and the customer demand functions are not convex, there are instances in which the (unique) equilibrium involves no differentiation between engines and a high degree of randomness in search results. This can degrade social welfare by a factor of Ω(√{NUMPAGES}) relative to the social optimum, where NUMPAGES is the number of webpages. These bad equilibria persist even when search engines can extract only small (but non-zero) expected revenue from dissatisfied users, and much higher revenue from satisfied ones.
Price competition in online combinatorial markets BIBAFull-Text 711-722
  Moshe Babaioff; Noam Nisan; Renato Paes Leme
We consider a single buyer with a combinatorial preference that would like to purchase related products and services from different vendors, where each vendor supplies exactly one product. We study the general case where subsets of products can be substitutes as well as complementary and analyze the game that is induced on the vendors, where a vendor's strategy is the price that he asks for his product. This model generalizes both Bertrand competition (where vendors are perfect substitutes) and Nash bargaining (where they are perfect complements), and captures a wide variety of scenarios that can appear in complex crowd sourcing or in automatic pricing of related products.
   We study the equilibria of such games and show that a pure efficient equilibrium always exists. In the case of submodular buyer preferences we fully characterize the set of pure Nash equilibria, essentially showing uniqueness. For the even more restricted "substitutes" buyer preferences we also prove uniqueness over mixed equilibria. Finally we begin the exploration of natural generalizations of our setting such as when services have costs, when there are multiple buyers or uncertainty about the buyer's valuation, and when a single vendor supplies multiple products.
Revenue monotone mechanisms for online advertising BIBAFull-Text 723-734
  Gagan Goel; Mohammad Reza Khani
Online advertising is an essential part of the Internet and the main source of revenue for many web-centric firms such as search engines, social networks, and online publishers. A key component of online advertising is the auction mechanism which selects and prices the set of winning ads. This work is inspired by one of the biggest practical drawbacks of the widely popular Vickrey-Clarke-Groves (VCG) mechanism, which is the unique incentive-compatible mechanism that maximizes social welfare. It is known that VCG lacks a desired property of revenue monotonicity -- a natural notion which states that the revenue of a mechanism shouldn't go down as the number of bidders increase or if the bidders increase their bids. Most firms which depend on online advertising revenue have a large sales team to attract more bidders on their inventory as the general belief is that more bidders will increase competition, and hence revenue. However, the lack of revenue monotonicity of VCG conflicts with this general belief and can be strategically confusing for the firm's business.
   In this work, we seek incentive-compatible mechanisms that are revenue-monotone. This natural property comes at the expense of social welfare -- one can show that it is not possible to get incentive-compatibility, revenue-monotonicity, and optimal social welfare simultaneously. In light of this, we introduce the notion of Price of Revenue Monotonicity (PoRM) to capture the loss in social welfare of a revenue-monotone mechanism.
   We further study revenue-monotonicity for two important online advertising scenarios. First one is the text vs image ad auction where in an ad slot, one can either show a single image ad or a few text ads. Second one is the video-pod auction where we have a video advertising slot of k seconds which can be filled with multiple video ads. For the image-text auction, we give a mechanism that satisfy both RM and IC and achieve PoRM of Σi=1k 1/i ≈ ln k. We also show that the PoRM of our mechanism is the best possible by proving a matching lower bound of Σi=1k 1/i on the PoRM of any deterministic mechanism under some mild assumptions. For the video-pod auction, we give a mechanism that achieves a PoRM of (⌊ log k ⌋ + 1) ⋅ (2 + ln k).

Semantic web 2

Semantic stability in social tagging streams BIBAFull-Text 735-746
  Claudia Wagner; Philipp Singer; Markus Strohmaier; Bernardo A. Huberman
One potential disadvantage of social tagging systems is that due to the lack of a centralized vocabulary, a crowd of users may never manage to reach a consensus on the description of resources (e.g., books, users or songs) on the Web. Yet, previous research has provided interesting evidence that the tag distributions of resources may become semantically stable over time as more and more users tag them. At the same time, previous work has raised an array of new questions such as: (i) How can we assess the semantic stability of social tagging systems in a robust and methodical way? (ii) Does semantic stabilization of tags vary across different social tagging systems and ultimately, (iii) what are the factors that can explain semantic stabilization in such systems? In this work we tackle these questions by (i) presenting a novel and robust method which overcomes a number of limitations in existing methods, (ii) empirically investigating semantic stabilization processes in a wide range of social tagging systems with distinct domains and properties and (iii) detecting potential causes for semantic stabilization, specifically imitation behavior, shared background knowledge and intrinsic properties of natural language. Our results show that tagging streams which are generated by a combination of imitation dynamics and shared background knowledge exhibit faster and higher semantic stability than tagging streams which are generated via imitation dynamics or natural language phenomena alone.
Test-driven evaluation of linked data quality BIBAFull-Text 747-758
  Dimitris Kontokostas; Patrick Westphal; Sören Auer; Sebastian Hellmann; Jens Lehmann; Roland Cornelissen; Amrapali Zaveri
Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics.
Don't like RDF reification?: making statements about statements using singleton property BIBAFull-Text 759-770
  Vinh Nguyen; Olivier Bodenreider; Amit Sheth
Statements about RDF statements, or meta triples, provide additional information about individual triples, such as the source, the occurring time or place, or the certainty. Integrating such meta triples into semantic knowledge bases would enable the querying and reasoning mechanisms to be aware of provenance, time, location, or certainty of triples. However, an efficient RDF representation for such meta knowledge of triples remains challenging. The existing standard reification approach allows such meta knowledge of RDF triples to be expressed using RDF by two steps. The first step is representing the triple by a Statement instance which has subject, predicate, and object indicated separately in three different triples. The second step is creating assertions about that instance as if it is a statement. While reification is simple and intuitive, this approach does not have formal semantics and is not commonly used in practice as described in the RDF Primer. In this paper, we propose a novel approach called Singleton Property for representing statements about statements and provide a formal semantics for it. We explain how this singleton property approach fits well with the existing syntax and formal semantics of RDF, and the syntax of SPARQL query language. We also demonstrate the use of singleton property in the representation and querying of meta knowledge in two examples of Semantic Web knowledge bases: YAGO2 and BKR. Our experiments on the BKR show that the singleton property approach gives a decent performance in terms of number of triples, query length and query execution time compared to existing approaches. This approach, which is also simple and intuitive, can be easily adopted for representing and querying statements about statements in other knowledge bases.

Web mining 3

Comment-based multi-view clustering of web 2.0 items BIBAFull-Text 771-782
  Xiangnan He; Min-Yen Kan; Peichu Xie; Xiao Chen
Clustering Web 2.0 items (i.e., web resources like videos, images) into semantic groups benefits many applications, such as organizing items, generating meaningful tags and improving web search. In this paper, we systematically investigate how user-generated comments can be used to improve the clustering of Web 2.0 items. In our preliminary study of Last.fm, we find that the two data sources extracted from user comments -- the textual comments and the commenting users -- provide complementary evidence to the items' intrinsic features. These sources have varying levels of quality, but we importantly we find that incorporating all three sources improves clustering. To accommodate such quality imbalance, we invoke multi-view clustering, in which each data source represents a view, aiming to best leverage the utility of different views.
   To combine multiple views under a principled framework, we propose CoNMF (Co-regularized Non-negative Matrix Factorization), which extends NMF for multi-view clustering by jointly factorizing the multiple matrices through co-regularization. Under our CoNMF framework, we devise two paradigms -- pair-wise CoNMF and cluster-wise CoNMF -- and propose iterative algorithms for their joint factorization. Experimental results on Last.fm and Yelp datasets demonstrate the effectiveness of our solution. In Last.fm, CoNMF betters k-means with a statistically significant F1 increase of 14%, while achieving comparable performance with the state-of-the-art multi-view clustering method CoSC (Co-regularized Spectral Clustering). On a Yelp dataset, CoNMF outperforms the best baseline CoSC with a statistically significant performance gain of 7%.
Finding progression stages in time-evolving event sequences BIBAFull-Text 783-794
  Jaewon Yang; Julian McAuley; Jure Leskovec; Paea LePendu; Nigam Shah
Event sequences, such as patients' medical histories or users' sequences of product reviews, trace how individuals progress over time. Identifying common patterns, or progression stages, in such event sequences is a challenging task because not every individual follows the same evolutionary pattern, stages may have very different lengths, and individuals may progress at different rates. In this paper, we develop a model-based method for discovering common progression stages in general event sequences. We develop a generative model in which each sequence belongs to a class, and sequences from a given class pass through a common set of stages, where each sequence evolves at its own rate. We then develop a scalable algorithm to infer classes of sequences, while also segmenting each sequence into a set of stages. We evaluate our method on event sequences, ranging from patients' medical histories to online news and navigational traces from the Web. The evaluation shows that our methodology can predict future events in a sequence, while also accurately inferring meaningful progression stages, and effectively grouping sequences based on common progression patterns. More generally, our methodology allows us to reason about how event sequences progress over time, by discovering patterns and categories of temporal evolution in large-scale datasets of events.
On estimating the average degree BIBAFull-Text 795-806
  Anirban Dasgupta; Ravi Kumar; Tamas Sarlos
Networks are characterized by nodes and edges. While there has been a spate of recent work on estimating the number of nodes in a network, the edge-estimation question appears to be largely unaddressed. In this work we consider the problem of estimating the average degree of a large network using efficient random sampling, where the number of nodes is not known to the algorithm. We propose a new estimator for this problem that relies on access to node samples under a prescribed distribution. Next, we show how to efficiently realize this ideal estimator in a random walk setting. Our estimator has a natural and simple implementation using random walks; we bound its performance in terms of the mixing time of the underlying graph. We then show that our estimators are both provably and practically better than many natural estimators for the problem. Our work contrasts with existing theoretical work on estimating average degree, which assume that a uniform random sample of nodes is available and the number of nodes is known.

Social networks 3 -- modeling influences in graphs

Who proposed the relationship?: recovering the hidden directions of undirected social networks BIBAFull-Text 807-818
  Jun Zhang; Chaokun Wang; Jianmin Wang
Together with the sign (positive or negative) and strength (strong or weak), the directionality is also an important property of social ties, though usually ignored in undirected social networks for its invisibility. However, we believe most social ties are natively directed, and the awareness of directionality can improve our understanding about the network structures and further benefit social network analysis and mining tasks. Thus it's appealing to study whether there exist interesting patterns about directionality in social networks and whether we can learn the directions for undirected networks based on these patterns. In this study, we engage in the investigation of directionality patterns on real-world directed social networks and summarize our findings using four consistency hypotheses. Based on these hypotheses, we propose ReDirect, an optimization framework which makes it possible to infer the hidden directions of undirected social ties based on the network topology only. This general framework can incorporate various predictive models under specific scenarios. Furthermore, we show how to improve ReDirect by introducing semi/self-supervision in the framework and how to construct the self-labeled training data using simple but effective heuristics. Experimental results show that even without external information, our approach can recover the directions of networks effectively.
   Moreover, we're quite surprising to find that ReDirect can benefit predictive tasks remarkably, with a case study of link prediction. In experiments the redirected networks inferred using ReDirect are proven much more informative than original undirected ones and can improve the prediction performance significantly. It convinces us that ReDirect can be a beneficial general data preprocess tool for various network analysis and mining tasks by uncovering the hidden directions of undirected social networks.
User profiling in an ego network: co-profiling attributes and relationships BIBAFull-Text 819-830
  Rui Li; Chi Wang; Kevin Chen-Chuan Chang
User attributes, such as occupation, education, and location, are important for many applications. In this paper, we study the problem of profiling user attributes in social network. To capture the correlation between attributes and social connections, we present a new insight that social connections are discriminatively correlated with attributes via a hidden factor -- relationship type. For example, a user's colleagues are more likely to share the same employer with him than other friends. Based on the insight, we propose to co-profile users' attributes and relationship types of their connections. To achieve co-profiling, we develop an efficient algorithm based on an optimization framework. Our algorithm captures our insight effectively. It iteratively profiles attributes by propagation via certain types of connections, and profiles types of connections based on attributes and the network structure. We conduct extensive experiments to evaluate our algorithm. The results show that our algorithm profiles various attributes accurately, which improves the state-of-the-art methods by 12%.
Attributed graph models: modeling network structure with correlated attributes BIBAFull-Text 831-842
  Joseph J., III Pfeiffer; Sebastian Moreno; Timothy La Fond; Jennifer Neville; Brian Gallagher
Online social networks have become ubiquitous to today's society and the study of data from these networks has improved our understanding of the processes by which relationships form. Research in statistical relational learning focuses on methods to exploit correlations among the attributes of linked nodes to predict user characteristics with greater accuracy. Concurrently, research on generative graph models has primarily focused on modeling network structure without attributes, producing several models that are able to replicate structural characteristics of networks such as power law degree distributions or community structure. However, there has been little work on how to generate networks with real-world structural properties and correlated attributes. In this work, we present the Attributed Graph Model (AGM) framework to jointly model network structure and vertex attributes. Our framework learns the attribute correlations in the observed network and exploits a generative graph model, such as the Kronecker Product Graph Model (KPGM) and Chung Lu Graph Model (CL), to compute structural edge probabilities. AGM then combines the attribute correlations with the structural probabilities to sample networks conditioned on attribute values, while keeping the expected edge probabilities and degrees of the input graph model. We outline an efficient method for estimating the parameters of AGM, as well as a sampling method based on Accept-Reject sampling to generate edges with correlated attributes. We demonstrate the efficiency and accuracy of our AGM framework on two large real-world networks, showing that AGM scales to networks with hundreds of thousands of vertices, as well as having high attribute correlation.

Content quality & popularity

WikiWho: precise and efficient attribution of authorship of revisioned content BIBAFull-Text 843-854
  Fabian Flöck; Maribel Acosta
Revisioned text content is present in numerous collaboration platforms on the Web, most notably Wikis. To track authorship of text tokens in such systems has many potential applications; the identification of main authors for licensing reasons or tracing collaborative writing patterns over time, to name some. In this context, two main challenges arise. First, it is critical for such an authorship tracking system to be precise in its attributions, to be reliable for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model to represent revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a Wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality and our solution achieves an average of 95% precision on this data set. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for the task, exceeding it by over 10% on average. Our approach outperforms the execution time of the state-of-the-art by one order of magnitude, as we demonstrate on a sample of over 240 English Wikipedia articles. We argue that the increased size of an optional materialization of our results by about 10% compared to the baseline is a favorable trade-off, given the large advantage in runtime performance.
What makes a good biography?: multidimensional quality analysis based on wikipedia article feedback data BIBAFull-Text 855-866
  Lucie Flekova; Oliver Ferschke; Iryna Gurevych
With more than 22 million articles, the largest collaborative knowledge resource never sleeps, experiencing several article edits every second. Over one fifth of these articles describes individual people, the majority of which are still alive. Such articles are, by their nature, prone to corruption and vandalism. Manual quality assurance by experts can barely cope with this massive amount of data. Can it be effectively replaced by feedback from the crowd? Can we provide meaningful support for quality assurance with automated text processing techniques? Which properties of the articles should then play a key role in the machine learning algorithms and why? In this paper, we study the user-perceived quality of Wikipedia articles based on a novel Wikipedia user feedback dataset. In contrast to previous work on quality assessment which mostly relied on judgements of active Wikipedia authors, we analyze ratings of ordinary Wikipedia users along four quality dimensions (Complete, Well written, Trustworthy and Objective). We first present an empirical analysis of the novel dataset with over 36 million Wikipedia article ratings. We then select a subset of biographical articles and perform classification experiments to predict their quality ratings along each of the dimensions, exploring multiple linguistic, surface and network properties of the rated articles. Additionally, we study the classification performance and differences for the biographies of living and dead people as well as those for men and women. We demonstrate the effectiveness of our approach by the F-scores of 0.94, 0.89, 0.73, and 0.73 for the dimensions Complete, Well written, Trustworthy, and Objective. Based on the results, we believe that the quality assessment of big textual data can be effectively supported by current text classification and language processing tools.
What makes an image popular? BIBAFull-Text 867-876
  Aditya Khosla; Atish Das Sarma; Raffay Hamid
Hundreds of thousands of photographs are uploaded to the internet every minute through various social networking and photo sharing platforms. While some images get millions of views, others are completely ignored. Even from the same users, different photographs receive different number of views. This begs the question: What makes a photograph popular? Can we predict the number of views a photograph will receive even before it is uploaded? These are some of the questions we address in this work. We investigate two key components of an image that affect its popularity, namely the image content and social context. Using a dataset of about 2.3 million images from Flickr, we demonstrate that we can reliably predict the normalized view count of images with a rank correlation of 0.81 using both image content and social cues. In this paper, we show the importance of image cues such as color, gradients, deep learning features and the set of objects present, as well as the importance of various social cues such as number of friends or number of photos uploaded that lead to high or low popularity of images.

Conflicts & games

STFU NOOB!: predicting crowdsourced decisions on toxic behavior in online games BIBAFull-Text 877-888
  Jeremy Blackburn; Haewoon Kwak
One problem facing players of competitive games is negative, or toxic, behavior. League of Legends, the largest eSport game, uses a crowdsourcing platform called the Tribunal to judge whether a reported toxic player should be punished or not. The Tribunal is a two stage system requiring reports from those players that directly observe toxic behavior, and human experts that review aggregated reports. While this system has successfully dealt with the vague nature of toxic behavior by majority rules based on many votes, it naturally requires tremendous cost, time, and human efforts. In this paper, we propose a supervised learning approach for predicting crowdsourced decisions on toxic behavior with large-scale labeled data collections; over 10 million user reports involved in 1.46 million toxic players and corresponding crowdsourced decisions. Our result shows good performance in detecting overwhelmingly majority cases and predicting crowdsourced decisions on them. We demonstrate good portability of our classifier across regions. Finally, we estimate the practical implications of our approach, potential cost savings and victim protection.
Unveiling group characteristics in online social games: a socio-economic analysis BIBAFull-Text 889-900
  Taejoong Chung; Jinyoung Han; Daejin Choi; Taekyoung Ted Kwon; Huy Kang Kim; Yanghee Choi
Understanding the group characteristics in MMORPGs is important in user behavior studies since people tend to gather together and form groups due to their inherent nature. In this paper, we analyze the group activities of users in Aion, one of the largest MMORPGs, based on the records of the activities of 94,497 users. In particular, we focus on (i) how social interactions within a group differ from the ones across groups, (ii) what makes a group rise, sustain, or fall, (iii) how group members join and leave a group, and (iv) what makes a group end. We first find that structural patterns of social interactions within a group are more likely to be close-knit and reciprocative than the ones across groups. We also observe that members in a rising group (i.e., the number of members increases) are more cohesive, and communicate with more evenly within the group than the ones in other groups. Our analysis further reveals that if a group is not cohesive, not actively communicating, or not evenly communicating among members, members of the group tend to leave.
XXXtortion?: inferring registration intent in the .XXX TLD BIBAFull-Text 901-912
  Tristan Halvorson; Kirill Levchenko; Stefan Savage; Geoffrey M. Voelker
After a decade-long approval process, multiple rejections, and an independent review, ICANN approved the xxx TLD for inclusion in the Domain Name System, to begin general availability on December 6, 2011. Its sponsoring registry proposed it as an expansion of the name space, as well as a way to separate adult from child-appropriate content. Many independent groups, including trademark holders, political groups, and the adult entertainment industry itself, were concerned that it would primarily generate value through defensive and speculative registrations, without actually serving a real need. This paper measures the validity of these concerns using data gathered from ICANN, whois, and Web requests. We use this information to characterize each xxx domain and infer the registrant's most likely intent. We find that at most 3.8% of xxx domains host or redirect to potentially legitimate Web content, with the rest generally serving either defensive or speculative purposes. Indeed, registrants spent roughly $13M up front to defend existing brands and trademarks within the xxx TLD, and an additional $11M over the course of the first year. Additional evidence suggests that over 80% of annual domain registrations are for purely defensive purposes and do not even resolve.

Social networks 4 -- diffusion

The bursty dynamics of the Twitter information network BIBAFull-Text 913-924
  Seth A. Myers; Jure Leskovec
In online social media systems users are not only posting, consuming, and resharing content, but also creating new and destroying existing connections in the underlying social network. While each of these two types of dynamics has individually been studied in the past, much less is known about the connection between the two. How does user information posting and seeking behavior interact with the evolution of the underlying social network structure? Here, we study ways in which network structure reacts to users posting and sharing content. We examine the complete dynamics of the Twitter information network, where users post and reshare information while they also create and destroy connections. We find that the dynamics of network structure can be characterized by steady rates of change, interrupted by sudden bursts. Information diffusion in the form of cascades of post re-sharing often creates such sudden bursts of new connections, which significantly change users' local network structure. We also explore the effect of the information content on the dynamics of the network and find evidence that the appearance of new topics and real-world events can lead to significant changes in edge creations and deletions. Lastly, we develop a model that quantifies the dynamics of the network and the occurrence of these bursts as a function of the information spreading through the network. The model can successfully predict which information diffusion events will lead to bursts in network dynamics.
Can cascades be predicted? BIBAFull-Text 925-936
  Justin Cheng; Lada Adamic; P. Alex Dow; Jon Michael Kleinberg; Jure Leskovec
On many social networking web sites such as Facebook and Twitter, resharing or reposting functionality allows users to share others' content with their own friends or followers. As content is reshared from user to user, large cascades of reshares can form. While a growing body of research has focused on analyzing and characterizing such cascades, a recent, parallel line of work has argued that the future trajectory of a cascade may be inherently unpredictable. In this work, we develop a framework for addressing cascade prediction problems. On a large sample of photo reshare cascades on Facebook, we find strong performance in predicting whether a cascade will continue to grow in the future. We find that the relative growth of a cascade becomes more predictable as we observe more of its reshares, that temporal and structural features are key predictors of cascade size, and that initially, breadth, rather than depth in a cascade is a better indicator of larger cascades. This prediction performance is robust in the sense that multiple distinct classes of features all achieve similar performance. We also discover that temporal features are predictive of a cascade's eventual shape. Observing independent cascades of the same content, we find that while these cascades differ greatly in size, we are still able to predict which ends up the largest.
How to influence people with partial incentives BIBAFull-Text 937-948
  Erik D. Demaine; MohammadTaghi Hajiaghayi; Hamid Mahini; David L. Malec; S. Raghavan; Anshul Sawant; Morteza Zadimoghadam
We study the power of fractional allocations of resources to maximize our influence in a network. This work extends in a natural way the well-studied model by Kleinberg, Kempe, and Tardos (2003), where a designer selects a (small) seed set of nodes in a social network to influence directly, this influence cascades when other nodes reach certain thresholds of neighbor influence, and the goal is to maximize the final number of influenced nodes. Despite extensive study from both practical and theoretical viewpoints, this model limits the designer to a binary choice for each node, with no chance to apply intermediate levels of influence. This model captures some settings precisely, such as exposure to an idea or pathogen, but it fails to capture very relevant concerns in others, for example, a manufacturer promoting a new product by distributing five "20% off" coupons instead of giving away a single free product.
   While fractional versions of problems tend to be easier to solve than integral versions, for influence maximization, we show that the two versions have essentially the same computational complexity. On the other hand, the two versions can have vastly different solutions: the added flexibility of fractional allocation can lead to significantly improved influence. Our main theoretical contribution is to show how to adapt the major positive results from the integral case to the fractional case. Specifically, Mossel and Roch (2006) used the submodularity of influence to obtain their integral results; we introduce a new notion of continuous submodularity, and use this to obtain matching fractional results. We conclude that we can achieve the same greedy (1-1/e-ε)-approximation for the fractional case as the integral case, and that other heuristics are likely to carry over as well. In practice, we find that the fractional model performs substantially better than the integral model, according to simulations on real-world social network data.

Web search 2

Fast topic discovery from web search streams BIBAFull-Text 949-960
  Di Jiang; Kenneth Wai-Ting Leung; Wilfred Ng
Web search involves voluminous data streams that record millions of users' interactions with the search engine. Recently latent topics in web search data have been found to be critical for a wide range of search engine applications such as search personalization and search history warehousing. However, the existing methods usually discover latent topics from web search data in an offline and retrospective fashion. Hence, they are increasingly ineffective in the face of the ever-increasing web search data that accumulate in the format of online streams. In this paper, we propose a novel probabilistic topic model, the Web Search Stream Model (WSSM), which is delicately calibrated for handling two salient features of the web search data: it is in the format of streams and in massive volume. We further propose an efficient parameter inference method, the Stream Parameter Inference (SPI) to efficiently train WSSM with massive web search streams. Based on a large-scale search engine query log, we conduct extensive experiments to verify the effectiveness and efficiency of WSSM and SPI. We observe that WSSM together with SPI discovers latent topics from web search streams faster than the state-of-the-art methods while retaining a comparable topic modeling accuracy.
A hierarchical Dirichlet model for taxonomy expansion for search engines BIBAFull-Text 961-970
  Jingjing Wang; Changsung Kang; Yi Chang; Jiawei Han
Emerging trends and products pose a challenge to modern search engines since they must adapt to the constantly changing needs and interests of users. For example, vertical search engines, such as Amazon, eBay, Walmart, Yelp and Yahoo! Local, provide business category hierarchies for people to navigate through millions of business listings. The category information also provides important ranking features that can be used to improve search experience. However, category hierarchies are often manually crafted by some human experts and they are far from complete. Manually constructed category hierarchies cannot handle the ever-changing and sometimes long-tail user information needs. In this paper, we study the problem of how to expand an existing category hierarchy for a search/navigation system to accommodate the information needs of users more comprehensively. We propose a general framework for this task, which has three steps: 1) detecting meaningful missing categories; 2) modeling the category hierarchy using a hierarchical Dirichlet model and predicting the optimal tree structure according to the model; 3) reorganizing the corpus using the complete category structure, i.e., associating each webpage with the relevant categories from the complete category hierarchy. Experimental results demonstrate that our proposed framework generates a high-quality category hierarchy and significantly boosts the retrieval performance.
Recent and robust query auto-completion BIBAFull-Text 971-982
  Stewart Whiting; Joemon M. Jose
Query auto-completion (QAC) is a common interactive feature that assists users in formulating queries by providing completion suggestions as they type. In order for QAC to minimise the user's cognitive and physical effort, it must: (i) suggest the user's intended query after minimal input keystrokes, and (ii) rank the user's intended query highly in completion suggestions. Typically, QAC approaches rank completion suggestions by their past popularity. Accordingly, QAC is usually very effective for previously seen and consistently popular queries. Users are increasingly turning to search engines to find out about unpredictable emerging and ongoing events and phenomena, often using previously unseen or unpopular queries. Consequently, QAC must be both robust and time-sensitive -- that is, able to sufficiently rank both consistently and recently popular queries in completion suggestions. To address this trade-off, we propose several practical completion suggestion ranking approaches, including: (i) a sliding window of query popularity evidence from the past 2-28 days, (ii) the query popularity distribution in the last N queries observed with a given prefix, and (iii) short-range query popularity prediction based on recently observed trends. Using real-time simulation experiments, we extensively investigated the parameters necessary to maximise QAC effectiveness for three openly available query log datasets with prefixes of 2-5 characters: MSN and AOL (both English), and Sogou 2008 (Chinese). Optimal parameters vary for each query log, capturing the differing temporal dynamics and querying distributions. Results demonstrate consistent and language-independent improvements of up to 9.2% over a non-temporal QAC baseline for all query logs with prefix lengths of 2-3 characters. This work is an important step towards more effective QAC approaches.