HCI Bibliography Home | HCI Conferences | CIKM Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
CIKM Tables of Contents: 0809101112131415

Proceedings of the 2015 ACM Conference on Information and Knowledge Management

Fullname:Proceedings of the 24th ACM International Conference on Conference on Information and Knowledge Management
Editors:James Bailey; Alistair Moffat; Charu C. Aggarwal; Maarten de Rijke; Ravi Kumar; Vanessa Murdock; Timos Sellis; Jeffrey Xu Yu
Location:Melbourne, Australia
Dates:2015-Oct-19 to 2015-Oct-23
Standard No:ISBN: 978-1-4503-3794-6; ACM DL: Table of Contents; hcibib: CIKM15
Links:Conference Website
  1. Keynote Address I
  2. Session 1A: Scalability
  3. Session 1B: Personal Search
  4. Session 1C: Learning
  5. Session 1D: Text Processing
  6. Session 1E: Applications
  7. Session 1F: Social Media 1
  8. Session 2A: Graphs
  9. Session 2B: Retrieval Algorithms
  10. Session 2C: Text Analysis
  11. Session 2D: Clustering
  12. Session 2E: Users and Predictions
  13. Session 2F: Heterogeneous Networks
  14. Session 3A: Veracity
  15. Session 3B: Social Networks 1
  16. Session 3C: Query Completion
  17. Session 3D: Microblogs
  18. Session 3E: Graph-Based Analysis
  19. Session 3F: Classification 1
  20. Keynote Address II
  21. Session 4A: Location-Based Services
  22. Session 4B: Query Explanation
  23. Session 4C: Crowds
  24. Session 4D: Optimization
  25. Session 4E: Social Networks 2
  26. Session 4F: Matrix Factorization
  27. Session 5A: Trips and Trajectories
  28. Session 5B: Retrieval Enhancements 1
  29. Session 5C: Privacy
  30. Session 5D: Data Streams
  31. Session 5E: Classification 2
  32. Session 5F: Sentiment and Content Analysis
  33. Session 6A: Time Series and Streams
  34. Session 6B: Adaptive Learning
  35. Session 6C: Points-of-Interest
  36. Session 6D: Matrices
  37. Session 6E: Citation Networks
  38. Session 6F: Knowledge Bases
  39. Keynote Address III
  40. Session 7A: Database Optimization
  41. Session 7B: Retrieval Enhancements 2
  42. Session 7C: Search Mechanisms
  43. Session 7D: Social Networks 3
  44. Session 8A: Query Evaluation
  45. Session 8B: Web Search
  46. Session 8C: Social Media 2
  47. Session 8D: Recommendation
  48. Short Papers: Databases
  49. Short Papers: Information Retrieval
  50. Short Papers: Knowledge Management
  51. Workshop Reports

Keynote Address I

Slow Search: Improving Information Retrieval Using Human Assistance BIBAFull-Text 1
  Jaime Teevan
We live in a world where the pace of everything from communication to transportation is getting faster. In recent years a number of "slow movements" have emerged that advocate for reducing speed in exchange for increasing quality, including the slow food movement, slow parenting, slow travel, and even slow science. We propose the concept of "slow search," where search engines use additional time to provide a higher quality search experience than is possible given conventional time constraints. While additional time can be used to identify relevant results within the existing search engine framework, it can also be used to create new search artifacts and enable previously unimaginable user experiences. This talk will focus on how search engines can make use of additional time to employ a resource that is inherently slow: other people. Using crowdsourcing and friendsourcing, it will highlight opportunities for search systems to support new search experiences with high quality result content that takes time to identify.

Session 1A: Scalability

External Data Access And Indexing In AsterixDB BIBAFull-Text 3-12
  Abdullah A. Alamoudi; Raman Grover; Michael J. Carey; Vinayak Borkar
Traditional database systems offer rich query interfaces (SQL) and efficient query execution for data that they store. Recent years have seen the rise of Big Data analytics platforms offering query-based access to "raw" external data, e.g., file-resident data (often in HDFS). In this paper, we describe techniques to achieve the qualities offered by DBMSs when accessing external data. This work has been built into Apache AsterixDB, an open source Big Data Management System. We describe how we build distributed indexes over external data, partition external indexes, provide query consistency across access paths, and manage external indexes amidst concurrent activities. We compare the performance of this new AsterixDB capability to an external-only solution (Hive) and to its internally managed data and indexes.
Dynamic Resource Management In a Massively Parallel Stream Processing Engine BIBAFull-Text 13-22
  Kasper Grud Skat Madsen; Yongluan Zhou
The emerging interest in Massively Parallel Stream Processing Engines (MPSPEs), which are able to process long-standing computations over data streams with ever-growing velocity at a large-scale cluster, calls for efficient dynamic resource management techniques to avoid any waste of resources and/or excessive processing latency. In this paper, we propose an approach to integrate dynamic resource management with passive fault-tolerance mechanisms in a MPSPE so that we can harvest the checkpoints prepared for failure recovery to enhance the efficiency of dynamic load migrations. To maximize the opportunity of reusing checkpoints for fast load migration, we formally define a checkpoint allocation problem and provide a pragmatic algorithm to solve it. We implement all the proposed techniques on top of Apache Storm, an open-source MPSPE, and conduct extensive experiments using a real dataset to examine various aspects of our techniques. The results show that our techniques can greatly improve the efficiency of dynamic resource reconfiguration without imposing significant overhead or latency to the normal job execution.
A Parallel GPU-Based Approach to Clustering Very Fast Data Streams BIBAFull-Text 23-32
  Pengtao Huang; Xiu Li; Bo Yuan
Clustering data streams has become a hot topic in the era of big data. Driven by the ever increasing volume, velocity and variety of data, more efficient algorithms for clustering large-scale complex data streams are needed. In this paper, we present a parallel algorithm called PaStream, which is based on advanced Graphics Processing Unit (GPU) and follows the online-offline framework of CluStream. Our approach can achieve hundreds of times speedup on high-speed and high-dimensional data streams compared with CluStream. It can also discover clusters with arbitrary shapes and handle outliers properly. The efficiency and scalability of PaStream are demonstrated through comprehensive experiments on synthetic and standard benchmark datasets with various problem factors.
Scalable Clustering Algorithm via a Triangle Folding Processing for Complex Networks BIBAFull-Text 33-42
  Ying Kang; Xiaoyan Gu; Weiping Wang; Dan Meng
Facing up to the incessant growth of complex networks, more and more researchers start turning to a multilevel computing paradigm with high scalability for clustering. By virtue of iterative coarsening level by level, the clustering results which are obtained from the coarsest network and then projected to the original network, is superior to the ones from mining the original complex network explicitly. Empirical works reflect that the local-aggregation characteristic is a key point for multilevel clustering algorithms, thus techniques like modularity, label propagation etc. are used to discover the micro-clusters for coarsening. In this paper, we propose a scalable clustering algorithm via a triangle folding processing for complex networks (SCAFT). Based on the strong cluster property of triangle, we fold each traversed triangle of the network into a superverex to realize coarsening. And each generated coarsened network by iteration is capable of reserving the cluster structures of last level network, or even the intrinsic cluster structures of original complex network, improving the computational accuracy. What's more, a streaming algorithm is embedded in our novel approach to generate a serial input sequence of vertices, reducing the heavy burdens of memory usage of system. Experimental results on real-world complex networks show that, SCAFT outperforms the state-of-the-art multilevel clustering algorithms in terms of clustering accuracy, running time, especially in memory usage.

Session 1B: Personal Search

Understanding the Impact of the Role Factor in Collaborative Information Retrieval BIBAFull-Text 43-52
  Lynda Tamine; Laure Soulier
Collaborative information retrieval systems often rely on division of labor policies. Such policies allow work to be divided among collaborators with the aim of preventing redundancy and optimizing the synergic effects of collaboration. Most of the underlying methods achieve these goals by the means of explicit vs. implicit role-based mediation. In this paper, we investigate whether and how different factors, such as users' behavior, search strategies, and effectiveness, are related to role assignment within a collaborative exploratory search. Our main findings suggest that: (1) spontaneous and cohesive implicit roles might emerge during the collaborative search session implying users with no prior roles, and that these implicit roles favor the search precision, (2) role drift might occur alongside the search session performed by users with prior-assigned roles.
Experiments with a Venue-Centric Model for Personalised and Time-Aware Venue Suggestion BIBAFull-Text 53-62
  Romain Deveaud; M-Dyaa Albakour; Craig Macdonald; Iadh Ounis
Location-based social networks (LBSNs), such as Foursquare, fostered the emergence of new tasks such as recommending venues a user might wish to visit. In the literature, recommending venues has typically been addressed using user-centric recommendation approaches relying on collaborative filtering techniques. Such approaches not only require many users with detailed profiles to be effective, but they also cannot recommend venues to users who are not actually members of the LBSN. In contrast, in this paper, we introduce a venue-centric yet personalised probabilistic approach that suggests personalised and popular venues for users to visit in the near future. In our approach, we probabilistically incorporate two components, a popularity component for predicting the popularity of a venue at a given point in time, as estimated from the attendance of the venue in the LBSN (i.e. number of check-ins), and a personalisation component for identifying its interestingness with respect to the estimated preferences of the user. The popularity of each venue is predicted using time series forecasting models that are trained on the recent attendance trends of the venue, while the users' interests are modelled from the entity pages that they like on Facebook. Using three major cities, we conduct a user study to evaluate the effectiveness of the two components of our approach in suggesting venues for different types of users at different times of the day. Our experimental results show that an approach that combines the popularity and personalisation components is able to consistently outperform the recommendation service of the leading Foursquare LBSN. We also find that combining popularity and personalisation is effective for both new visitors and residents, while former visitors prefer popular venues.
Search Result Diversification Based on Hierarchical Intents BIBAFull-Text 63-72
  Sha Hu; Zhicheng Dou; Xiaojie Wang; Tetsuya Sakai; Ji-Rong Wen
A large percentage of queries issued to search engines are broad or ambiguous. Search result diversification aims to solve this problem, by returning diverse results that can fulfill as many different information needs as possible. Most existing intent-aware search result diversification algorithms formulate user intents for a query as a flat list of subtopics. In this paper, we introduce a new hierarchical structure to represent user intents and propose two general hierarchical diversification models to leverage hierarchical intents. Experimental results show that our hierarchical diversification models outperform state-of-the-art diversification methods that use traditional flat subtopics.
Category-Driven Approach for Local Related Business Recommendations BIBAFull-Text 73-82
  Yonathan Perez; Michael Schueppert; Matthew Lawlor; Shaunak Kishore
When users search online for a business, the search engine may present them with a list of related business recommendations. We address the problem of constructing a useful and diverse list of such recommendations that would include an optimal combination of substitutes and complements. Substitutes are similar potential alternatives to the searched business, whereas complements are local businesses that can offer a more comprehensive and better rounded experience for a user visiting the searched locality. In our problem setting, each business belongs to a category in an ontology of business categories. Two businesses are defined as substitutes of one another if they belong to the same category, and as complements if they are otherwise relevant to each other.
   We empirically demonstrate that the related business recommendation lists generated by Google's search engine are too homogeneous, and overemphasize substitutes. We then use various data sources such as crowdsourcing, mobile maps directions queries, and the existing Google's related business graph to mine association rules to determine to which extent do categories complement each other, and establish relevance between businesses, using both category-level and individual business-level information. We provide an algorithmic approach that incorporates these signals to produce a list of recommended businesses that balances pairwise business relevance with overall diversity of the list. Finally, we use human raters to evaluate our system, and show that it significantly improves on the current Google system in usefulness of the generated recommendation lists.

Session 1C: Learning

A Soft Computing Approach for Learning to Aggregate Rankings BIBAFull-Text 83-92
  Javier Alvaro Vargas Muñoz; Ricardo da Silva Torres; Marcos André Gonçalves
This paper presents an approach to combine rank aggregation techniques using a soft computing technique -- Genetic Programming -- in order to improve the results in Information Retrieval tasks. Previous work shows that by combining rank aggregation techniques in an agglomerative way, it is possible to get better results than with individual methods. However, these works either combine only a small set of lists or are performed in a completely ad-hoc way. Therefore, given a set of ranked lists and a set of rank aggregation techniques, we propose to use a supervised genetic programming approach to search combinations of them that maximize effectiveness in large search spaces. Experimental results conducted using four datasets with different properties show that our proposed approach reaches top performance in most datasets. Moreover, this cross-dataset performance is not matched by any other baseline among the many we experiment with, some being the state-of-the-art in learning-to-rank and in the supervised rank aggregation tasks. We also show that our proposed framework is very efficient, flexible, and scalable.
Approximate String Matching by End-Users using Active Learning BIBAFull-Text 93-102
  Lutz Büch; Artur Andrzejak
Identifying approximately identical strings is key for many data cleaning and data integration processes, including similarity join and record matching. The accuracy of such tasks crucially depends on appropriate choices of string similarity measures and thresholds for the particular dataset. Manual selection of similarity measures and thresholds is infeasible. Other approaches rely on the existence of adequate historic ground-truth or massive manual effort.
   To address this problem, we propose an Active Learning algorithm which selects a best performing similarity measure in a given set while optimizing a decision threshold. Active Learning minimizes the number of user queries needed to arrive at an appropriate classifier. Queries require only the label match/no match, which end users can easily provide in their domain. Evaluation on well-known string matching benchmark data sets shows that our approach achieves highly accurate results with a small amount of manual labeling required.
A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank BIBAFull-Text 103-112
  Shoaib Jameel; Wai Lam; Steven Schockaert; Lidong Bing
While most methods for learning-to-rank documents only consider relevance scores as features, better results can often be obtained by taking into account the latent topic structure of the document collection. Existing approaches that consider latent topics follow a two-stage approach, in which topics are discovered in an unsupervised way, as usual, and then used as features for the learning-to-rank task. In contrast, we propose a learning-to-rank framework which integrates the supervised learning of a maximum margin classifier with the discovery of a suitable probabilistic topic model. In this way, the labelled data that is available for the learning-to-rank task can be exploited to identify the most appropriate topics. To this end, we use a unified constrained optimization framework, which can dynamically compute the latent topic similarity score between the query and the document. Our experimental results show a consistent improvement over the state-of-the-art learning-to-rank models.
Collaborating between Local and Global Learning for Distributed Online Multiple Tasks BIBAFull-Text 113-122
  Xin Jin; Ping Luo; Fuzhen Zhuang; Jia He; Qing He
This paper studies the novel learning scenarios of Distributed Online Multi-tasks (DOM), where the learning individuals with continuously arriving data are distributed separately and meanwhile they need to learn individual models collaboratively. It has three characteristics: distributed learning, online learning and multi-task learning. It is motivated by the emerging applications of wearable devices, which aim to provide intelligent monitoring services, such as health emergency alarming and movement recognition.
   To the best of our knowledge, no previous work has been done for this kind of problems. Thus, in this paper a collaborative learning scheme is proposed for this problem. Specifically, it performs local learning and global learning alternately. First, each client performs online learning using the increasing data locally. Then, DOM switches to global learning on the server side when some condition is triggered by clients. Here, an asynchronous online multi-task learning method is proposed for global learning. In this step, only this client's model, which triggers the global learning, is updated with the support of the difficult local data instances and the other clients' models. The experiments from 4 applications show that the proposed method of global learning can improve local learning significantly. DOM framework is effective, since it can share knowledge among distributed tasks and obtain better models than learning them separately. It is also communication efficient, which only requires the clients send a small portion of raw data to the server.

Session 1D: Text Processing

Lifespan-based Partitioning of Index Structures for Time-travel Text Search BIBAFull-Text 123-132
  Animesh Nandi; Suriya Subramanian; Sriram Lakshminarasimhan; Prasad M. Deshpande; Sriram Raghavan
Time-travel text search over a temporally evolving document collection is useful in various applications. Supporting a wide range of query classes demanded by these applications require different index layouts optimized for their respective query access patterns. The problem we tackle is how to efficiently handle different query classes using the same index layout.
   Our approach is to use list intersections on single-attribute indexes of keywords and temporal attributes. Although joint predicate evaluation on single-attribute indexes is inefficient in general, we show that partitioning the index based on version lifespans coupled with exploiting the transaction-time ordering of record-identifiers, can significantly reduce the cost of list intersections.
   We empirically evaluate different index partitioning alternatives on top of open-source Lucene, and show that our approach is the only technique that can simultaneously support a wide range of query classes efficiently, have high indexing throughput in a real-time ingestion setting, and also have negligible extra storage costs.
Contextual Text Understanding in Distributional Semantic Space BIBAFull-Text 133-142
  Jianpeng Cheng; Zhongyuan Wang; Ji-Rong Wen; Jun Yan; Zheng Chen
Representing discrete words in a continuous vector space turns out to be useful for natural language applications related to text understanding. Meanwhile, it poses extensive challenges, one of which is due to the polysemous nature of human language. A common solution (a.k.a word sense induction) is to separate each word into multiple senses and create a representation for each sense respectively. However, this approach is usually computationally expensive and prone to data sparsity, since each sense needs to be managed discriminatively. In this work, we propose a new framework for generating context-aware text representations without diving into the sense space. We model the concept space shared among senses, resulting in a framework that is efficient in both computation and storage. Specifically, the framework we propose is one that: i) projects both words and concepts into the same vector space; ii) obtains unambiguous word representations that not only preserve the uniqueness among words, but also reflect their context-appropriate meanings. We demonstrate the effectiveness of the framework in a number of tasks on text understanding, including word/phrase similarity measurements, paraphrase identification and question-answer relatedness classification.
External Knowledge and Query Strategies in Active Learning: a Study in Clinical Information Extraction BIBAFull-Text 143-152
  Mahnoosh Kholghi; Laurianne Sitbon; Guido Zuccon; Anthony Nguyen
This paper presents a new active learning query strategy for information extraction, called Domain Knowledge Informativeness (DKI). Active learning is often used to reduce the amount of annotation effort required to obtain training data for machine learning algorithms. A key component of an active learning approach is the query strategy, which is used to iteratively select samples for annotation. Knowledge resources have been used in information extraction as a means to derive additional features for sample representation. DKI is, however, the first query strategy that exploits such resources to inform sample selection. To evaluate the merits of DKI, in particular with respect to the reduction in annotation effort that the new query strategy allows to achieve, we conduct a comprehensive empirical comparison of active learning query strategies for information extraction within the clinical domain. The clinical domain was chosen for this work because of the availability of extensive structured knowledge resources which have often been exploited for feature generation. In addition, the clinical domain offers a compelling use case for active learning because of the necessary high costs and hurdles associated with obtaining annotations in this domain. Our experimental findings demonstrated that (1) amongst existing query strategies, the ones based on the classification model's confidence are a better choice for clinical data as they perform equally well with a much lighter computational load, and (2) significant reductions in annotation effort are achievable by exploiting knowledge resources within active learning query strategies, with up to 14% less tokens and concepts to manually annotate than with state-of-the-art query strategies.
Ranking Deep Web Text Collections for Scalable Information Extraction BIBAFull-Text 153-162
  Pablo Barrio; Luis Gravano; Chris Develder
Information extraction (IE) systems discover structured information from natural language text, to enable much richer querying and data mining than possible directly over the unstructured text. Unfortunately, IE is generally a computationally expensive process, and hence improving its efficiency, so that it scales over large volumes of text, is of critical importance. State-of-the-art approaches for scaling the IE process focus on one text collection at a time. These approaches prioritize the extraction effort by learning keyword queries to identify the "useful" documents for the IE task at hand, namely, those that lead to the extraction of structured "tuples." These approaches, however, do not attempt to predict which text collections are useful for the IE task -- and hence merit further processing -- and which ones will not contribute any useful output -- and hence should be ignored altogether, for efficiency. In this paper, we focus on an especially valuable family of text sources, the so-called deep web collections, whose (remote) contents are only accessible via querying. Specifically, we introduce and study techniques for ranking deep web collections for an IE task, to prioritize the extraction effort by focusing on collections with substantial numbers of useful documents for the task. We study both (adaptations of) state-of-the-art resource selection strategies for distributed information retrieval, and IE-specific approaches. Our extensive experimental evaluation over realistic deep web collections, and for several different IE tasks, shows the merits and limitations of the alternative families of approaches, and provides a roadmap for addressing this critically important building block for efficient, scalable information extraction.

Session 1E: Applications

Forming Online Support Groups for Internet and Behavior Related Addictions BIBAFull-Text 163-172
  Chih-Ya Shen; Hong-Han Shuai; De-Nian Yang; Yi-Feng Lan; Wang-Chien Lee; Philip S. Yu; Ming-Syan Chen
While online social networks have become a part of many people's daily lives, Internet and social network addictions (ISNAs) have been noted recently. With increased patients in addictive Internet use, clinicians often form support groups to help patients. This has become a trend because groups organized around therapeutic goals can effectively enrich members with insight and guidance while holding everyone accountable along the way. With the emergence of online social network services, there is a trend to form support groups online with the aid of mental health professionals. Nevertheless, it becomes impractical for a psychiatrist to manually select the group members because she faces an enormous number of candidates, while the selection criteria are also complicated since they span both the social and symptom dimensions. To effectively address the need of mental healthcare professionals, this paper makes the first attempt to study a new problem, namely Member Selection for Online Support Group (MSSG). The problem aims to maximize the similarity of the symptoms of all selected members, while ensuring that any two members are unacquainted to each other. We prove that MSSG is NP-Hard and inapproximable within any ratio, and design a 3-approximation algorithm with a guaranteed error bound. We evaluate MSSG via a user study with 11 mental health professionals, and the results manifest that MSSG can effectively find support group members satisfying the member selection criteria. Experimental results on large-scale real datasets also demonstrate that our proposed algorithm outperforms other baselines in terms of solution quality and efficiency.
Concept-Based Relevance Models for Medical and Semantic Information Retrieval BIBAFull-Text 173-182
  Chunye Wang; Ramakrishna Akella
Relevance models provide an important approach for estimating probabilities of words in the relevant class. However, the associated bag-of-words assumption breaks dependencies between words, especially between those within a phrase. If such dependencies could be preserved, it would permit matching the query terms with document terms having the same dependencies. Additionally, during the estimation of relevance, relevance models are unable to distinguish relevant and non-relevant information in a feedback document, and hence take the entire document into account, which potentially hurts the accuracy of estimation. In this paper, we define the notion of "concept", and design a concept-based information retrieval framework. Using this framework, we transform documents and queries from term space into concept space, and propose a concept-based relevance model for improved estimation of relevance. Our approach has three advantages. First, this approach only assumes independence between concepts, so is able to keep the strong dependencies between the words of a concept. Second, it unifies synonyms or different surface forms of a concept, leading to reduced dimensionality of the space, increased sample size of a concept, and consequently more accurate and reliable estimates of the relevance. Third, when knowledge bases are available, our approach enables the semantic analysis of query concepts, and thus identifies concepts related to the query, from which a more accurate distribution of relevance can be estimated. This work is aligned with semantic search methods.
   We apply our concept-based relevance model to information retrieval in the medical domain, where concepts are abundant and their variations are numerous. We compare with relevance models, BM25 with pseudo relevance feedback, and the state of the art conceptual language models, on several data collections. The proposed model demonstrates consistent and statistically significant improvements across collections, outperforming top benchmark conceptual language models by at least 9% and up to 20% on a number of metrics.
PlateClick: Bootstrapping Food Preferences Through an Adaptive Visual Interface BIBAFull-Text 183-192
  Longqi Yang; Yin Cui; Fan Zhang; John P. Pollak; Serge Belongie; Deborah Estrin
Food preference learning is an important component of wellness applications and restaurant recommender systems as it provides personalized information for effective food targeting and suggestions. However, existing systems require some form of food journaling to create a historical record of an individual's meal selections. In addition, current interfaces for food or restaurant preference elicitation rely extensively on text-based descriptions and rating methods, which can impose high cognitive load, thereby hampering wide adoption.
   In this paper, we propose PlateClick, a novel system that bootstraps food preference using a simple, visual quiz-based user interface. We leverage a pairwise comparison approach with only visual content. Using over 10,028 recipes collected from Yummly, we design a deep convolutional neural network (CNN) to learn the similarity distance metric between food images. Our model is shown to outperform state-of-the-art CNN by 4 times in terms of mean Average Precision. We explore a novel online learning framework that is suitable for learning users' preferences across a large scale dataset based on a small number of interactions (≤ 15). Our online learning approach balances exploitation-exploration and takes advantage of food similarities using preference-propagation in locally connected graphs.
   We evaluated our system in a field study of 227 anonymous users. The results demonstrate that our method outperforms other baselines by a significant margin, and the learning process can be completed in less than one minute. In summary, PlateClick provides a light-weight, immersive user experience for efficient food preference elicitation.
Data Driven Water Pipe Failure Prediction: A Bayesian Nonparametric Approach BIBAFull-Text 193-202
  Peng Lin; Bang Zhang; Yi Wang; Zhidong Li; Bin Li; Yang Wang; Fang Chen
Water pipe failures can cause significant economic and social costs, hence have become the primary challenge to water utilities. In this paper, we propose a Bayesian nonparametric approach, namely the Dirichlet process mixture of hierarchical beta process model, for water pipe failure prediction. It can select high-risk pipes for physical condition assessment, thereby preventing disastrous failures proactively.
   The proposed method is adaptable to the diversity of failure patterns. Its model structure and complexity can automatically adjust according to observed data. Additionally, the sparse failure data problem that often occurs in real-world data is tackled by the proposed method via flexible pipe grouping and failure data sharing. An approximated yet computational efficient Metropolis-within-Gibbs sampling method is developed with the exploitation of the failure data sparsity for model parameter inference.
   The proposed method has been applied to a metropolitan water supply network. The details of the application context are also presented for demonstrating its real-life impact. The comparison experiments conducted on the metropolitan water pipe data show that the proposed approach significantly outperforms the state-of-the-art prediction methods, and it is capable of bringing enormous economic and social savings to water utilities.

Session 1F: Social Media 1

Tumblr Blog Recommendation with Boosted Inductive Matrix Completion BIBAFull-Text 203-212
  Donghyuk Shin; Suleyman Cetintas; Kuang-Chih Lee; Inderjit S. Dhillon
Popular microblogging sites such as Tumblr have attracted hundreds of millions of users as a content sharing platform, where users can create rich content in the form of posts that are shared with other users who follow them. Due to the sheer amount of posts created on such services, an important task is to make quality recommendations of blogs for users to follow. Apart from traditional recommender system settings where the follower graph is the main data source, additional side-information of users and blogs such as user activity (e.g., like and reblog) and rich content (e.g., text and images) are also available to be exploited for enhanced recommendation performance. In this paper, we propose a novel boosted inductive matrix completion method (BIMC) for blog recommendation. BIMC is an additive low-rank model for user-blog preferences consisting of two components; one component captures the low-rank structure of follow relationships and the other captures the latent structure using side-information. Our model formulation combines the power of the recently proposed inductive matrix completion (IMC) model (for side-information) together with a standard matrix completion (MC) model (for low-rank structure). Furthermore, we utilize recently developed deep learning techniques to obtain semantically rich feature representations of text and images that are incorporated in BIMC. Experiments on a large-scale real-world dataset from Tumblr illustrate the effectiveness of the proposed BIMC method.
BiasWatch: A Lightweight System for Discovering and Tracking Topic-Sensitive Opinion Bias in Social Media BIBAFull-Text 213-222
  Haokai Lu; James Caverlee; Wei Niu
We propose a lightweight system for (i) semi-automatically discovering and tracking bias themes associated with opposing sides of a topic; (ii) identifying strong partisans who drive the online discussion; and (iii) inferring the opinion bias of "regular" participants. By taking just two hand-picked seeds to characterize the topic-space (e.g., "pro-choice" and "pro-life") as weak labels, we develop an efficient optimization-based opinion bias propagation method over the social/information network. We show how this approach leads to a 20% accuracy improvement versus a next-best alternative for bias estimation, as well as uncovering the opinion leaders and evolving themes associated with these topics. We also demonstrate how the inferred opinion bias can be integrated into user recommendation, leading to a 26% improvement in precision.
Knowlywood: Mining Activity Knowledge From Hollywood Narratives BIBAFull-Text 223-232
  Niket Tandon; Gerard de Melo; Abir De; Gerhard Weikum
Despite the success of large knowledge bases, one kind of knowledge that has not received attention so far is that of human activities. An example of such an activity is proposing to someone (to get married). For the computer, knowing that this involves two adults, often but not necessarily a woman and a man, that it often takes place in some romantic location, that it typically involves flowers or jewelry, and that it is usually followed by kissing, is a valuable asset for tasks like natural language dialog, scene understanding, or video search.
   This corresponds to the challenging task of acquiring semantic frames that capture human activities, their participating agents, and their typical spatio-temporal contexts. This paper presents a novel approach that taps into movie scripts and other narrative texts. We develop a pipeline for semantic parsing and knowledge distillation, to systematically compile semantically refined activity frames.
   The resulting knowledge base contains hundreds of thousands of activity frames, mined from about two million scenes of movies, TV series, and novels. A manual assessment study, with extensive sampling and statistical significance tests, shows that the frames and their attribute values have an accuracy of at least 80 percent. We also demonstrate the usefulness of activity knowledge by the extrinsic use case of movie scene search.
Entity and Aspect Extraction for Organizing News Comments BIBAFull-Text 233-242
  Radityo Eko Prasojo; Mouna Kacimi; Werner Nutt
News websites give their users the opportunity to participate in discussions about published articles, by writing comments. Typically, these comments are unstructured making it hard to understand the flow of user discussions. Thus, there is a need for organizing comments to help users to (1) gain more insights about news topics, and (2) have an easy access to comments that trigger their interests. In this work, we address the above problem by organizing comments around the entities and the aspects they discuss. More specifically, we propose an approach for entity and aspect extraction from user comments through the following contributions. First, we extend traditional Named-Entity Recognition approaches, using coreference resolution and external knowledge bases, to detect more occurrences of entities in comments. Second, we exploit part-of-speech tag, dependency tag, and lexical databases to extract explicit and implicit aspects around discussed entities. Third, we evaluate our entity and aspect extraction approach, on manually annotated data, showing that it highly increases precision and recall compared to baseline approaches.

Session 2A: Graphs

HDRF: Stream-Based Partitioning for Power-Law Graphs BIBAFull-Text 243-252
  Fabio Petroni; Leonardo Querzoni; Khuzaima Daudjee; Shahin Kamali; Giorgio Iacoboni
Balanced graph partitioning is a fundamental problem that is receiving growing attention with the emergence of distributed graph-computing (DGC) frameworks. In these frameworks, the partitioning strategy plays an important role since it drives the communication cost and the workload balance among computing nodes, thereby affecting system performance. However, existing solutions only partially exploit a key characteristic of natural graphs commonly found in the real-world: their highly skewed power-law degree distributions. In this paper, we propose High-Degree (are) Replicated First (HDRF), a novel streaming vertex-cut graph partitioning algorithm that effectively exploits skewed degree distributions by explicitly taking into account vertex degree in the placement decision. We analytically and experimentally evaluate HDRF on both synthetic and real-world graphs and show that it outperforms all existing algorithms in partitioning quality.
Towards Scale-out Capability on Social Graphs BIBAFull-Text 253-262
  Haichuan Shang; Xiang Zhao; Uday Kiran; Masaru Kitsuregawa
The development of cloud storage and computing has facilitated the rise of various big data applications. As a representative high performance computing (HPC) workload, graph processing is becoming a part of cloud computing. However, scalable computing on large graphs is still dominated by HPC solutions, which require high performance all-to-all collective operations over torus (or mesh) networking. Implementing those torus-based algorithms on commodity clusters, e.g., cloud computing infrastructures, can result in great latency due to inefficient communication. Moreover, designing a highly scalable system for large social graphs, is far from being trivial, as intrinsic features of social graphs, e.g., degree skewness and lacking of locality, often profoundly limit the extent of parallelism.
   To resolve the challenges, we explore the iceberg of developing a scalable system for processing large social graphs on commodity clusters. In particular, we focus on the scale-out capability of the system. We propose a novel separator-combiner based query processing engine which provides native load-balancing and very low communication overhead, such that increasingly larger graphs can be simply addressed by adding more computing nodes to the cluster.
   The proposed system achieves remarkable scale-out capability in processing large social graphs with skew degree distributions, while providing many critical features for big data analytics, such as easy-to-use API, fault-tolerance and recovery. We implement the system as a portable and easily configurable library, and conduct comprehensive experimental studies to demonstrate its effectiveness and efficiency.
Identifying Top-k Structural Hole Spanners in Large-Scale Social Networks BIBAFull-Text 263-272
  Mojtaba Rezvani; Weifa Liang; Wenzheng Xu; Chengfei Liu
Recent studies have shown that in social networks, users who bridge different communities, known as structural hole spanners, have great potentials to acquire available resources from these communities and gain access to multiple sources of information flow. Structural hole spanners are crucial in many applications such as community detections, diffusion controls, and viral marketing. In spite of their importance, not much attention has been paid to them. Particularly, how to characterize the structural hole spanner properties and how to devise efficient yet scalable algorithms to find them are fundamental issues. In this paper, we formulate the problem as the top-k structural hole spanner problem. Specifically, we first provide a generic model to measure the quality of structural hole spanners, by exploring their properties, and show that the problem is NP-hard. We then devise efficient and scalable algorithms, by exploiting the bounded inverse closeness centralities of vertices and making use of articulation points of the network. We finally evaluate the performance of the proposed algorithms through extensive experiments on real and synthetic datasets, and validate the effectiveness of the proposed model. Our experimental results demonstrate that the proposed model can capture the characteristics of structural hole spanners accurately, and the proposed algorithms are very promising.
Scalable Facility Location for Massive Graphs on Pregel-like Systems BIBAFull-Text 273-282
  Kiran Garimella; Gianmarco De Francisci Morales; Aristides Gionis; Mauro Sozio
We propose a new scalable algorithm for the facility-location problem. We study the graph setting, where the cost of serving a client from a facility is represented by the shortest-path distance on a graph. This setting is applicable to various problems arising in the Web and social media, and allows to leverage the inherent sparsity of such graphs.
   To obtain truly scalable performance, we design a parallel algorithm that operates on clusters of shared-nothing machines. In particular, we target modern Pregel-like architectures, and we implement our algorithm on Apache Giraph.
   Our work builds upon previous results: a facility location algorithm for the PRAM model, a recent distance-sketching method for massive graphs, and a parallel algorithm to finding maximal independent sets. The main challenge is to adapt those building blocks to the distributed graph setting, while maintaining the approximation guarantee and limiting the amount of distributed communication. Extensive experimental results show that our algorithm scales gracefully to graphs with billions of edges, while, in terms of quality, being competitive with state-of-the-art sequential algorithms.

Session 2B: Retrieval Algorithms

Rank by Time or by Relevance?: Revisiting Email Search BIBAFull-Text 283-292
  David Carmel; Guy Halawi; Liane Lewin-Eytan; Yoelle Maarek; Ariel Raviv
With Web mail services offering larger and larger storage capacity, most users do not feel the need to systematically delete messages anymore and inboxes keep growing. It is quite surprising that in spite of the huge progress of relevance ranking in Web Search, mail search results are still typically ranked by date. This can probably be explained by the fact that users demand perfect recall in order to "re-find" a previously seen message, and would not trust relevance ranking. Yet mail search is still considered a difficult and frustrating task, especially when trying to locate older messages. In this paper, we study the current search traffic of Yahoo mail, a major Web commercial mail service, and discuss the limitations of ranking search results by date. We argue that this sort-by-date paradigm needs to be revisited in order to account for the specific structure and nature of mail messages, as well as the high-recall needs of users. We describe a two-phase ranking approach, in which the first phase is geared towards maximizing recall and the second phase follows a learning-to-rank approach that considers a rich set of mail-specific features to maintain precision. We present our results obtained on real mail search query traffic, for three different datasets, via manual as well as automatic evaluation. We demonstrate that the default time-driven ranking can be significantly improved in terms of both recall and precision, by taking into consideration time recency and textual similarity to the query, as well as mail-specific signals such as users' actions.
On the Cost of Extracting Proximity Features for Term-Dependency Models BIBAFull-Text 293-302
  Xiaolu Lu; Alistair Moffat; J. Shane Culpepper
Sophisticated ranking mechanisms make use of term dependency features in order to compute similarity scores for documents. These features often include exact phrase occurrences, and term proximity estimates. Both cases build on the intuition that if multiple query terms appear near each other, the document is more likely to be relevant to the query. In this paper we examine the processes used to compute these statistics. Two distinct input structures can be used -- inverted files and direct files. Inverted files must store the position offsets of the terms, while "direct" files represent each document as a sequence of preprocessed term identifiers. Based on these two input modalities, a number of algorithms can be used to compute proximity statistics. Until now, these algorithms have been described in terms of a single set of query terms. But similarity computations such as the Full Dependency Model compute proximity statistics for a collection of related term sets. We present a new approach in which such collections are processed holistically in time that is much less than would be the case if each subquery were to be evaluated independently. The benefits of the new method are demonstrated by a comprehensive experimental study.
An Optimization Framework for Merging Multiple Result Lists BIBAFull-Text 303-312
  Chia-Jung Lee; Qingyao Ai; W. Bruce Croft; Daniel Sheldon
Developing effective methods for fusing multiple ranked lists of documents is crucial to many applications. Federated web search, for instance, has become a common practice where a query is issued to different verticals and a single ranked list of blended results is created. While federated search is regarded as collection fusion, data fusion techniques aim at improving search coverage and precision by combining multiple search runs on a single document collection. In this paper, we study in depth and extend a neural network-based approach, LambdaMerge, for merging results of ranked lists drawn from one (i.e., data fusion) or more (i.e., collection fusion) verticals. The proposed model considers the impact of the quality of documents, ranked lists and verticals for producing the final merged result in an optimization framework. We further investigate the potential of incorporating deep structures into the model with an aim of determining better combinations of different evidence. In the experiments on collection fusion and data fusion, the proposed approach significantly outperforms several standard baselines and state-of-the-art learning-based approaches.
Searching and Stopping: An Analysis of Stopping Rules and Strategies BIBAFull-Text 313-322
  David Maxwell; Leif Azzopardi; Kalervo Järvelin; Heikki Keskustalo
Searching naturally involves stopping points, both at a query level (how far down the ranked list should I go?) and at a session level (how many queries should I issue?). Understanding when searchers stop has been of much interest to the community because it is fundamental to how we evaluate search behaviour and performance. Research has shown that searchers find it difficult to formalise stopping criteria, and typically resort to their intuition of what is "good enough". While various heuristics and stopping criteria have been proposed, little work has investigated how well they perform, and whether searchers actually conform to any of these rules. In this paper, we undertake the first large scale study of stopping rules, investigating how they influence overall session performance, and which rules best match actual stopping behaviour. Our work is focused on stopping at the query level in the context of ad-hoc topic retrieval, where searchers undertake search tasks within a fixed time period. We show that stopping strategies based upon the disgust or frustration point rules -- both of which capture a searcher's tolerance to non-relevance -- typically result in (i) the best overall performance, and (ii) provide the closest approximation to actual searcher behaviour, although a fixed depth approach also performs remarkably well. Findings from this study have implications regarding how we build measures, and how we conduct simulations of search behaviours.

Session 2C: Text Analysis

Automated News Suggestions for Populating Wikipedia Entity Pages BIBAFull-Text 323-332
  Besnik Fetahu; Katja Markert; Avishek Anand
Wikipedia entity pages are a valuable source of information for direct consumption and for knowledge-base construction, update and maintenance. Facts in these entity pages are typically supported by references. Recent studies show that as much as 20% of the references are from online news sources. However, many entity pages are incomplete even if relevant information is already available in existing news articles. Even for the already present references, there is often a delay between the news article publication time and the reference time. In this work, we therefore look at Wikipedia through the lens of news and propose a novel news-article suggestion task to improve news coverage in Wikipedia, and reduce the lag of newsworthy references. Our work finds direct application, as a precursor, to Wikipedia page generation and knowledge-base acceleration tasks that rely on relevant and high quality input sources.
   We propose a two-stage supervised approach for suggesting news articles to entity pages for a given state of Wikipedia. First, we suggest news articles to Wikipedia entities (article-entity placement) relying on a rich set of features which take into account the salience and relative authority of entities, and the novelty of news articles to entity pages. Second, we determine the exact section in the entity page for the input article (article-section placement) guided by class-based section templates. We perform an extensive evaluation of our approach based on ground-truth data that is extracted from external references in Wikipedia. We achieve a high precision value of up to 93% in the article-entity suggestion stage and upto 84% for the article-section placement. Finally, we compare our approach against competitive baselines and show significant improvements.
Mining Coordinated Intent Representation for Entity Search and Recommendation BIBAFull-Text 333-342
  Huizhong Duan; ChengXiang Zhai
We study the problem of learning query intent representation for an entity search task such as product retrieval, where a user would use a keyword query to retrieve entities based on their structured attribute value descriptions. Existing intent representation has been mostly based on the query space. These methods overlook the critical information from the entity space and the connection in between. Consequently, when such representation methods are used in intent mining from user engagement logs in entity search, they cannot fully discover the comprehensive knowledge of user preference, which is essential for improving the effectiveness of entity search and recommendation, as well as many applications such as business intelligence.
   To address this problem, we propose a novel Coordinated Intent Representation, where each user intent is represented collectively in both the query space and the entity space. Specifically, a coordinated intent representation consists of a language model to capture typical query terms used for search and a series of probabilistic distributions on entity attributes and attribute values to characterize the preferred features of entities for the corresponding intent. We propose a novel generative model to discover coordinated intent representations from the entity search logs. Evaluation in the domain of product search shows that the proposed model is effective for discovering meaningful coordinated shopping intents, and the discovered intent representation can be directly used for improving the accuracy of product search and recommendation.
Sentiment Extraction by Leveraging Aspect-Opinion Association Structure BIBAFull-Text 343-352
  Li Zhao; Minlie Huang; Jiashen Sun; Hengliang Luo; Xiankai Yang; Xiaoyan Zhu
Sentiment extraction aims to extract and group the task of extracting and grouping aspect and opinion words from online reviews. Previous works usually extract aspect and opinion words by leveraging association between a single pair of aspect and opinion word[5] [14] [9] [4][11], but the structure of aspect and opinion word clusters has not been fully exploited.
   In this paper, we investigate the aspect-opinion association structure, and propose a "first clustering, then extracting" unsupervised model to leverage properties of the structure for sentiment extraction. For the clustering purpose, we formalise a novel concept syntactic distribution consistency as soft constraint in the framework of posterior regularization; for the extraction purpose, we extract aspect and opinion words based on cluster-cluster association. In comparison to traditional word-word association, we show that cluster-cluster association is a much stronger signal to distinguish aspect (opinion) words from non-aspect (non-opinion) words. Extensive experiments demonstrate the effectiveness of the proposed approach and the advantages against state-of-the-art baselines.
Leveraging Joint Interactions for Credibility Analysis in News Communities BIBAFull-Text 353-362
  Subhabrata Mukherjee; Gerhard Weikum
Media seems to have become more partisan, often providing a biased coverage of news catering to the interest of specific groups. It is therefore essential to identify credible information content that provides an objective narrative of an event. News communities such as digg, reddit, or newstrust offer recommendations, reviews, quality ratings, and further insights on journalistic works. However, there is a complex interaction between different factors in such online communities: fairness and style of reporting, language clarity and objectivity, topical perspectives (like political viewpoint), expertise and bias of community members, and more.
   This paper presents a model to systematically analyze the different interactions in a news community between users, news, and sources. We develop a probabilistic graphical model that leverages this joint interaction to identify 1) highly credible news articles, 2) trustworthy news sources, and 3) expert users who perform the role of "citizen journalists" in the community. Our method extends CRF models to incorporate real-valued ratings, as some communities have very fine-grained scales that cannot be easily discretized without losing information. To the best of our knowledge, this paper is the first full-fledged analysis of credibility, trust, and expertise in news communities.

Session 2D: Clustering

Clustering-based Active Learning on Sensor Type Classification in Buildings BIBAFull-Text 363-372
  Dezhi Hong; Hongning Wang; Kamin Whitehouse
Commercial and industrial buildings account for a considerable portion of all energy consumed in the U.S., and thus reducing this energy consumption is a national grand challenge. Based on the large deployment of sensors in modern commercial buildings, many organizations are applying data analytic solutions to the thousands of sensing and control points to detect wasteful and incorrect operations for energy savings. Scaling this approach is challenging, however, because the metadata about these sensing and control points is inconsistent between buildings, or even missing altogether. Moreover, normalizing the metadata requires significant integration effort.
   In this work, we demonstrate a first step towards an automatic metadata normalization solution that requires minimal human intervention. We propose a clustering-based active learning algorithm to differentiate sensors in buildings by type, e.g., temperature v.s. humidity. Our algorithm exploits data clustering structure and propagates labels to their nearby unlabeled neighbors to accelerate the learning process. We perform a comprehensive study on metadata collected from over 20 different sensor types and 2,500 sensor streams in three commercial buildings. Our approach is able to achieve more than 92% accuracy for type classification with much less labeled examples than baselines. As a proof-of-concept, we also demonstrate a typical analytic application enabled by the normalized metadata.
gSparsify: Graph Motif Based Sparsification for Graph Clustering BIBAFull-Text 373-382
  Peixiang Zhao
Graph clustering is a fundamental problem that partitions vertices of a graph into clusters with an objective to optimize the intuitive notions of intra-cluster density and intercluster sparsity. In many real-world applications, however, the sheer sizes and inherent complexity of graphs may render existing graph clustering methods inefficient or incapable of yielding quality graph clusters. In this paper, we propose gSparsify, a graph sparsification method, to preferentially retain a small subset of edges from a graph which are more likely to be within clusters, while eliminating others with less or no structure correlation to clusters. The resultant simplified graph is succinct in size with core cluster structures well preserved, thus enabling faster graph clustering without a compromise to clustering quality. We consider a quantitative approach to modeling the evidence that edges within densely knitted clusters are frequently involved in small-size graph motifs, which are adopted as prime features to differentiate edges with varied cluster significance. Path-based indexes and path-join algorithms are further designed to compute graph-motif based cluster significance of edges for graph sparsification. We perform experimental studies in real-world graphs, and results demonstrate that gSparsify can bring significant speedup to existing graph clustering methods with an improvement to graph clustering quality.
Incomplete Multi-view Clustering via Subspace Learning BIBAFull-Text 383-392
  Qiyue Yin; Shu Wu; Liang Wang
Multi-view clustering, which explores complementary information between multiple distinct feature sets for better clustering, has a wide range of applications, e.g., knowledge management and information retrieval. Traditional multi-view clustering methods usually assume that all examples have complete feature sets. However, in real applications, it is often the case that some examples lose some feature sets, which results in incomplete multi-view data and notable performance degeneration. In this paper, a novel incomplete multi-view clustering method is therefore developed, which learns unified latent representations and projection matrices for the incomplete multi-view data. To approximate the high level scaled indicator matrix defined to represent class label matrix, the latent representations are expected to be non-negative and column orthogonal. Besides, since data are often with high dimensional and noisy features, the projection matrices are enforced to be sparse so as to select relevant features when learning the latent space. Furthermore, the inter-view and intra-view data structure is preserved to further enhance the clustering performance. To these ends, an objective is developed with efficient optimization strategy and convergence analysis. Extensive experiments demonstrate that our model performs better than the state-of-the-art multi-view clustering methods in various settings.
Robust Subspace Clustering via Tighter Rank Approximation BIBAFull-Text 393-401
  Zhao Kang; Chong Peng; Qiang Cheng
Matrix rank minimization problem is in general NP-hard. The nuclear norm is used to substitute the rank function in many recent studies. Nevertheless, the nuclear norm approximation adds all singular values together and the approximation error may depend heavily on the magnitudes of singular values. This might restrict its capability in dealing with many practical problems. In this paper, an arctangent function is used as a tighter approximation to the rank function. We use it on the challenging subspace clustering problem. For this nonconvex minimization problem, we develop an effective optimization procedure based on a type of augmented Lagrange multipliers (ALM) method. Extensive experiments on face clustering and motion segmentation show that the proposed method is effective for rank approximation.

Session 2E: Users and Predictions

Interactive User Group Analysis BIBAFull-Text 403-412
  Behrooz Omidvar-Tehrani; Sihem Amer-Yahia; Alexandre Termier
User data is becoming increasingly available in multiple domains ranging from phone usage traces to data on the social Web. The analysis of user data is appealing to scientists who work on population studies, recommendations, and large-scale data analytics. We argue for the need for an interactive analysis to understand the multiple facets of user data and address different analytics scenarios. Since user data is often sparse and noisy, we propose to produce labeled groups that describe users with common properties and develop IUGA, an interactive framework based on group discovery primitives to explore the user space. At each step of IUGA, an analyst visualizes group members and may take an action on the group (add/remove members) and choose an operation (exploit/explore) to discover more groups and hence more users. Each discovery operation results in k most relevant and diverse groups. We formulate group exploitation and exploration as optimization problems and devise greedy algorithms to enable efficient group discovery. Finally, we design a principled validation methodology and run extensive experiments that validate the effectiveness of IUGA on large datasets for different user space analysis scenarios.
Viewability Prediction for Online Display Ads BIBAFull-Text 413-422
  Chong Wang; Achir Kalra; Cristian Borcea; Yi Chen
As a massive industry, display advertising delivers advertisers' marketing messages to attract customers through graphic banners on webpages. Advertisers are charged by ad serving, where their ads are shown in web pages. However, recent studies show that about half of the ads were actually never seen by users because they do not scroll deep enough to bring the ads in-view. Thus, the ad pricing standards are shifting to a new model: ads are paid if they are in view, not just being served. To the best of our knowledge, this paper is the first to address the important problem of ad viewability prediction which can improve the performance of guaranteed ad delivery, real-time bidding, as well as recommender systems. We analyze a real-life dataset from a large publisher, identify a number of features that impact the scroll depth for a given user and a page, and propose a probabilistic latent class model that predicts the viewability of any given scroll depth for a user-page pair. The experiments demonstrate that our model outperforms comparison systems based on singular value decomposition and logistic regression, in terms of prediction quality and training time.
10 Bits of Surprise: Detecting Malicious Users with Minimum Information BIBAFull-Text 423-431
  Reza Zafarani; Huan Liu
Malicious users are a threat to many sites and defending against them demands innovative countermeasures. When malicious users join sites, they provide limited information about themselves. With this limited information, sites can find it difficult to distinguish between a malicious user and a normal user. In this study, we develop a methodology that identifies malicious users with limited information. As information provided by malicious users can vary, the proposed methodology utilizes minimum information to identify malicious users. It is shown that as little as 10 bits of information can help greatly in this challenging task. The experiments results verify that this methodology is effective in identifying malicious users in the realistic scenario of limited information availability.
MAPer: A Multi-scale Adaptive Personalized Model for Temporal Human Behavior Prediction BIBAFull-Text 433-442
  Sarah Masud Preum; John A. Stankovic; Yanjun Qi
The primary objective of this research is to develop a simple and interpretable predictive framework to perform temporal modeling of individual user's behavior traits based on each person's past observed traits/behavior. Individual-level human behavior patterns are possibly influenced by various temporal features (e.g., lag, cycle) and vary across temporal scales (e.g., hour of the day, day of the week). Most of the existing forecasting models do not capture such multi-scale adaptive regularity of human behavior or lack interpretability due to relying on hidden variables. Hence, we build a multi-scale adaptive personalized (MAPer) model that quantifies the effect of both lag and behavior cycle for predicting future behavior. MAper includes a novel basis vector to adaptively learn behavior patterns and capture the variation of lag and cycle across multi-scale temporal contexts. We also extend MAPer to capture the interaction among multiple behaviors to improve the prediction performance.
   We demonstrate the effectiveness of MAPer on four real datasets representing different behavior domains, including, habitual behavior collected from Twitter, need based behavior collected from search logs, and activities of daily living collected from a single resident and a multi-resident home. Experimental results indicate that MAPer significantly improves upon the state-of-the-art and baseline methods and at the same time is able to explain the temporal dynamics of individual-level human behavior.

Session 2F: Heterogeneous Networks

Classification with Active Learning and Meta-Paths in Heterogeneous Information Networks BIBAFull-Text 443-452
  Chang Wan; Xiang Li; Ben Kao; Xiao Yu; Quanquan Gu; David Cheung; Jiawei Han
A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Meta-paths are sequences of object types. They are used to represent complex relationships between objects beyond what links in a homogeneous network capture. We study the problem of classifying objects in an HIN. We propose class-level meta-paths and study how they can be used to (1) build more accurate classifiers and (2) improve active learning in identifying objects for which training labels should be obtained. We show that class-level meta-paths and object classification exhibit interesting synergy. Our experimental results show that the use of class-level meta-paths results in very effective active learning and good classification performance in HINs.
Semantic Path based Personalized Recommendation on Weighted Heterogeneous Information Networks BIBAFull-Text 453-462
  Chuan Shi; Zhiqiang Zhang; Ping Luo; Philip S. Yu; Yading Yue; Bin Wu
Recently heterogeneous information network (HIN) analysis has attracted a lot of attention, and many data mining tasks have been exploited on HIN. As an important data mining task, recommender system includes a lot of object types (e.g., users, movies, actors, and interest groups in movie recommendation) and the rich relations among object types, which naturally constitute a HIN. The comprehensive information integration and rich semantic information of HIN make it promising to generate better recommendations. However, conventional HINs do not consider the attribute values on links, and the widely used meta path in HIN may fail to accurately capture semantic relations among objects, due to the existence of rating scores (usually ranging from 1 to 5) between users and items in recommender system. In this paper, we are the first to propose the weighted HIN and weighted meta path concepts to subtly depict the path semantics through distinguishing different link attribute values. Furthermore, we propose a semantic path based personalized recommendation method SemRec to predict the rating scores of users on items. Through setting meta paths, SemRec not only flexibly integrates heterogeneous information but also obtains prioritized and personalized weights representing user preferences on paths. Experiments on two real datasets illustrate that SemRec achieves better recommendation performance through flexibly integrating information with the help of weighted meta paths.
A Graph-based Recommendation across Heterogeneous Domains BIBAFull-Text 463-472
  Deqing Yang; Jingrui He; Huazheng Qin; Yanghua Xiao; Wei Wang
Given the users from a social network site, who have been tagged with a set of terms, how can we recommend the movies tagged with a completely different set of terms hosted by another website? Given the users from a website dedicated to Type I and Type II diabetes, how can we recommend the discussion threads from another website dedicated to gestational diabetes, where the keywords used in the two websites might be quite diverse? In other words, how can we recommend across heterogeneous domains characterized by barely overlapping feature sets?
   Despite the vast amount of existing work devoted to recommendation within homogeneous domains (e.g., with the same set of features), or collaborative filtering, emerging applications call for new techniques to address the problem of recommendation across heterogeneous domains, such as recommending movies hosted by one website to users from another website with barely overlapping tags. To this end, in this paper, we propose a graph-based approach for recommendation across heterogeneous domains. Specifically, for each domain, we use a bipartite graph to represent the relationships between its entities and features. Furthermore, to bridge the gap among multiple heterogeneous domains with barely overlapping sets of features, we propose to infer their semantic relatedness through concept-based interpretation distilled from online encyclopedias, e.g., Wikipedia and Baike. Finally, we propose an efficient propagation algorithm to obtain the similarity between entities from heterogeneous domains. Experimental results on both Weibo-Douban data set and Diabetes data set demonstrate the effectiveness and efficiency of our algorithm.
Query Relaxation across Heterogeneous Data Sources BIBAFull-Text 473-482
  Verena Kantere; George Orfanoudakis; Anastasios Kementsietsidis; Timos Sellis
The fundamental assumption for query rewriting in heterogeneous environments is that the mappings used for the rewriting are complete, i.e., every relation and attribute mentioned in the query is associated, through mappings, to relations and attributes in the schema of the source that the query is rewritten. In reality, it is rarely the case that such complete sets of mappings exist between sources, and the presence of partial mappings is the norm rather than the exception. So, practically, existing query answering algorithms fail to generate any rewriting in the majority of cases. The question is then whether we can somehow relax queries that cannot be rewritten as such (due to insufficient mappings), and whether we can identify the interesting query relaxations, given the mappings at hand.
   In this paper, we propose a technique to compute query relaxations of an input query that can be rewritten and evaluated in an environment of collaborating autonomous and heterogeneous data sources. We extend traditional techniques for query rewriting, and we propose both an exhaustive and an optimized heuristic algorithm to compute and evaluate these relaxations. Our technique works with input of any query similarity measure. The experimental study proves the effectiveness and efficiency of our technique.

Session 3A: Veracity

Approximated Summarization of Data Provenance BIBAFull-Text 483-492
  Eleanor Ainy; Pierre Bourhis; Susan B. Davidson; Daniel Deutch; Tova Milo
Many modern applications involve collecting large amounts of data from multiple sources, and then aggregating and manipulating it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes it difficult to understand how the resulting information was derived. Data provenance has proven helpful in this respect, however, maintaining and presenting the full and exact provenance information may be infeasible due to its size and complexity. We therefore introduce the notion of approximated summarized provenance, which provides a compact representation of the provenance at the possible cost of information loss. Based on this notion, we present a novel provenance summarization algorithm which, based on the semantics of the underlying data and the intended use of provenance, outputs a summary of the input provenance. Experiments measure the conciseness and accuracy of the resulting provenance summaries, and improvement in provenance usage time.
An Integrated Bayesian Approach for Effective Multi-Truth Discovery BIBAFull-Text 493-502
  Xianzhi Wang; Quan Z. Sheng; Xiu Susie Fang; Lina Yao; Xiaofei Xu; Xue Li
Truth-finding is the fundamental technique for corroborating reports from multiple sources in both data integration and collective intelligent applications. Traditional truth-finding methods assume a single true value for each data item and therefore cannot deal will multiple true values (i.e., the multi-truth-finding problem). So far, the existing approaches handle the multi-truth-finding problem in the same way as the single-truth-finding problems. Unfortunately, the multi-truth-finding problem has its unique features, such as the involvement of sets of values in claims, different implications of inter-value mutual exclusion, and larger source profiles. Considering these features could provide new opportunities for obtaining more accurate truth-finding results. Based on this insight, we propose an integrated Bayesian approach to the multi-truth-finding problem, by taking these features into account. To improve the truth-finding efficiency, we reformulate the multi-truth-finding problem model based on the mappings between sources and (sets of) values. New mutual exclusive relations are defined to reflect the possible co-existence of multiple true values. A finer-grained copy detection method is also proposed to deal with sources with large profiles. The experimental results on three real-world datasets show the effectiveness of our approach.
Approximate Truth Discovery via Problem Scale Reduction BIBAFull-Text 503-512
  Xianzhi Wang; Quan Z. Sheng; Xiu Susie Fang; Xue Li; Xiaofei Xu; Lina Yao
Many real-world applications rely on multiple data sources to provide information on their interested items. Due to the noises and uncertainty in data, given a specific item, the information from different sources may conflict. To make reliable decisions based on these data, it is important to identify the trustworthy information by resolving these conflicts, i.e., the truth discovery problem. Current solutions to this problem detect the veracity of each value jointly with the reliability of each source for each data item. In this way, the efficiency of truth discovery is strictly confined by the problem scale, which in turn limits truth discovery algorithms from being applicable on a large scale. To address this issue, we propose an approximate truth discovery approach, which divides sources and values into groups according to a user-specified approximation criterion. The groups are then used for efficient inter-value influence computation to improve the accuracy. Our approach is applicable to most existing truth discovery algorithms. Experiments on real-world datasets show that our approach improves the efficiency compared to existing algorithms while achieving similar or even better accuracy. The scalability is further demonstrated by experiments on large synthetic datasets.

Session 3B: Social Networks 1

Organic or Organized?: Exploring URL Sharing Behavior BIBAFull-Text 513-522
  Cheng Cao; James Caverlee; Kyumin Lee; Hancheng Ge; Jinwook Chung
URL sharing has become one of the most popular activities on many online social media platforms. Shared URLs are an avenue to interesting news articles, memes, photos, as well as low-quality content like spam, promotional ads, and phishing sites. While some URL sharing is organic, other sharing is strategically organized with a common purpose (e.g., aggressively promoting a website). In this paper, we investigate the individual-based and group-based user behavior of URL sharing in social media toward uncovering these organic versus organized user groups. Concretely, we propose a four-phase approach to model, identify, characterize, and classify organic and organized groups who engage in URL sharing. The key motivating insights of this approach are (i) that patterns of individual-based behavioral signals embedded in URL posting activities can uncover groups whose members engage in similar behaviors; and (ii) that group-level behavioral signals can distinguish between organic and organized user groups. Through extensive experiments, we find that levels of organized behavior vary by URL type and that the proposed approach achieves good performance -- an F-measure of 0.836 and Area Under the Curve of 0.921.
Mining Brokers in Dynamic Social Networks BIBAFull-Text 523-532
  Chonggang Song; Wynne Hsu; Mong Li Lee
The theory of brokerage in sociology suggests if contacts between two parties are enabled through a third party, the latter occupies a strategic position of controlling information flows. Such individuals are called brokers and they play a key role in disseminating information. However, there is no systematic approach to identify brokers in online social networks. In this paper, we formally define the problem of detecting top-$k$ brokers given a social network and show that it is NP-hard. We develop a heuristic algorithm to find these brokers based on the weak tie theory. In order to handle the dynamic nature of online social networks, we design incremental algorithms: WeakTie-Local for unidirectional networks and WeakTie-Bi for bidirectional networks. We use two real world datasets, DBLP and Twitter, to evaluate the proposed methods. We also demonstrate how the detected brokers are useful in diffusing information across communities and propagating tweets to reach more distinct users.
Who Will You "@"? BIBAFull-Text 533-542
  Yeyun Gong; Qi Zhang; Xuyang Sun; Xuanjing Huang
In Twitter-like social networking services, people can use the "@" symbol to mention other users in tweets and send them a message or link to their profiles. In recent years, social media services are rapidly growing with thousands of millions of users participating in them every day. When the "@" symbol is entered, there should be an automatic suggestion function which recommends a small list of candidates in order to help users to easily identify and input usernames. In this paper, we present our work on building a recommendation system for the mention function in microblogging services. The recommendation strategy we used takes into consideration not only content of the microblog but also histories of candidate users. To better handle these textual information, we propose a novel method that extends the translation-based model. Experimental results on the dataset we collected from a real world microblogging service demonstrate that the proposed method outperforms state-of-the-art approaches.

Session 3C: Query Completion

Characterizing and Predicting Voice Query Reformulation BIBAFull-Text 543-552
  Ahmed Hassan Awadallah; Ranjitha Gurunath Kulkarni; Umut Ozertem; Rosie Jones
Voice interactions are becoming more prevalent as the usage of voice search and intelligent assistants gains more popularity. Users frequently reformulate their requests in hope of getting better results either because the system was unable to recognize what they said or because it was able to recognize it but was unable to return the desired response. Query reformulation has been extensively studied in the context of text input. Many of the characteristics studied in the context of text query reformulation are potentially useful for voice query reformulation. However, voice query reformulation has its unique characteristics in terms of the reasons that lead users to reformulating their queries and how they reformulate them. In this paper, we study the problem of voice query reformulation. We perform a large scale human annotation study to collect thousands of labeled instances of voice reformulation and non-reformulation query pairs. We use this data to compare and contrast characteristics of reformulation and non-reformulation queries over a large a number of dimensions. We then train classifiers to distinguish between reformulation and non-reformulation query pairs and to predict the rationale behind reformulation. We demonstrate through experiments with the human labeled data that our classifiers achieve good performance in both tasks.
A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion BIBAFull-Text 553-562
  Alessandro Sordoni; Yoshua Bengio; Hossein Vahabi; Christina Lioma; Jakob Grue Simonsen; Jian-Yun Nie
Users may strive to formulate an adequate textual query for their information need. Search engines assist the users by presenting query suggestions. To preserve the original search intent, suggestions should be context-aware and account for the previous queries issued by the user. Achieving context awareness is challenging due to data sparsity. We present a novel hierarchical recurrent encoder-decoder architecture that makes possible to account for sequences of previous queries of arbitrary lengths. As a result, our suggestions are sensitive to the order of queries in the context while avoiding data sparsity. Additionally, our model can suggest for rare, or long-tail, queries. The produced suggestions are synthetic and are sampled one word at a time, using computationally cheap decoding techniques. This is in contrast to current synthetic suggestion models relying upon machine learning pipelines and hand-engineered feature sets. Results show that our model outperforms existing context-aware approaches in a next query prediction setting. In addition to query suggestion, our architecture is general enough to be used in a variety of other applications.
A Network-Aware Approach for Searching As-You-Type in Social Media BIBAFull-Text 563-572
  Paul Lagrée; Bogdan Cautis; Hossein Vahabi
We present in this paper a novel approach for as-you-type top-k keyword search over social media. We adopt a natural "network-aware" interpretation for information relevance, by which information produced by users who are closer to the seeker is considered more relevant. In practice, this query model poses new challenges for effectiveness and efficiency in online search, even when a complete query is given as input in one keystroke. This is mainly because it requires a joint exploration of the social space and classic IR indexes such as inverted lists. We describe a memory-efficient and incremental prefix-based retrieval algorithm, which also exhibits an anytime behavior, allowing to output the most likely answer within any chosen running-time limit. We evaluate it through extensive experiments for several applications and search scenarios, including searching for posts in micro-blogging (Twitter and Tumblr), as well as searching for businesses based on reviews in Yelp. They show that our solution is effective in answering real-time as-you-type searches over social media.

Session 3D: Microblogs

Improving Microblog Retrieval with Feedback Entity Model BIBAFull-Text 573-582
  Feifan Fan; Runwei Qiang; Chao Lv; Jianwu Yang
When searching over the microblogging, users prefer using queries including terms that represent some specific entities. Meanwhile, tweets, though limited within 140 characters, are often generated with one or more entities. Entities, as an important part of tweets, usually convey rich information for modeling relevance from new perspectives. In this paper, we propose a feedback entity model and integrate it into an adaptive language modeling framework in order to improve the retrieval performance. The feedback entity model is estimated with the latest entity-associated tweets based upon a regularized maximum likelihood criterion. More specifically, we assume that the entity-associated tweets are generated by a mixture model, which consists of the entity model, the domain-specific language model and the collection language model. Experimental results on two public Text Retrieval Conference (TREC) Twitter corpora demonstrate the significant superiority of our approach over the state-of-the-art baselines.
Extracting Situational Information from Microblogs during Disaster Events: a Classification-Summarization Approach BIBAFull-Text 583-592
  Koustav Rudra; Subham Ghosh; Niloy Ganguly; Pawan Goyal; Saptarshi Ghosh
Microblogging sites like Twitter have become important sources of real-time information during disaster events. A significant amount of valuable situational information is available in these sites; however, this information is immersed among hundreds of thousands of tweets, mostly containing sentiments and opinion of the masses, that are posted during such events. To effectively utilize microblogging sites during disaster events, it is necessary to (i) extract the situational information from among the large amounts of sentiment and opinion, and (ii) summarize the situational information, to help decision-making processes when time is critical. In this paper, we develop a novel framework which first classifies tweets to extract situational information, and then summarizes the information. The proposed framework takes into consideration the typicalities pertaining to disaster events where (i) the same tweet often contains a mixture of situational and non-situational information, and (ii) certain numerical information, such as number of casualties, vary rapidly with time, and thus achieves superior performance compared to state-of-the-art tweet summarization approaches.
Profession-Based Person Search in Microblogs: Using Seed Sets to Find Journalists BIBAFull-Text 593-602
  Mossaab Bagdouri; Douglas W. Oard
We introduce the problem of searching for professionals in microblogging platforms. We describe a study of how a group of professional journalists with some common characteristics (e.g., works in a specific language, belongs to certain region, or specializes in a particular media) can be found. Starting from seed sets of different sizes, social network features and profile content features are used to find additional journalists. The results show that combining the social network features of the reciprocated mentions and a bidirectional friend/follower graph provides a signal stronger than either of them taken independently, that both social network and profile content features are useful, and that profile content features are able to find larger numbers of less prominent journalists. We apply our methods to find the Twitter accounts of British and Arab journalists.

Session 3E: Graph-Based Analysis

Learning Entity Types from Query Logs via Graph-Based Modeling BIBAFull-Text 603-612
  Jingyuan Zhang; Luo Jie; Altaf Rahman; Sihong Xie; Yi Chang; Philip S. Yu
Entities (e.g., person, movie or place) play an important role in real-world applications and learning entity types has attracted much attention in recent years. Most conventional automatic techniques use large corpora, such as news articles, to learn types of entities. However, such text corpora focus on general knowledge about entities in an objective way. Hence, it is difficult to satisfy those users with specific and personalized needs for an entity. Recent years have witnessed an explosive expansion in the mining of search query logs, which contain billions of entities. The word patterns and click-throughs in search logs are not found in text corpora, thus providing a complemental source for discovering entity types based on user behaviors. In this paper, we study the problem of learning entity types from search query logs and address the following challenges: (1) queries are short texts, and information related to entities is usually very sparse; (2) large amounts of irrelevant information exists in search logs, bringing noise in detecting entity types. In this paper, we first model query logs using a bipartite graph with entities and their auxiliary information, such as contextual words and clicked URLs. Then we propose a graph-based framework called ELP (Ensemble framework based on Lable Propagation) to simultaneously learn the types of both entities and auxiliary signals. In ELP, two separate strategies are designed to fix the problems of sparsity and noise in query logs. Extensive empirical studies are conducted on real search logs to evaluate the effectiveness of the proposed ELP framework.
Collaborative Prediction for Multi-entity Interaction With Hierarchical Representation BIBAFull-Text 613-622
  Qiang Liu; Shu Wu; Liang Wang
With the rapid growth of Internet applications, there are more and more entities in interaction scenarios, and thus collaborative prediction for multi-entity interaction is becoming a significant problem. The state-of-the-art methods, e.g., tensor factorization and factorization machine, predict multi-entity interaction based on calculating the similarity among all entities. However, these methods are usually not able to reveal the joint characteristics of entities in the interaction. Besides, some methods may succeed in one specific application, but they can not be extended effectively for other applications or interaction scenarios with more entities. In this work, we propose a Hierarchical Interaction Representation (HIR) model, which models the mutual action among different entities as a joint representation. We generate the interaction representation of two entities via tensor multiplication, which is preformed iteratively to construct a hierarchical structure among all entities. Moreover, we employ several hidden layers to reveal the underlying properties of this interaction and enhance the model performance. After generating final representation, the prediction can be calculated using a variety of machine learning methods according to different tasks (i.e., linear regression for regression tasks, pair-wise ranking for ranking tasks and logistic regression for classification tasks). Experimental results show that our proposed HIR model yields significant improvements over the competitive compared methods in four different application scenarios (i.e., general recommendation, context-aware recommendation, latent collaborative retrieval and click-through rate prediction).
Learning to Represent Knowledge Graphs with Gaussian Embedding BIBAFull-Text 623-632
  Shizhu He; Kang Liu; Guoliang Ji; Jun Zhao
The representation of a knowledge graph (KG) in a latent space recently has attracted more and more attention. To this end, some proposed models (e.g., TransE) embed entities and relations of a KG into a "point" vector space by optimizing a global loss function which ensures the scores of positive triplets are higher than negative ones. We notice that these models always regard all entities and relations in a same manner and ignore their (un)certainties. In fact, different entities and relations may contain different certainties, which makes identical certainty insufficient for modeling. Therefore, this paper switches to density-based embedding and propose KG2E for explicitly modeling the certainty of entities and relations, which learn the representations of KGs in the space of multi-dimensional Gaussian distributions. Each entity/relation is represented by a Gaussian distribution, where the mean denotes its position and the covariance (currently with diagonal covariance) can properly represent its certainty. In addition, compared with the symmetric measures used in point-based methods, we employ the KL-divergence for scoring triplets, which is a natural asymmetry function for effectively modeling multiple types of relations. We have conducted extensive experiments on link prediction and triplet classification with multiple benchmark datasets (WordNet and Freebase). Our experimental results demonstrate that our method can effectively model the (un)certainties of entities and relations in a KG, and it significantly outperforms state-of-the-art methods (including TransH and TransR).

Session 3F: Classification 1

Associative Classification with Statistically Significant Positive and Negative Rules BIBAFull-Text 633-642
  Jundong Li; Osmar Zaiane
Rule-based classifier has shown its popularity in building many decision support systems such as medical diagnosis and financial fraud detection. One major advantage is that the models are human understandable and can be edited. Associative classifiers, as an extension of rule-based classifiers, use association rules to associate attributes with class labels. A delicate issue of associative classifiers is the need for subtle thresholds: minimum support and minimum confidence. Without prior knowledge, it could be difficult to choose the proper thresholds, and the discovered rules within the support-confidence framework are not statistically significant, i.e., inclusion of noisy rules and exclusion of valuable rules. Besides, most associative classifiers proposed so far, are built with only positive association rules. Negative rules, however, are also able to provide valuable information to discriminate between classes. To solve the above mentioned problems, we propose a novel associative classifier which is built upon both positive and negative classification association rules that show statistically significant dependencies. Experimental results on real-world datasets show that our method achieves competitive or even better performance than well-known rule-based and associative classifiers in terms of both classification accuracy and computational efficiency.
A Min-Max Optimization Framework For Online Graph Classification BIBAFull-Text 643-652
  Peng Yang; Peilin Zhao
Traditional online learning for graph node classification adapts graph regularization into ridge regression, which may not be suitable when data is adversarially generated. To solve this issue, we propose a more general min-max optimization framework for online graph node classification. The derived online algorithm can achieve a min-max regret compared with the optimal linear model found offline. However, this algorithm assumes that the label is provided for every node, while label is scare and labeling is usually either too time-consuming or expensive in real-world applications. To save labeling effort, we propose a novel confidence-based query approach to prioritize the informative labels. Our theoretical result shows that an online algorithm learning on these selected labels can achieve comparable mistake bound with the fully-supervised online counterpart. To take full advantage of these labels, we propose an aggressive algorithm, which can update the model even if no error occurs. Theoretical analysis shows that the mistake bound of the proposed method, thanks to the aggressive update trials, is better than conservative competitor in expectation. We finally empirically evaluate it on several real-world graph databases. Encouraging experimental results further demonstrate the effectiveness of our method.
An Inference Approach to Basic Level of Categorization BIBAFull-Text 653-662
  Zhongyuan Wang; Haixun Wang; Ji-Rong Wen; Yanghua Xiao
Humans understand the world by classifying objects into an appropriate level of categories. This process is often automatic and subconscious. Psychologists and linguists call it as Basic-level Categorization (BLC). BLC can benefit lots of applications such as knowledge panel, advertising and recommendation. However, how to quantify basic-level concepts is still an open problem. Recently, much work focuses on constructing knowledge bases or semantic networks from web scale text corpora, which makes it possible for the first time to analyze computational approaches for deriving BLC. In this paper, we introduce a method based on typicality and PMI for BLC. We compare it with a few existing measures such as NPMI and commute time to understand its essence, and conduct extensive experiments to show the effectiveness of our approach. We also give a real application example to show how BLC can help sponsored search.

Keynote Address II

Making Sense of Spatial Trajectories BIBAFull-Text 671-672
  Xiaofang Zhou; Kai Zheng; Hoyoung Jueng; Jiajie Xu; Shazia Sadiq
Spatial trajectory data is widely available today. Over a sustained period of time, trajectory data has been collected from numerous GPS devices, smartphones, sensors and social media applications. Daily increases of real-time trajectory data have also been phenomenal in recent years. More and more new applications have emerged to derive business values from both trajectory data warehouses and real-time trajectory data. Due to their very large volumes, their nature of streaming, their highly variable levels of data quality, as well as many possible links with other types of data, making sense of spatial trajectory data becomes one of the crucial areas for big data analytics. In this paper we will present a review of the extensive work in spatiotemporal data management and trajectory mining, and discuss new challenges and new opportunities in the context of new applications, focusing on recent advances in trajectory data management and trajectory mining from their foundations to high performance processing with modern computing infrastructure.

Session 4A: Location-Based Services

ReverseCloak: Protecting Multi-level Location Privacy over Road Networks BIBAFull-Text 673-682
  Chao Li; Balaji Palanisamy
With advances in sensing and positioning technology, fueled by the ubiquitous deployment of wireless networks, location-aware computing has become a fundamental model for offering a wide range of life enhancing services. However, the ability to locate users and mobile objects opens doors for new threats -- the intrusion of location privacy. Location anonymization refers to the process of perturbing the exact location of users as a cloaking region such that a user's location becomes indistinguishable from the location of a set of other users. A fundamental limitation of existing location anonymization techniques is that location information once perturbed to provide a certain anonymity level cannot be reversed to reduce anonymity or the degree of perturbation. This is especially a serious limiting factor in multi-level privacy-controlled scenarios where different users of the location information have different levels of access. This paper presents ReverseCloak, a new class of reversible location cloaking mechanisms that effectively support multi-level location privacy, allowing selective de-anonymization of the cloaking region to reduce the granularity of the perturbed location when suitable access credentials are provided. We evaluate the ReverseCloak techniques through extensive experiments on realistic road network traces generated by GTMobiSim. Our experiments show that the proposed techniques are efficient, scalable and provide the required level of privacy.
GLUE: a Parameter-Tuning-Free Map Updating System BIBAFull-Text 683-692
  Hao Wu; Chuanchuan Tu; Weiwei Sun; Baihua Zheng; Hao Su; Wei Wang
Map data are widely used in mobile services, but most maps might not be complete. Updating the map automatically is an important problem because road networks are frequently changed with the development of the city. This paper studies the problem of recovering missing road segments via GPS trajectories, especially low sampled data. Our approach takes the GPS noise into consideration and proposes an effective self-adaptive algorithm. Besides, we propose theoretical models behind all the important parameters to enable self-adaptive parameter setting. To the best of our knowledge, this is the first work that addresses the parameter setting issue successfully to make sure our approach is free of parameter-tuning. In addition, we also propose a quantitative evaluation method for map updating problem. The result shows our algorithm has a much better performance than the existing approaches.
A Cost-based Method for Location-Aware Publish/Subscribe Services BIBAFull-Text 693-702
  Minghe Yu; Guoliang Li; Jianhua Feng
Location-based services have attracted significant attentions from both industry and academia, thanks to modern smartphones and mobile Internet. To provide users with gratifications, location-aware publish/subscribe has been recently proposed, which delivers spatio-textual messages of publishers to subscribers whose registered spatio-textual subscriptions are relevant to the messages. Since there could be large numbers of subscriptions, it is necessary to devise an efficient location-aware publish/subscribe system to enable instant message filtering. To this end, in this paper we propose two novel indexing structures, mbrtrie and PKQ. Using the indexes, we devise two filtering algorithms to support fast message filtering. We analyze the complexities of the two filtering algorithms and develop a cost-based model to judiciously select the best filtering algorithm for different scenarios. The experimental results show that our method achieves high performance and significantly outperforms the baseline approaches.
Probabilistic Forecasts of Bike-Sharing Systems for Journey Planning BIBAFull-Text 703-712
  Nicolas Gast; Guillaume Massonnet; Daniel Reijsbergen; Mirco Tribastone
We study the problem of making forecasts about the future availability of bicycles in stations of a bike-sharing system (BSS). This is relevant in order to make recommendations guaranteeing that the probability that a user will be able to make a journey is sufficiently high. To do this we use probabilistic predictions obtained from a queuing theoretical time-inhomogeneous model of a BSS. The model is parametrized and successfully validated using historical data from the Vélib' BSS of the City of Paris.
   We develop a critique of the standard root-mean-square-error (RMSE), commonly adopted in the bike-sharing research as an index of the prediction accuracy, because it does not account for the stochasticity inherent in the real system. Instead we introduce a new metric based on scoring rules. We evaluate the average score of our model against classical predictors used in the literature. We show that these are outperformed by our model for prediction horizons of up to a few hours. We also discuss that, in general, measuring the current number of available bikes is only relevant for prediction horizons of up to few hours.

Session 4B: Query Explanation

Efficient Computation of Polynomial Explanations of Why-Not Questions BIBAFull-Text 713-722
  Nicole Bidoit; Melanie Herschel; Aikaterini Tzompanaki
Answering a Why-Not question consists in explaining why a query result does not contain some expected data, called missing answers. This paper focuses on processing Why-Not questions in a query-based approach that identifies the culprit query components. Our first contribution is a general definition of a Why-Not explanation by means of a polynomial. Intuitively, the polynomial provides all possible explanations to explore in order to recover the missing answers, together with an estimation of the number of recoverable answers. Moreover, this formalism allows us to represent Why-Not explanations in a unified way for extended relational models with probabilistic or bag semantics. We further present an algorithm to efficiently compute the polynomial for a given Why-Not question. An experimental evaluation demonstrates the practicality of the solution both in terms of efficiency and explanation quality, compared to existing algorithms.
Interruption-Sensitive Empty Result Feedback: Rethinking the Visual Query Feedback Paradigm for Semistructured Data BIBAFull-Text 723-732
  Sourav S. Bhowmick; Curtis Dyreson; Byron Choi; Min-Hwee Ang
The usability of visual querying schemes for tree and graph-structured data can be greatly enhanced by providing feedback during query construction, but feedback at inopportune times can hamper query construction. In this paper, we rethink the traditional way of providing feedback. We describe a novel vision of interruption-sensitive query feedback where relevant notifications are delivered quickly but at an appropriate moment when the mental workload of the user is low. Though we focus on one class of query feedback, namely empty result detection, where a user is notified when a partially constructed visual query yields an empty result, our new paradigm is applicable to other kinds of feedback. We present a framework called iSERF that bridges the classical database problem of empty-result detection with intelligent notification management from the domains of HCI and psychology. Instead of immediate notification, iSERF considers the structure of query formulation tasks and breakpoints when reasoning about when to notify the user. We present an HCI-inspired model to quantify the performance bounds that iSERF must abide by for checking for an empty result in order to ensure interruption-sensitive notification at optimal breakpoints. We implement this framework in the context of visual XML query formulation and highlight its effectiveness empirically.
Implementing Query Completeness Reasoning BIBAFull-Text 733-742
  Werner Nutt; Sergey Paramonov; Ognjen Savkovic
Data completeness is commonly regarded as one of the key aspects of data quality. With this paper we make two main contributions: (i) we develop techniques to reason about the completeness of a query answer over a partially complete database, taking into account constraints that hold over the database, and (ii) we implement them by an encoding into logic programming paradigms. As constraints we consider primary and foreign keys as well as finite domain constraints. In this way we can identify more situations in which a query is complete than was possible with previous work. For each combination of constraints, we establish characterizations of the completeness reasoning and we show how to translate them into logic programs. As a proof of concept we ran our encodings against test cases that capture characteristics of a real-world scenario.
Towards Scalable and Complete Query Explanation with OWL 2 EL Ontologies BIBAFull-Text 743-752
  Zhe Wang; Mahsa Chitsaz; Kewen Wang; Jianfeng Du
Ontology-mediated data access and management systems are rapidly emerging. Besides standard query answering, there is also a need for such systems to be coupled with explanation facilities, in particular to explain missing query answers (i.e. desired answers of a query which are not derivable from the given ontology and data). This support is highly demanded for debugging and maintenance of big data, and both theoretical results and algorithms proposed. However, existing query explanation algorithms either cannot scale over relative large data sets or are not guaranteed to compute all desired explanations. To the best of our knowledge, no existing algorithm can efficiently and completely explain conjunctive queries (CQs) w.r.t. ELH1 ontologies. In this paper, we present a hybrid approach to achieve this. An implementation of the proposed query explanation algorithm has been developed using an off-the-shelf Prolog engine and a datalog engine. Finally, the system is evaluated over practical ontologies. Experimental results show that our system scales over large data sets.

Session 4C: Crowds

Crowdsourcing Pareto-Optimal Object Finding By Pairwise Comparisons BIBAFull-Text 753-762
  Abolfazl Asudeh; Gensheng Zhang; Naeemul Hassan; Chengkai Li; Gergely V. Zaruba
This is the first study of crowdsourcing Pareto-optimal object finding over partial orders and by pairwise comparisons, which has applications in public opinion collection, group decision making, and information exploration. Departing from prior studies on crowdsourcing skyline and ranking queries, it considers the case where objects do not have explicit attributes and preference relations on objects are strict partial orders. The partial orders are derived by aggregating crowdsourcers' responses to pairwise comparison questions. The goal is to find all Pareto-optimal objects by the fewest possible questions. It employs an iterative question-selection framework. Guided by the principle of eagerly identifying non-Pareto optimal objects, the framework only chooses candidate questions which must satisfy three conditions. This design is both sufficient and efficient, as it is proven to find a short terminal question sequence. The framework is further steered by two ideas -- macro-ordering and micro-ordering. By different micro-ordering heuristics, the framework is instantiated into several algorithms with varying power in pruning questions. Experiment results using both real crowdsourcing marketplace and simulations exhibited not only orders of magnitude reductions in questions when compared with a brute-force approach, but also close-to-optimal performance from the most efficient instantiation.
Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics BIBAFull-Text 763-772
  Alexey Drutsa; Anna Ufliand; Gleb Gusev
Online controlled experiments, e.g., A/B testing, is the state-of-the-art approach used by modern Internet companies to improve their services based on data-driven decisions. The most challenging problem is to define an appropriate online metric of user behavior, so-called Overall Evaluation Criterion (OEC), which is both interpretable and sensitive. A typical OEC consists of a key metric and an evaluation statistic. Sensitivity of an OEC to the treatment effect of an A/B test is measured by a statistical significance test. We introduce the notion of Overall Acceptance Criterion (OAC) that includes both the components of an OEC and a statistical significance test. While existing studies on A/B tests are mostly concentrated on the first component of an OAC, its key metric, we widely study the two latter ones by comparison of several statistics and several statistical tests with respect to user engagement metrics on hundreds of A/B experiments run on real users of Yandex. We discovered that the application of the state-of-the-art Student's t-tests to several main user engagement metrics may lead to an underestimation of the false-positive rate by an order of magnitude. We investigate both well-known and novel techniques to overcome this issue in practical settings. At last, we propose the entropy and the quantiles as novel OECs that reflect the diversity and extreme cases of user engagement.
Generalized Team Draft Interleaving BIBAFull-Text 773-782
  Eugene Kharitonov; Craig Macdonald; Pavel Serdyukov; Iadh Ounis
Interleaving is an online evaluation method that compares two ranking functions by mixing their results and interpreting the users' click feedback. An important property of an interleaving method is its sensitivity, i.e. the ability to obtain reliable comparison outcomes with few user interactions. Several methods have been proposed so far to improve interleaving sensitivity, which can be roughly divided into two areas: (a) methods that optimize the credit assignment function (how the click feedback is interpreted), and (b) methods that achieve higher sensitivity by controlling the interleaving policy (how often a particular interleaved result page is shown).
   In this paper, we propose an interleaving framework that generalizes the previously studied interleaving methods in two aspects. First, it achieves a higher sensitivity by performing a joint data-driven optimization of the credit assignment function and the interleaving policy. Second, we formulate the framework to be general w.r.t. the search domain where the interleaving experiment is deployed, so that it can be applied in domains with grid-based presentation, such as image search. In order to simplify the optimization, we additionally introduce a stratified estimate of the experiment outcome. This stratification is also useful on its own, as it reduces the variance of the outcome and thus increases the interleaving sensitivity.
   We perform an extensive experimental study using large-scale document and image search datasets obtained from a commercial search engine. The experiments show that our proposed framework achieves marked improvements in sensitivity over effective baselines on both datasets.
Exploiting Document Content for Efficient Aggregation of Crowdsourcing Votes BIBAFull-Text 783-790
  Martin Davtyan; Carsten Eickhoff; Thomas Hofmann
The use of crowdsourcing for document relevance assessment has been found to be a viable alternative to corpus annotation by highly trained experts. The question of quality control is a recurring challenge that is often addressed by aggregating multiple individual assessments of the same topic-document pair from independent workers. In the past, such aggregation schemes have been weighted or filtered by estimates of worker reliability based on a multitude of behavioral features. In this paper, we propose an alternative approach by relying on document information. Inspired by the clustering hypothesis of information retrieval, we assume textually similar documents to show similar degrees of relevance towards a given topic. Following up on this intuition, we propagate crowd-generated relevance judgments to similar documents, effectively smoothing the distribution of relevance labels across the similarity space.
   Our experiments are based on TREC Crowdsourcing Track data and show that even simple aggregation methods utilizing document similarity information significantly improve over majority voting in terms of accuracy as well as cost efficiency.

Session 4D: Optimization

L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning BIBAFull-Text 791-800
  David C. Anastasiu; George Karypis
The k-nearest neighbor graph is often used as a building block in information retrieval, clustering, online advertising, and recommender systems algorithms. The complexity of constructing the exact k-nearest neighbor graph is quadratic on the number of objects that are compared, and most existing methods solve the problem approximately. We present L2Knng, an efficient algorithm that finds the exact cosine similarity k-nearest neighbor graph for a set of sparse high-dimensional objects. Our algorithm quickly builds an approximate solution to the problem, identifying many of the most similar neighbors, and then uses theoretic bounds on the similarity of two vectors, based on the L2-norm of part of the vectors, to find each object's exact k-neighborhood. We perform an extensive evaluation of our algorithm, comparing against both exact and approximate baselines, and demonstrate the efficiency of our method across a variety of real-world datasets and neighborhood sizes. Our approximate and exact L2Knng variants compute the k-nearest neighbor graph up to an order of magnitude faster than their respective baselines.
Lingo: Linearized Grassmannian Optimization for Nuclear Norm Minimization BIBAFull-Text 801-809
  Qian Li; Wenjia Niu; Gang Li; Yanan Cao; Jianlong Tan; Li Guo
As a popular heuristic to the matrix rank minimization problem, nuclear norm minimization attracts intensive research attentions. Matrix factorization based algorithms can reduce the expensive computation cost of SVD for nuclear norm minimization. However, most matrix factorization based algorithms fail to provide the theoretical guarantee for convergence caused by their non-unique factorizations. This paper proposes an efficient and accurate Linearized Grassmannian Optimization (Lingo) algorithm, which adopts matrix factorization and Grassmann manifold structure to alternatively minimize the subproblems. More specially, linearization strategy makes the auxiliary variables unnecessary and guarantees the close-form solution for low per-iteration complexity. Lingo then converts linearized objective function into a nuclear norm minimization over Grassmannian manifold, which could remedy the non-unique of solution for the low-rank matrix factorization. Extensive comparison experiments demonstrate the accuracy and efficiency of Lingo algorithm. The global convergence of Lingo is guaranteed with theoretical proof, which also verifies the effectiveness of Lingo.
Deep Collaborative Filtering via Marginalized Denoising Auto-encoder BIBAFull-Text 811-820
  Sheng Li; Jaya Kawale; Yun Fu
Collaborative filtering (CF) has been widely employed within recommender systems to solve many real-world problems. Learning effective latent factors plays the most important role in collaborative filtering. Traditional CF methods based upon matrix factorization techniques learn the latent factors from the user-item ratings and suffer from the cold start problem as well as the sparsity problem. Some improved CF methods enrich the priors on the latent factors by incorporating side information as regularization. However, the learned latent factors may not be very effective due to the sparse nature of the ratings and the side information. To tackle this problem, we learn effective latent representations via deep learning. Deep learning models have emerged as very appealing in learning effective representations in many applications. In particular, we propose a general deep architecture for CF by integrating matrix factorization with deep feature learning. We provide a natural instantiations of our architecture by combining probabilistic matrix factorization with marginalized denoising stacked auto-encoders. The combined framework leads to a parsimonious fit over the latent features as indicated by its improved performance in comparison to prior state-of-art models over four large datasets for the tasks of movie/book recommendation and response prediction.
Improving Latent Factor Models via Personalized Feature Projection for One Class Recommendation BIBAFull-Text 821-830
  Tong Zhao; Julian McAuley; Irwin King
Latent Factor models, which transform both users and items into the same latent feature space, are one of the most successful and ubiquitous models in recommender systems. Most existing models in this paradigm define both users' and items' latent factors to be of the same size and use an inner product to represent a user's "compatibility" with an item. Intuitively, users' factors encode "preferences" while item factors encode "properties", so that the inner product encodes how well an item matches a user's preferences. However, a user's opinion of an item may be more complex, for example each dimension of each user's opinion may depend on a combination of multiple item factors simultaneously. Thus it may be better to view each dimension of a user's preference as a personalized projection of an item's properties so that the preference model can capture complex relationships between items' properties and users' preferences.
   Therefore, in this paper we propose a novel personalized feature projection method to model users' preferences over items. Specifically, for each user, we define a personalized projection matrix, which takes the place of user-specific factors from existing models. This matrix describes a mapping between items' factors and users' preferences in order to build personalized preference models for each user and item. The proposed personalized feature projection method is quite general and existing latent factor models, for example, can be cast as a special case. We present three objective functions to optimize predictions in the form of ranked lists of users' preferences over items, and demonstrate how each can be used to improve one-class recommendation performance. Experiments are conducted on four real-world datasets and our results show that our personalized feature projection method outperforms several state-of-the-art methods on various evaluation metrics.

Session 4E: Social Networks 2

Node Immunization over Infectious Period BIBAFull-Text 831-840
  Chonggang Song; Wynne Hsu; Mong Li Lee
Locating nodes to immunize in computer/social networks to control the spread of virus or rumors has become an important problem. In real world contagions, nodes may get infected by external sources when the propagation is underway. While most studies formalize the problem in a setting where contagion starts at one time point, we model a more realistic situation where there are likely to be many breakouts of contagions over a time window. We call this the node immunization over infectious period (NIIP) problem. We show that the NIIP problem is NP-hard and remains so even in directed acyclic graphs. We propose a NIIP algorithm to select $k$ nodes to immunize over a time period. Simulation is performed to estimate a good distribution of $k$ over the time period. For each time point, the NIIP algorithm will make decisions which nodes to immunize given the estimated value of $k$ for that time point. Experiments show that the proposed NIIP algorithm outperform the state-of-the-art algorithms in terms of both effectiveness and efficiency.
Enterprise Social Link Recommendation BIBAFull-Text 841-850
  Jiawei Zhang; Yuanhua Lv; Philip Yu
Many companies have started to use Enterprise Social Networks (ESNs), such as Yammer, to facilitate collaboration and communication amongst their employees in the business context. Social link recommendation, which finds and suggests whom one wants to connect with in a company, is crucial for ESNs to promote their usages. Although link recommendation has been studied extensively in external social networks (e.g., Facebook and Twitter), it has not been addressed in ESNs. In this paper, we study this novel problem. Social link recommendation in ESNs is significantly different from that in external social networks, and also has unique challenges: (1) people usually socialize differently in enterprise than in their personal life, but users' social behaviors in enterprise have not been well explored, and (2) there is important business information available in ESNs under the enterprise context, e.g., a company?s organizational chart, but how to exploit it for link recommendation is still an open problem. To this end, we mine not only the social graph and user-generated content in ESNs, but also the company's organizational chart, to model enterprise user social behaviors. We develop a supervised link recommendation algorithm using a large scale enterprise social network based on Yammer (with over 100k users), which shows that the proposed techniques perform effectively. Moreover, we find that both the social graph and the organizational chart are complementary to each other for link recommendation in ESNs.
Exploiting Game Theoretic Analysis for Link Recommendation in Social Networks BIBAFull-Text 851-860
  Tong Zhao; H. Vicky Zhao; Irwin King
The popularity of Online Social Networks (OSNs) has attracted great research interests in different fields. In Economics, researchers use game theory to analyze the mechanism of network formation, which is called Network Formation Game. While in Computer Science, much effort has been done in building machine learning models to predict future or missing links. However, there are few works considering how to combine game theoretic analysis and machine learning models. Therefore, in this paper, we study the problem of Exploiting Game Theoretic Analysis for Link Recommendation in Social Networks. Our goal is to improve link recommendation accuracy via leveraging the power of Network Formation Games into machine learning models. We present two different approaches to solve this problem. First, we propose a three-phase method that straightforwardly combines game theoretic analysis with machine learning models. Second, we develop a unified model, BPRLGT, that incorporates Network Formation Game into a Bayesian ranking framework for link recommendation. Specifically, BPRLGT takes advantage of network topology and we design a game theoretic sampling approach to improve its training process. The experiments are conducted on four real world datasets and the results on all datasets demonstrate that both our proposed three-phase method and the unified ranking model outperform the baseline methods.
Extracting Interest Tags for Non-famous Users in Social Network BIBAFull-Text 861-870
  Wei He; Hongyan Liu; Jun He; Shu Tang; Xiaoyong Du
Inferring interests of users in social network is important for many applications such as personalized search, recommender systems and online advertising. Most previous studies inferred users' interests based on text posted in social network, which is usually not related to their interests. In this paper, we propose a modified topic model, Bi-Labeled LDA with a term weighting scheme, to extract interest tags for users in social network. The proposed model utilize only users' relationship information without requirement for text information, and incorporates supervision into traditional LDA. Specifically, we introduce method to extract tags for non-famous user through their relationship with famous users in Twitter, and study why a non-famous user follows famous users simultaneously. Comparison with state-of-the-art methods on real dataset shows that our method is far more superior in terms of precision and recall of the extracted tag set, and also more applicable for many personalized applications. Besides, we find that a reasonable term weighting scheme can actually improve the performance further.

Session 4F: Matrix Factorization

Robust Capped Norm Nonnegative Matrix Factorization: Capped Norm NMF BIBAFull-Text 871-880
  Hongchang Gao; Feiping Nie; Weidong Cai; Heng Huang
As an important matrix factorization model, Nonnegative Matrix Factorization (NMF) has been widely used in information retrieval and data mining research. Standard Nonnegative Matrix Factorization is known to use the Frobenius norm to calculate the residual, making it sensitive to noises and outliers. It is desirable to use robust NMF models for practical applications, in which usually there are many data outliers. It has been studied that the 2,1, or 1-norm can be used for robust NMF formulations to deal with data outliers. However, these alternatives still suffer from the extreme data outliers. In this paper, we present a novel robust capped norm orthogonal Nonnegative Matrix Factorization model, which utilizes the capped norm for the objective to handle these extreme outliers. Meanwhile, we derive a new efficient optimization algorithm to solve the proposed non-convex non-smooth objective. Extensive experiments on both synthetic and real datasets show our proposed new robust NMF method consistently outperforms related approaches.
MF-Tree: Matrix Factorization Tree for Large Multi-Class Learning BIBAFull-Text 881-890
  Lei Liu; Pang-Ning Tan; Xi Liu
Many big data applications require accurate classification of objects into one of possibly thousands or millions of categories. Such classification tasks are challenging due to issues such as class imbalance, high testing cost, and model interpretability problems. To overcome these challenges, we propose a novel hierarchical learning method known as MF-Tree to efficiently classify data sets with large number of classes while simultaneously inducing a taxonomy structure that captures relationships among the classes. Unlike many other existing hierarchical learning methods, our approach is designed to optimize a global objective function. We demonstrate the equivalence between our proposed regularized loss function and the Hilbert-Schmidt Independence Criterion (HSIC). The latter has a nice additive property, which allows us to decompose the multi-class learning problem into hierarchical binary classification tasks. To improve its training efficiency, an approximate algorithm for inducing MF-Tree is also proposed. We performed extensive experiments to compare MF-Tree against several state-of-the-art algorithms and showed both its effectiveness and efficiency when applied to real-world data sets.
GraRep: Learning Graph Representations with Global Structural Information BIBAFull-Text 891-900
  Shaosheng Cao; Wei Lu; Qiongkai Xu
In this paper, we present {GraRep}, a novel model for learning vertex representations of weighted graphs. This model learns low dimensional vectors to represent vertices appearing in a graph and, unlike existing work, integrates global structural information of the graph into the learning process. We also formally analyze the connections between our work and several previous research efforts, including the DeepWalk model of Perozzi et al. as well as the skip-gram model with negative sampling of Mikolov et al.
   We conduct experiments on a language network, a social network as well as a citation network and show that our learned global representations can be effectively used as features in tasks such as clustering, classification and visualization. Empirical results demonstrate that our representation significantly outperforms other state-of-the-art methods in such tasks.
Context-Adaptive Matrix Factorization for Multi-Context Recommendation BIBAFull-Text 901-910
  Tong Man; Huawei Shen; Junming Huang; Xueqi Cheng
Data sparsity is a long-standing challenge for recommender systems based on collaborative filtering. A promising solution for this problem is multi-context recommendation, i.e., leveraging users' explicit or implicit feedback from multiple contexts. In multi-context recommendation, various types of interactions between entities (users and items) are combined to alleviate data sparsity of a single context in a collective manner. Two issues are crucial for multi-context recommendation: (1) How to differentiate context-specific factors from entity-intrinsic factors shared across contexts? (2) How to capture the salient phenomenon that some entities are insensitive to contexts while others are remarkably context-dependent? Previous methods either do not consider context-specific factors, or assume that a context imposes equal influence on different entities, limiting their capability of combating data sparsity problem by taking full advantage of multiple contexts.
   In this paper, we propose a context-adaptive matrix factorization method for multi-context recommendation by simultaneously modeling context-specific factors and entity-intrinsic factors in a unified model. We learn an entity-intrinsic latent factor for every entity, and a context-specific latent factor for every entity in each context. Meanwhile, using a context-entity mixture parameter matrix we explicitly model the extent to which each context imposes influence on each entity. Experiments on two real scenarios demonstrate that our method consistently outperforms previous multi-context recommendation methods on all different sparsity levels.
   Such a consistent performance promotion forms the unique superiority of our method, enabling it to be a reliable model for multi-context recommendation.

Session 5A: Trips and Trajectories

Personalized Trip Recommendation with POI Availability and Uncertain Traveling Time BIBAFull-Text 911-920
  Chenyi Zhang; Hongwei Liang; Ke Wang; Jianling Sun
As location-based social network (LBSN) services become increasingly popular, trip recommendation that recommends a sequence of points of interest (POIs) to visit for a user emerges as one of many important applications of LBSNs. Personalized trip recommendation tailors to users' specific tastes by learning from past check-in behaviors of users and their peers. Finding the optimal trip that maximizes user's experiences for a given time budget constraint is an NP hard problem and previous solutions do not consider two practical and important constraints. One constraint is POI availability where a POI may be only available during a certain time window. Another constraint is uncertain traveling time where the traveling time between two POIs is uncertain. This work presents efficient solutions to personalized trip recommendation by incorporating these constraints to prune the search space. We evaluated the efficiency and effectiveness of our solutions on real life LBSN data sets.
Range Search on Uncertain Trajectories BIBAFull-Text 921-930
  Liming Zhan; Ying Zhang; Wenjie Zhang; Xiaoyang Wang; Xuemin Lin
The range search on trajectories is fundamental in a wide spectrum of applications such as environment monitoring and location based services. In practice, a large portion of spatio-temporal data in the above applications is generated with low sampling rate and the uncertainty arises between two subsequent observations of a moving object. To make sense of the uncertain trajectory data, it is critical to properly model the uncertainty of the trajectories and develop efficient range search algorithms on the new model. Assuming uncertain trajectories are modeled by the popular Markov Chains, in this paper we investigate the problem of range search on uncertain trajectories. In particular, we propose a general framework for range search on uncertain trajectories following the filtering-and-refinement paradigm where summaries of uncertain trajectories are constructed to facilitate the filtering process. Moreover, statistics based and partition based filtering techniques are developed to enhance the filtering capabilities. Comprehensive experiments demonstrate the effectiveness and efficiency of our new techniques.
Efficient Computation of Trips with Friends and Families BIBAFull-Text 931-940
  Tanzima Hashem; Sukarna Barua; Mohammed Eunus Ali; Lars Kulik; Egemen Tanin
A group of friends located at their working places may want to plan a trip to visit a shopping center, have dinner at a restaurant, watch a movie at a theater, and then finally return to their homes with the minimum total trip distance. For a group of spatially dispersed users a group trip planning (GTP) query returns points of interests (POIs) of different types such as a shopping center, a restaurant and a movie theater that minimize the aggregate trip distance for the group. The aggregate trip distance could be the sum or maximum of the trip distances of all users in the group, where the users travel from their source locations via the jointly visited POIs to their individual destinations. In this paper, we develop both optimal and approximation algorithms for GTP queries for both Euclidean space and road networks. Processing GTP queries in real time is a computational challenge as trips involve POIs of multiple types and computation of aggregate trip distances. We develop novel techniques to refine the POI search space for a GTP query based on geometric properties of ellipses, which in turn significantly reduces the number of aggregate trip distance computations. An extensive set of experiments on a real and synthetic datasets shows that our approach outperforms the most competitive approach on an average by three orders of magnitude in terms of processing time.
Sampling Big Trajectory Data BIBAFull-Text 941-950
  Yanhua Li; Chi-Yin Chow; Ke Deng; Mingxuan Yuan; Jia Zeng; Jia-Dong Zhang; Qiang Yang; Zhi-Li Zhang
The increasing prevalence of sensors and mobile devices has led to an explosive increase of the scale of spatio-temporal data in the form of trajectories. A trajectory aggregate query, as a fundamental functionality for measuring trajectory data, aims to retrieve the statistics of trajectories passing a user-specified spatio-temporal region. A large-scale spatio-temporal database with big disk-resident data takes very long time to produce exact answers to such queries. Hence, approximate query processing with a guaranteed error bound is a promising solution in many scenarios with stringent response-time requirements. In this paper, we study the problem of approximate query processing for trajectory aggregate queries. We show that it boils down to the distinct value estimation problem, which has been proven to be very hard with powerful negative results given that no index is built. By utilizing the well-established spatio-temporal index and introducing an inverted index to trajectory data, we are able to design random index sampling (RIS) algorithm to estimate the answers with a guaranteed error bound. To further improve system scalability, we extend RIS algorithm to concurrent random index sampling (CRIS) algorithm to process a number of trajectory aggregate queries arriving concurrently with overlapping spatio-temporal query regions. To demonstrate the efficacy and efficiency of our sampling and estimation methods, we applied them in a real large-scale user trajectory database collected from a cellular service provider in China. Our extensive evaluation results indicate that both RIS and CRIS outperform exhaustive search for single and concurrent trajectory aggregate queries by two orders of magnitude in terms of the query processing time, while preserving a relative error ratio lower than 10%, with only 1% search cost of the exhaustive search method.

Session 5B: Retrieval Enhancements 1

EsdRank: Connecting Query and Documents through External Semi-Structured Data BIBAFull-Text 951-960
  Chenyan Xiong; Jamie Callan
This paper presents EsdRank, a new technique for improving ranking using external semi-structured data such as controlled vocabularies and knowledge bases. EsdRank treats vocabularies, terms and entities from external data, as objects connecting query and documents. Evidence used to link query to objects, and to rank documents are incorporated as features between query-object and object-document correspondingly. A latent listwise learning to rank algorithm, Latent-ListMLE, models the objects as latent space between query and documents, and learns how to handle all evidence in a unified procedure from document relevance judgments. EsdRank is tested in two scenarios: Using a knowledge base for web search, and using a controlled vocabulary for medical search. Experiments on TREC Web Track and OHSUMED data show significant improvements over state-of-the-art baselines.
A Probabilistic Framework for Temporal User Modeling on Microblogs BIBAFull-Text 961-970
  Jitao Sang; Dongyuan Lu; Changsheng Xu
In social media, users have contributed enormous behavior data online which can be leveraged for user modeling and conduct personalized services. Temporal user modeling, which incorporates the timestamp of these behavior data and understands users' interest evolution, have attracted attention recently. With the recognition that user interests are vulnerable to transient events, many current temporal user modeling solutions propose to first identify the transient events and then consider the identified events into user behavior modeling. In this work, in the context of microblogs, we propose a unified probabilistic framework to simultaneously model the process of transient event detection and temporal user tweeting. The outputs of the framework include: (1) one long-term topic space spanning over general categories, (2) one short-term topic space for each time interval corresponding to the transient events, and (3) users' interest distributions over the long- and short-term topic spaces. Qualitative and quantitative experimental evaluation are conducted on a large-scale Twitter dataset, with more than 2 million users and 0.3 billion tweets. The promising results demonstrate the advantage of the proposed topic models.
Deriving Intensional Descriptions for Web Services BIBAFull-Text 971-980
  Maria Koutraki; Dan Vodislav; Nicoleta Preda
Many data providers make their data available through Web service APIs. In order to unleash the potential of these sources for intelligent applications, the data has to be combined across different APIs. However, due to the heterogeneity of schemas, the integration of different APIs remains a mainly manual task to date. In this paper, we model an API method as a view with binding patterns over a global RDF schema. We present an algorithm that can automatically infer the view definition of a method in the global schema. We also show how to compute transformation functions that can transform API call results into this schema. The key idea of our approach is to exploit the intersection of API call results with a knowledge base and with other call results. Our experiments on more than 50 real Web services show that we can automatically infer the schema with a precision of 81%-100%.
An Optimization Framework for Propagation of Query-Document Features by Query Similarity Functions BIBAFull-Text 981-990
  Maxim Zhukovskiy; Tsimafei Khatkevich; Gleb Gusev; Pavel Serdyukov
It is well known that a great number of query -- document features which significantly improve the quality of ranking for popular queries, however, do not provide any benefit for new or rare queries since there is typically not enough data associated with those queries that is required to reliably compute the values of those features. It is a common practice to propagate the values of such features from popular to tail queries, if the queries are similar according to some predefined query similarity functions. In this paper, we propose new algorithms that facilitate and increase the effectiveness of this propagation. Given a query similarity function and a query -- document relevance feature, we introduce two different approaches (linear weighting approach and tree-based approach) to learn a function of values of the similarity function and values of the feature for the similar queries w.r.t. the given document. The propagated value of the feature equals the value of the obtained function for the given query -- document pair.

Session 5C: Privacy

Rank Consistency based Multi-View Learning: A Privacy-Preserving Approach BIBAFull-Text 991-1000
  Han-Jia Ye; De-Chuan Zhan; Yuan Miao; Yuan Jiang; Zhi-Hua Zhou
Complex media objects are often described by multi-view feature groups collected from diverse domains or information channels. Multi-view learning, which attempts to exploit the relationship among multiple views to improve learning performance, has drawn extensive attention. It is noteworthy that in some real-world applications, features of different views may come from different private data repositories, and thus, it is desired to exploit view relationship with data privacy preserved simultaneously. Existing multi-view learning approaches such as subspace methods and pre-fusion methods are not applicable in this scenario because they need to access the whole features, whereas late-fusion approaches could not exploit information from other views to improve the individual view-specific learners. In this paper, we propose a novel multi-view learning framework which works in a hybrid fusion manner. Specifically, we convert predicted values of each view into an Accumulated Prediction Matrix (APM) with low-rank constraint enforced jointly by the multiple views. The joint low-rank constraint enables the view-specific learner to exploit other views to help improve the performance, without accessing the features of other views. Thus, the proposed RANC framework provides a privacy-preserving way for multi-view learning. Furthermore, we consider variants of solutions to achieve rank consistency and present corresponding methods for the optimization. Empirical investigations on real datasets show that the proposed method achieves state-of-the-art performance on various tasks.
Differentially Private Histogram Publication for Dynamic Datasets: an Adaptive Sampling Approach BIBAFull-Text 1001-1010
  Haoran Li; Li Xiong; Xiaoqian Jiang; Jinfei Liu
Differential privacy has recently become a de facto standard for private statistical data release. Many algorithms have been proposed to generate differentially private histograms or synthetic data. However, most of them focus on "one-time" release of a static dataset and do not adequately address the increasing need of releasing series of dynamic datasets in real time. A straightforward application of existing histogram methods on each snapshot of such dynamic datasets will incur high accumulated error due to the composibility of differential privacy and correlations or overlapping users between the snapshots. In this paper, we address the problem of releasing series of dynamic datasets in real time with differential privacy, using a novel adaptive distance-based sampling approach. Our first method, DSFT, uses a fixed distance threshold and releases a differentially private histogram only when the current snapshot is sufficiently different from the previous one, i.e., with a distance greater than a predefined threshold. Our second method, DSAT, further improves DSFT and uses a dynamic threshold adaptively adjusted by a feedback control mechanism to capture the data dynamics. Extensive experiments on real and synthetic datasets demonstrate that our approach achieves better utility than baseline methods and existing state-of-the-art methods.
WaveCluster with Differential Privacy BIBAFull-Text 1011-1020
  Ling Chen; Ting Yu; Rada Chirkova
WaveCluster is an important family of grid-based clustering algorithms that are capable of finding clusters of arbitrary shapes. In this paper, we investigate techniques to perform WaveCluster while ensuring differential privacy.
   Our goal is to develop a general technique for achieving differential privacy on WaveCluster that accommodates different wavelet transforms.
   We show that straightforward techniques based on synthetic data generation and introduction of random noise when quantizing the data, though generally preserving the distribution of data, often introduce too much noise to preserve useful clusters. We then propose two optimized techniques, PrivTHR and PrivTHRem, which can significantly reduce data distortion during two key steps of WaveCluster: the quantization step and the significant grid identification step. We conduct extensive experiments based on four datasets that are particularly interesting in the context of clustering, and show that PrivTHR and PrivTHRem achieve high utility when privacy budgets are properly allocated, conforming to our theoretical analysis.
Process-Driven Data Privacy BIBAFull-Text 1021-1030
  Weiyi Xia; Murat Kantarcioglu; Zhiyu Wan; Raymond Heatherly; Yevgeniy Vorobeychik; Bradley Malin
The quantity of personal data gathered by service providers via our daily activities continues to grow at a rapid pace. The sharing, and the subsequent analysis of, such data can support a wide range of activities, but concerns around privacy often prompt an organization to transform the data to meet certain protection models (e.g., k-anonymity or ε-differential privacy). These models, however, are based on simplistic adversarial frameworks, which can lead to both under- and over-protection. For instance, such models often assume that an adversary attacks a protected record exactly once. We introduce a principled approach to explicitly model the attack process as a series of steps. Specifically, we engineer a factored Markov decision process (FMDP) to optimally plan an attack from the adversary's perspective and assess the privacy risk accordingly. The FMDP captures the uncertainty in the adversary's belief (e.g., the number of identified individuals that match the de-identified data) and enables the analysis of various real world deterrence mechanisms beyond a traditional protection model, such as a penalty for committing an attack. We present an algorithm to solve the FMDP and illustrate its efficiency by simulating an attack on publicly accessible U.S. census records against a real identified resource of over 500,000 individuals in a voter registry. Our results demonstrate that while traditional privacy models commonly expect an adversary to attack exactly once per record, an optimal attack in our model may involve exploiting none, one, or more individuals in the pool of candidates, depending on context.

Session 5D: Data Streams

Unsupervised Feature Selection on Data Streams BIBAFull-Text 1031-1040
  Hao Huang; Shinjae Yoo; Shiva Prasad Kasiviswanathan
Massive data streams are continuously being generated from sources such as social media, broadcast news, etc., and typically these datapoints lie in high-dimensional spaces (such as the vocabulary space of a language). Timely and accurate feature subset selection in these massive data streams has important applications in model interpretation, computational/storage cost reduction, and generalization enhancement. In this paper, we introduce a novel unsupervised feature selection approach on data streams that selects important features by making only one pass over the data while utilizing limited storage. The proposed algorithm uses ideas from matrix sketching to efficiently maintain a low-rank approximation of the observed data and applies regularized regression on this approximation to identify the important features. We theoretically prove that our algorithm is close to an expensive offline approach based on global singular value decompositions. The experimental results on a variety of text and image datasets demonstrate the excellent ability of our approach to identify important features even in presence of concept drifts and also its efficiency over other popular scalable feature selection algorithms.
Unsupervised Streaming Feature Selection in Social Media BIBAFull-Text 1041-1050
  Jundong Li; Xia Hu; Jiliang Tang; Huan Liu
The explosive growth of social media sites brings about massive amounts of high-dimensional data. Feature selection is effective in preparing high-dimensional data for data analytics. The characteristics of social media present novel challenges for feature selection. First, social media data is not fully structured and its features are usually not predefined, but are generated dynamically. For example, in Twitter, slang words (features) are created everyday and quickly become popular within a short period of time. It is hard to directly apply traditional batch-mode feature selection methods to find such features. Second, given the nature of social media, label information is costly to collect. It exacerbates the problem of feature selection without knowing feature relevance. On the other hand, opportunities are also unequivocally present with additional data sources; for example, link information is ubiquitous in social media and could be helpful in selecting relevant features. In this paper, we study a novel problem to conduct unsupervised streaming feature selection for social media data. We investigate how to exploit link information in streaming feature selection, resulting in a novel unsupervised streaming feature selection framework USFS. Experimental results on two real-world social media datasets show the effectiveness and efficiency of the proposed framework comparing with the state-of-the-art unsupervised feature selection algorithms.
Weighted Similarity Estimation in Data Streams BIBAFull-Text 1051-1060
  Konstantin Kutzkov; Mohamed Ahmed; Sofia Nikitaki
Similarity computation between pairs of objects is often a bottleneck in many applications that have to deal with massive volumes of data. Motivated by applications such as collaborative filtering in large-scale recommender systems, and influence probabilities learning in social networks, we present new randomized algorithms for the estimation of weighted similarity in data streams.
   Previous works have addressed the problem of learning binary similarity measures in a streaming setting. To the best of our knowledge, the algorithms proposed here are the first that specifically address the estimation of weighted similarity in data streams. The algorithms need only one pass over the data, making them ideally suited to handling massive data streams in real time.
   We obtain precise theoretical bounds on the approximation error and complexity of the algorithms. The results of evaluating our algorithms on two real-life datasets validate the theoretical findings and demonstrate the applicability of the proposed algorithms.
Private Analysis of Infinite Data Streams via Retroactive Grouping BIBAFull-Text 1061-1070
  Rui Chen; Yilin Shen; Hongxia Jin
With the rapid advances in hardware technology, data streams are being generated daily in large volumes, enabling a wide range of real-time analytical tasks. Yet data streams from many sources are inherently sensitive, and thus providing continuous privacy protection in data streams has been a growing demand. In this paper, we consider the problem of private analysis of infinite data streams under differential privacy. We propose a novel data stream sanitization framework that periodically releases histograms summarizing the event distributions over sliding windows to support diverse data analysis tasks. Our framework consists of two modules, a sampling-based change monitoring module and a continuous histogram publication module. The monitoring module features an adaptive Bernoulli sampling process to accurately track the evolution of a data stream. We for the first time conduct error analysis of sampling under differential privacy, which allows to select the best sampling rate. The publication module features three different publishing strategies, including a novel technique called retroactive grouping to enjoy reduced noise. We provide theoretical analysis of the utility, privacy and complexity of our framework. Extensive experiments over real datasets demonstrate that our solution substantially outperforms the state-of-the-art competitors.

Session 5E: Classification 2

Parallel Lazy Semi-Naive Bayes Strategies for Effective and Efficient Document Classification BIBAFull-Text 1071-1080
  Felipe Viegas; Marcos André Gonçalves; Wellington Martins; Leonardo Rocha
Automatic Document Classification (ADC) is the basis of many important applications such as spam filtering and content organization. Naive Bayes (NB) approaches are a widely used classification paradigm, due to their simplicity, efficiency, absence of parameters and effectiveness. However, they do not present competitive effectiveness when compared to other modern statistical learning methods, such as SVMs. This is related to some characteristics of real document collections, such as class imbalance, feature sparseness and strong relationships among attributes. In this paper, we investigate whether the relaxation of the NB feature independence assumption (aka, Semi-NB approaches) can improve its effectiveness in large text collections. We propose four new Lazy Semi-NB strategies that exploit different ideas for alleviating the NB independence assumption. By being lazy, our solutions focus only on the most important features to classify a given test document, overcoming some Semi-NB issues when applied to ADC such as bias towards larger classes and overfitting and/or lack of generalization of the models. We demonstrate that our Lazy Semi-NB proposals can produce superior effectiveness when compared to state-of-the-art ADC classifiers such as SVM and KNN. Moreover, to overcome some efficiency issues of combining Semi-NB and lazy strategies, we take advantage of current manycore GPU architectures and present a massively parallelized version of the Semi-NB approaches. Our experimental results show that speedups of up to 63.36 times can be obtained when compared to serial solutions, making our proposals very practical in real-situations.
A Novel Class Noise Estimation Method and Application in Classification BIBAFull-Text 1081-1090
  Lin Gui; Qin Lu; Ruifeng Xu; Minglei Li; Qikang Wei
Noise in class labels of any training set can lead to poor classification results no matter what machine learning method is used. In this paper, we first present the problem of binary classification in the presence of random noise on the class labels, which we call class noise. To model class noise, a class noise rate is normally defined as a small independent probability of the class labels being inverted on the whole set of training data. In this paper, we propose a method to estimate class noise rate at the level of individual samples in real data. Based on the estimation result, we propose two approaches to handle class noise. The first technique is based on modifying a given surrogate loss function. The second technique eliminates class noise by sampling. Furthermore, we prove that the optimal hypothesis on the noisy distribution can approximate the optimal hypothesis on the clean distribution using both approaches. Our methods achieve over 87% accuracy on a synthetic non-separable dataset even when 40% of the labels are inverted. Comparisons to other algorithms show that our methods outperform state-of-the-art approaches on several benchmark datasets in different domains with different noise rates.
Learning Task Grouping using Supervised Task Space Partitioning in Lifelong Multitask Learning BIBAFull-Text 1091-1100
  Meenakshi Mishra; Jun Huan
Lifelong multitask learning is a multitask learning framework in which a learning agent faces the tasks that need to be learnt in an online manner. Lifelong multitask learning framework may be applied to a variety of applications such as image annotation, robotics, automated machines etc, and hence, may prove to be a highly promising direction for further investigation. However, the lifelong learning framework comes with its own baggage of challenges. The biggest challenge is the fact that the characteristics of the future tasks which might be encountered by the learning agents are entirely unknown. If all the tasks are assumed to be related, there may be a risk of training from unrelated task resulting in negative transfer of information. To overcome this problem, both batch and online multitask learning algorithms learn task relationships. However, due to the unknown nature of the future tasks, learning the task relationships is also difficult in lifelong multitask learning. In this paper, we propose learning functions to model the task relationships as it is computationally cheaper in an online setting. More specifically, we learn partition functions in the task space to divide the tasks into cluster. Our major contribution is to present a global formulation to learn both the task partitions and the parameters. We provide a supervised learning framework to estimate both the partition function and the model. The current method has been implemented and compared against other leading lifelong learning algorithms using several real world datasets, and we show that the current method has a superior performance.
KSGM: Keynode-driven Scalable Graph Matching BIBAFull-Text 1101-1110
  Xilun Chen; K. Selçuk Candan; Maria Luisa Sapino; Paulo Shakarian
Understanding how a given pair of graphs align with each other (also known as the graph matching problem) is a critical task in many search, classification, and analysis applications. Unfortunately, the problem of maximum common subgraph isomorphism between two graphs is a well known NP-hard problem, rendering it impractical to search for exact graph alignments. While there are several heuristics, most of these analyze and encode global and local structural information for every node of the graph and then rank pairs of nodes across the two graphs based on their structural similarities. Moreover, many algorithms involve a post-processing (or refinement) step which aims to improve the initial matching accuracy. In this paper we note that the expensive refinement phase of graph matching algorithms is not practical in any application where scalability is critical. It is also impractical to seek structural similarity between all pairs of nodes. We argue that a more practical and scalable solution is to seek structural keynodes of the input graphs that can be used to limit the amount of time needed to search for alignments. Naturally, these keynodes need to be selected carefully to prevent any degradations in accuracy during the alignment process. Given this motivation, in this paper, we first present a structural keynode extraction (SKE) algorithm and then use structural keynodes obtained during off-line processing for keynode-driven scalable graph matching (KSGM). Experiments show that the proposed keynode-driven scalable graph matching algorithms produce alignments that are as accurate as (or better than) the state-of-the-art algorithms, with significantly faster online executions.

Session 5F: Sentiment and Content Analysis

Protecting Your Children from Inappropriate Content in Mobile Apps: An Automatic Maturity Rating Framework BIBAFull-Text 1111-1120
  Bing Hu; Bin Liu; Neil Zhenqiang Gong; Deguang Kong; Hongxia Jin
Mobile applications (Apps) could expose children or adolescents to mature themes such as sexual content, violence and drug use, which results in an inappropriate security and privacy risk for them. Therefore, mobile platforms provide rating policies to label the maturity levels of Apps and the reasons why an App has a given maturity level, which enables parents to select maturity-appropriate Apps for their children. However, existing approaches to implement these maturity rating policies are either costly (because of expensive manually labeling) or inaccurate (because of no centralized controls). In this work, we aim to design and build a machine learning framework to automatically predict maturity levels for mobile Apps and the associated reasons with a high accuracy and a low cost.
   To this end, we take a multi-label classification approach to predict the mature contents in a given App and then label the maturity level according to a rating policy. Specifically, we extract novel features from App descriptions by leveraging deep learning technique to automatically capture the semantic similarity of pairwise words and adapt Support Vector Machine to capture label correlations with pearson correlation in a multi-label classification setting. Moreover, we evaluate our approach and various baseline methods using datasets that we collected from both App Store and Google Play. We demonstrate that, with only App descriptions, our approach already achieves 85% Precision for predicting mature contents and 79% Precision for predicting maturity levels, which substantially outperforms baseline methods.
The Role of Query Sessions in Interpreting Compound Noun Phrases BIBAFull-Text 1121-1129
  Marius Pasca
The meaning of compound noun phrases can be approximated in the form of lexical interpretations extracted from text. The interpretations hint at the role that modifiers play relative to heads within the noun phrases. In a study examining the role of query sessions in explaining compound noun phrases, candidate interpretations of compound noun phrases are extracted from pairs of queries that belong to the same query session. Experimental results over multiple evaluation sets of noun phrases show a higher accuracy of the interpretations when extracted from query sessions rather than from individual queries.
Deep Semantic Frame-Based Deceptive Opinion Spam Analysis BIBAFull-Text 1131-1140
  Seongsoon Kim; Hyeokyoon Chang; Seongwoon Lee; Minhwan Yu; Jaewoo Kang
User-generated content is becoming increasingly valuable to both individuals and businesses due to its usefulness and influence in e-commerce markets. As consumers rely more on such information, posting deceptive opinions, which can be deliberately used for potential profit, is becoming more of an issue. Existing work on opinion spam detection focuses mainly on linguistic features such as n-grams, syntactic patterns, or LIWC. However, deep semantic analysis remains largely unstudied. In this paper, we propose a frame-based deep semantic analysis method for understanding rich characteristics of deceptive and truthful opinions written by various types of individuals including crowdsourcing workers, employees who have expert-level domain knowledge about local businesses, and online users who post on Yelp and TripAdvisor. Using our proposed semantic frame feature, we developed a classification model that outperforms the baseline model and achieves an accuracy of nearly 91%. Also, we performed qualitative analysis of deceptive and truthful review datasets and considered their semantic differences. Finally, we successfully found some interesting features that existing methods were unable to identify.
Topic Modeling in Semantic Space with Keywords BIBAFull-Text 1141-1150
  Xiaojia Pu; Rong Jin; Gangshan Wu; Dingyi Han; Gui-Rong Xue
A common and convenient approach for user to describe his information need is to provide a set of keywords. Therefore, the technique to understand the need becomes crucial. In this paper, for the information need about a topic or category, we propose a novel method called TDCS (Topic Distilling with Compressive Sensing) for explicit and accurate modeling the topic implied by several keywords. The task is transformed as a topic reconstruction problem in the semantic space with a reasonable intuition that the topic is sparse in the semantic space. The latent semantic space could be mined from documents via unsupervised methods, e.g. LSI. Compressive sensing is leveraged to obtain a sparse representation from only a few keywords. In order to make the distilled topic more robust, an iterative learning approach is adopted. The experiment results show the effectiveness of our method. Moreover, with only a few semantic concepts remained for the topic, our method is efficient for subsequent text mining tasks.

Session 6A: Time Series and Streams

F1: Accelerating the Optimization of Aggregate Continuous Queries BIBAFull-Text 1151-1160
  Anatoli U. Shein; Panos K. Chrysanthis; Alexandros Labrinidis
Data Stream Management Systems performing on-line analytics rely on the efficient execution of large numbers of Aggregate Continuous Queries (ACQs). The state-of-the-art WeaveShare optimizer uses the Weavability concept in order to selectively combine ACQs for partial aggregation and produce high quality execution plans. However, WeaveShare does not scale well with the number of ACQs. In this paper we propose a novel closed formula, F1, that accelerates Weavability calculations, and thus allows WeaveShare to achieve exceptional scalability in systems with heavy workloads. In general, F1 can reduce the computation time of any technique that combines partial aggregations within composite slides of multiple ACQs. We theoretically analyze the Bit Set approach currently used by WeaveShare and show that F1 is superior in both time and space complexities. We show that F1 performs 1062 times less operations compared to Bit Set to produce the same execution plan for the same input. We experimentally show that F1 executes up to 60,000 times faster and can handle 1,000,000 ACQs in a setting where the limit for the current technique is 550.
Fast Distributed Correlation Discovery Over Streaming Time-Series Data BIBAFull-Text 1161-1170
  Tian Guo; Saket Sathe; Karl Aberer
The dramatic rise of time-series data in a variety of contexts, such as social networks, mobile sensing, data centre monitoring, etc., has fuelled interest in obtaining real-time insights from such data using distributed stream processing systems. One such extremely valuable insight is the discovery of correlations in real-time from large-scale time-series data. A key challenge in discovering correlations is that the number of time-series pairs that have to be analyzed grows quadratically in the number of time-series, giving rise to a quadratic increase in both computation cost and communication cost between the cluster nodes in a distributed environment. To tackle the challenge, we propose a framework called AEGIS. AEGIS exploits well-established statistical properties to dramatically prune the number of time-series pairs that have to be evaluated for detecting interesting correlations. Our extensive experimental evaluations on real and synthetic datasets establish the efficacy of AEGIS over baselines.
Time Series Analysis of Nursing Notes for Mortality Prediction via a State Transition Topic Model BIBAFull-Text 1171-1180
  Yohan Jo; Natasha Loghmanpour; Carolyn Penstein Rosé
Accurate mortality prediction is an important task in intensive care units in order to channel prompt care to patients in the most critical condition and to reduce nurses' alarm fatigue. Nursing notes carry valuable information in this regard, but nothing has been reported about the effectiveness of temporal analysis of nursing notes in mortality prediction tasks.
   We propose a time series model that uncovers the temporal dynamics of patients' underlying states from nursing notes. The effectiveness of this information in mortality prediction is examined for mortality prediction for five different time spans ranging from one day to one year. Our experiments show that the model captures both patient states and their temporal dynamics that have a strong correlation with patient mortality. The results also show that incorporating temporal information improves performance in long-term mortality prediction, but has no significant effect in short-term prediction.

Session 6B: Adaptive Learning

Learning Relative Similarity from Data Streams: Active Online Learning Approaches BIBAFull-Text 1181-1190
  Shuji Hao; Peilin Zhao; Steven C. H. Hoi; Chunyan Miao
Relative similarity learning, as an important learning scheme for information retrieval, aims to learn a bi-linear similarity function from a collection of labeled instance-pairs, and the learned function would assign a high similarity value for a similar instance-pair and a low value for a dissimilar pair. Existing algorithms usually assume the labels of all the pairs in data streams are always made available for learning. However, this is not always realistic in practice since the number of possible pairs is quadratic to the number of instances in the database, and manually labeling the pairs could be very costly and time consuming. To overcome the limitation, we propose a novel framework of active online similarity learning. Specifically, we propose two new algorithms: (i) PAAS: Passive-Aggressive Active Similarity learning; (ii) CWAS: Confidence-Weighted Active Similarity learning, and we will prove their mistake bounds in theory. We have conducted extensive experiments on a variety of real-world data sets, and we find encouraging results that validate the empirical effectiveness of the proposed algorithms.
Ad Hoc Monitoring of Vocabulary Shifts over Time BIBAFull-Text 1191-1200
  Tom Kenter; Melvin Wevers; Pim Huijnen; Maarten de Rijke
Word meanings change over time. Detecting shifts in meaning for particular words has been the focus of much research recently. We address the complementary problem of monitoring shifts in vocabulary over time. That is, given a small seed set of words, we are interested in monitoring which terms are used over time to refer to the underlying concept denoted by the seed words.
   In this paper, we propose an algorithm for monitoring shifts in vocabulary over time, given a small set of seed terms. We use distributional semantic methods to infer a series of semantic spaces over time from a large body of time-stamped unstructured textual documents. We construct semantic networks of terms based on their representation in the semantic spaces and use graph-based measures to calculate saliency of terms. Based on the graph-based measures we produce ranked lists of terms that represent the concept underlying the initial seed terms over time as final output.
   As the task of monitoring shifting vocabularies over time for an ad hoc set of seed words is, to the best of our knowledge, a new one, we construct our own evaluation set. Our main contributions are the introduction of the task of ad hoc monitoring of vocabulary shifts over time, the description of an algorithm for tracking shifting vocabularies over time given a small set of seed words, and a systematic evaluation of results over a substantial period of time (over four decades). Additionally, we make our newly constructed evaluation set publicly available.
Balancing Novelty and Salience: Adaptive Learning to Rank Entities for Timeline Summarization of High-impact Events BIBAFull-Text 1201-1210
  Tuan A. Tran; Claudia Niederee; Nattiya Kanhabua; Ujwal Gadiraju; Avishek Anand
Long-running, high-impact events such as the Boston Marathon bombing often develop through many stages and involve a large number of entities in their unfolding. Timeline summarization of an event by key sentences eases story digestion, but does not distinguish between what a user remembers and what she might want to re-check. In this work, we present a novel approach for timeline summarization of high-impact events, which uses entities instead of sentences for summarizing the event at each individual point in time. Such entity summaries can serve as both (1) important memory cues in a retrospective event consideration and (2) pointers for personalized event exploration. In order to automatically create such summaries, it is crucial to identify the "right" entities for inclusion. We propose to learn a ranking function for entities, with a dynamically adapted trade-off between the in-document salience of entities and the informativeness of entities across documents, i.e., the level of new information associated with an entity for a time point under consideration. Furthermore, for capturing collective attention for an entity we use an innovative soft labeling approach based on Wikipedia. Our experiments on a real large news datasets confirm the effectiveness of the proposed methods.

Session 6C: Points-of-Interest

Location-Based Influence Maximization in Social Networks BIBAFull-Text 1211-1220
  Tao Zhou; Jiuxin Cao; Bo Liu; Shuai Xu; Ziqing Zhu; Junzhou Luo
In this paper, we aim at the product promotion in O2O model and carry out the research of location-based influence maximization on the platform of LBSN. As offline consuming behavior exists under the O2O environment, the traditional online influence diffusion model could not describe the product acceptance accurately. Moreover, the existing researches of influence maximization tend to only concern on the online network of relationships but rarely take the offline part into consideration. This paper introduces the location property into the influence maximization to accord with the characteristic of O2O model. Firstly, we propose an improved influence diffusion model called TP Model which could accurately describe the process of accepting products under the O2O environment. Meanwhile, the definition of location-based influence maximization is presented. Then the user mobility pattern is analyzed and the calculation method of offline probability is designed. Considering the influence ability, a location-based influence maximization algorithm named TPH is proposed. Experiments prove TPH algorithm has general advantage. Finally, focusing on the performance of TPH algorithm under special circumstances, MR algorithm is designed as complement and experiments also verify its high effectiveness.
Location and Time Aware Social Collaborative Retrieval for New Successive Point-of-Interest Recommendation BIBAFull-Text 1221-1230
  Wei Zhang; Jianyong Wang
In location-based social networks (LBSNs), new successive point-of-interest (POI) recommendation is a newly formulated task which tries to regard the POI a user currently visits as his POI-related query and recommend new POIs the user has not visited before. While carefully designed methods are proposed to solve this problem, they ignore the essence of the task which involves retrieval and recommendation problem simultaneously and fail to employ the social relations or temporal information adequately to improve the results.
   In order to solve this problem, we propose a new model called location and time aware social collaborative retrieval model (LTSCR), which has two distinct advantages: (1) it models the location, time, and social information simultaneously for the successive POI recommendation task; (2) it efficiently utilizes the merits of the collaborative retrieval model which leverages weighted approximately ranked pairwise (WARP) loss for achieving better top-n ranking results, just as the new successive POI recommendation task needs. We conducted some comprehensive experiments on publicly available datasets and demonstrate the power of the proposed method, with 46.6% growth in Precision@5 and 47.3% improvement in Recall@5 over the best previous method.
Where you Instagram?: Associating Your Instagram Photos with Points of Interest BIBAFull-Text 1231-1240
  Xutao Li; Tuan-Anh Nguyen Pham; Gao Cong; Quan Yuan; Xiao-Li Li; Shonali Krishnaswamy
Instagram, an online photo-sharing platform, has gained increasing popularity. It allows users to take photos, apply digital filters and share them with friends instantaneously by using mobile devices.
   Instagram provides users with the functionality to associate their photos with points of interest, and it thus becomes feasible to study the association between points of interest and Instagram photos. However, no previous work studies the association. In this paper, we propose to study the problem of mapping Instagram photos to points of interest. To understand the problem, we analyze Instagram datasets, and report our findings, which also characterize the challenges of the problem. To address the challenges, we propose to model the mapping problem as a ranking problem, and develop a method to learn a ranking function by exploiting the textual, visual and user information of photos. To maximize the prediction effectiveness for textual and visual information, and incorporate the users' visiting preferences, we propose three subobjectives for learning the parameters of the proposed ranking function. Experimental results on two sets of Instagram data show that the proposed method substantially outperforms existing methods that are adapted to handle the problem.

Session 6D: Matrices

Gradient-based Signatures for Efficient Similarity Search in Large-scale Multimedia Databases BIBAFull-Text 1241-1250
  Christian Beecks; Merih Seran Uysal; Judith Hermanns; Thomas Seidl
With the continuous rise of multimedia, the question of how to access large-scale multimedia databases efficiently has become of crucial importance. Given a multimedia database comprising millions of multimedia objects, how to approximate the content-based properties of the corresponding feature representations in order to carry out similarity search efficiently and with high accuracy? In this paper, we propose the concept of gradient-based signatures in order to aggregate content-based features of multimedia objects by means of generative models. We provide theoretical insights into our approach including closed-form expressions for the computation of gradient-based signatures with respect to Gaussian mixture models and additionally investigate different binarization methods for gradient-based signatures in order to query databases comprising millions of multimedia objects with high accuracy in less than one second.
Cross-Modal Similarity Learning: A Low Rank Bilinear Formulation BIBAFull-Text 1251-1260
  Cuicui Kang; Shengcai Liao; Yonghao He; Jian Wang; Wenjia Niu; Shiming Xiang; Chunhong Pan
The cross-media retrieval problem has received much attention in recent years due to the rapid increasing of multimedia data on the Internet. A new approach to the problem has been raised which intends to match features of different modalities directly. In this research, there are two critical issues: how to get rid of the heterogeneity between different modalities and how to match the cross-modal features of different dimensions. Recently metric learning methods show a good capability in learning a distance metric to explore the relationship between data points. However, the traditional metric learning algorithms only focus on single-modal features, which suffer difficulties in addressing the cross-modal features of different dimensions. In this paper, we propose a cross-modal similarity learning algorithm for the cross-modal feature matching. The proposed method takes a bilinear formulation, and with the nuclear-norm penalization, it achieves low-rank representation. Accordingly, the accelerated proximal gradient algorithm is successfully imported to find the optimal solution with a fast convergence rate O(1/t2). Experiments on three well known image-text cross-media retrieval databases show that the proposed method achieves the best performance compared to the state-of-the-art algorithms.
Efficient Sparse Matrix Multiplication on GPU for Large Social Network Analysis BIBAFull-Text 1261-1270
  Yong-Yeon Jo; Sang-Wook Kim; Duck-Ho Bae
As a number of social network services appear online recently, there have been many attempts to analyze social networks for extracting valuable information. Most existing methods first represent a social network as a quite sparse adjacency matrix, and then analyze it through matrix operations such as matrix multiplication. Due to the large scale and high complexity, efficient processing multiplications is an important issue in social network analysis. In this paper, we propose a GPU-based method for efficient sparse matrix multiplication through the parallel computing paradigm. The proposed method aims at balancing the amount of workload both at fine- and coarse-grained levels for maximizing the degree of parallelism in GPU. Through extensive experiments using synthetic and real-world datasets, we show that the proposed method outperforms previous methods by up to three orders-of-magnitude.

Session 6E: Citation Networks

The Role Of Citation Context In Predicting Long-Term Citation Profiles: An Experimental Study Based On A Massive Bibliographic Text Dataset BIBAFull-Text 1271-1280
  Mayank Singh; Vikas Patidar; Suhansanu Kumar; Tanmoy Chakraborty; Animesh Mukherjee; Pawan Goyal
The impact and significance of a scientific publication is measured mostly by the number of citations it accumulates over the years. Early prediction of the citation profile of research articles is a significant as well as challenging problem. In this paper, we argue that features gathered from the citation contexts of the research papers can be very relevant for citation prediction. Analyzing a massive dataset of nearly 1.5 million computer science articles and more than 26 million citation contexts, we show that average countX (number of times a paper is cited within the same article) and average citeWords (number of words within the citation context) discriminate between various citation ranges as well as citation categories. We use these features in a stratified learning framework for future citation prediction. Experimental results show that the proposed model significantly outperforms the existing citation prediction models by a margin of 8-10% on an average under various experimental settings. Specifically, the features derived from the citation context help in predicting long-term citation behavior.
Discovering Canonical Correlations between Topical and Topological Information in Document Networks BIBAFull-Text 1281-1290
  Yuan He; Cheng Wang; Changjun Jiang
Document network is a kind of intriguing dataset which can provide both topical (textual content) and topological (relational link) information. A key point in viably modeling such datasets is to discover proper denominators beneath the two different types of data, text and link. Most previous work introduces the assumption that documents closely linked with each other share common latent topics. However, the heterophily (i.e., tendency to link to different others) of nodes is neglected, which is pervasive in social networks. In this paper, we simultaneously incorporate community detection and topic modeling in a unified framework, and appeal to Canonical Correlation Analysis (CCA) to capture the latent semantic correlations between the two heterogeneous latent factors, community and topic. Despite of the homophily (i.e., tendency to link to similar others) or heterophily, CCA can properly capture the inherent correlations which fit the dataset itself without any prior hypothesis. Logistic normal prior is also employed in modeling network to better capture the community correlations. We derive efficient inference and learning algorithms based on variational EM methods. The effectiveness of our proposed model is comprehensively verified on three different types of datasets which are namely hyperlinked networks of web pages, social networks of friends and coauthor networks of publications. Experimental results show that our approach achieves significant improvements on both topic modeling and community detection compared with the current state of the art. Meanwhile, our model is impressive in discovering correlations between extracted topics and communities.
Chronological Citation Recommendation with Information-Need Shifting BIBAFull-Text 1291-1300
  Zhuoren Jiang; Xiaozhong Liu; Liangcai Gao
As the volume of publications has increased dramatically, an urgent need has developed to assist researchers in locating high-quality, candidate-cited papers from a research repository. Traditional scholarly-recommendation approaches ignore the chronological nature of citation recommendations. In this study, we propose a novel method called "Chronological Citation Recommendation" which assumes initial user information needs could shift while users are searching for papers in different time slices. We model the information-need shifts with two-level modeling: dynamic time-related ranking feature construction and dynamic evolving feature weight training. In more detail, we employed a supervised document influence model to characterize the content "time-varying" dynamics and constructed a novel heterogeneous graph that encapsulates dynamic topic-based information, time-decay paper/topic citation information, and word-based information. We applied multiple meta-paths for different ranking hypotheses which carried different types of information for citation recommendation in various time slices, along with information-need shifting. We also used multiple learning-to-rank models to optimize the feature weights for different time slices to generate the final "Chronological Citation Recommendation" rankings. The use of Chronological Citation Recommendation suggests time-series ranking lists based on initial user textual information need and characterizes the information-need shifting. Experiments on the ACM corpus show that Chronological Citation Recommendation can significantly enhance citation recommendation performance.

Session 6F: Knowledge Bases

Answering Questions with Complex Semantic Constraints on Open Knowledge Bases BIBAFull-Text 1301-1310
  Pengcheng Yin; Nan Duan; Ben Kao; Junwei Bao; Ming Zhou
A knowledge-based question-answering system (KB-QA) is one that answers natural language questions with information stored in a large-scale knowledge base (KB). Existing KB-QA systems are either powered by curated KBs in which factual knowledge is encoded in entities and relations with well-structured schemas, or by open KBs, which contain assertions represented in the form of triples (e.g., subject; relation phrase; argument). We show that both approaches fall short in answering questions with complex prepositional or adverbial constraints. We propose using n-tuple assertions, which are assertions with an arbitrary number of arguments, and n-tuple open KB (nOKB), which is an open knowledge base of n-tuple assertions. We present TAQA, a novel KB-QA system that is based on an nOKB and illustrate via experiments how TAQA can effectively answer complex questions with rich semantic constraints. Our work also results in a new open KB containing 120M n-tuple assertions and a collection of 300 labeled complex questions, which is made publicly available for further research.
Inducing Space Dirichlet Process Mixture Large-Margin Entity Relationship Inference in Knowledge Bases BIBAFull-Text 1311-1320
  Sotirios P. Chatzis
In this paper, we focus on the problem of extending a given knowledge base by accurately predicting additional true facts based on the facts included in it. This is an essential problem of knowledge representation systems, since knowledge bases typically suffer from incompleteness and lack of ability to reason over their discrete entities and relationships. To achieve our goals, in our work we introduce an inducing space nonparametric Bayesian large-margin inference model, capable of reasoning over relationships between pairs of entities. Previous works addressing the entity relationship inference problem model each entity based on atomic entity vector representations. In contrast, our method exploits word feature vectors to directly obtain high-dimensional nonlinear inducing space representations for entity pairs. This way, we allow for extracting salient latent characteristics and interaction dynamics within entity pairs that can be useful for inferring their relationships. On this basis, our model performs the relations inference task by postulating a set of binary Dirichlet process mixture large-margin classifiers, presented with the derived inducing space representations of the considered entity pairs. Bayesian inference for this inducing space model is performed under the mean-field inference paradigm. This is made possible by leveraging a recently proposed latent variable formulation of regularized large-margin classifiers that facilitates mean-field parameter estimation. We exhibit the superiority of our approach over the state-of-the-art by considering the problem of predicting additional true relations between entities given subsets of the WordNet and FreeBase knowledge bases.
Semi-Automated Exploration of Data Warehouses BIBAFull-Text 1321-1330
  Thibault Sellam; Emmanuel Müller; Martin Kersten
Exploratory data analysis tries to discover novel dependencies and unexpected patterns in large databases. Traditionally, this process is manual and hypothesis-driven. However, analysts can come short of patience and imagination. In this paper, we introduce Claude, a hypothesis generator for data warehouses. Claude follows a 2-step approach: (1) It detects interesting views, by exploiting non-linear statistical dependencies between the dimensions and the measure. (2) To explain its findings, it detects local patterns in these views and describes them with SQL queries. Technically, we derive a model of interestingness from fundamental information theory. To exploit this model, we present aggressive approximations and heuristics, allowing Claude to be fast and more accurate than state-of-art view selection algorithms.
Large-scale Knowledge Base Completion: Inferring via Grounding Network Sampling over Selected Instances BIBAFull-Text 1331-1340
  Zhuoyu Wei; Jun Zhao; Kang Liu; Zhenyu Qi; Zhengya Sun; Guanhua Tian
Constructing large-scale knowledge bases has attracted much attention in recent years, for which Knowledge Base Completion (KBC) is a key technique. In general, inferring new facts in a large-scale knowledge base is not a trivial task. The large number of inferred candidate facts has resulted in the failure of the majority of previous approaches. Inference approaches can achieve high precision for formulas that are accurate, but they are required to infer candidate instances one by one, and extremely large candidate sets bog them down in expensive calculations. In contrast, the existing embedding-based methods can easily calculate similarity-based scores for each candidate instance as opposed to using inference, so they can handle large-scale data. However, this type of method does not consider explicit logical semantics and usually has unsatisfactory precision. To resolve the limitations of the above two types of methods, we propose an approach through Inferring via Grounding Network Sampling over Selected Instances. We first employ an embedding-based model to make the instance selection and generate much smaller candidate sets for subsequent fact inference, which not only narrows the candidate sets but also filters out part of the noise instances. Then, we only make inferences within these candidate sets by running a data-driven inference algorithm on the Markov Logic Network (MLN), which is called Inferring via Grounding Network Sampling (INS). In this process, we especially incorporate the similarity priori generated by embedding-based models into INS to promote the inference precision. The experimental results show that our approach improved Hits@1 from 32.911% to 71.692% on the FB15K dataset and achieved much better AP@n evaluations than state-of-the-art methods.

Keynote Address III

Large-Scale Analysis of Dynamics of Choice Among Discrete Alternatives BIBAFull-Text 1349
  Andrew Tomkins
The online world is rife with scenarios in which a user must select one from a finite set of alternatives: which movie to watch, which song to play, which camera to order, which website to visit. There is a long history of study of these types of questions in economics, machine learning, marketing, and psychology. However, historically the study of choice was limited to relatively modest data scales. Today, we have access to large-scale datasets providing insights into the choices of large populations of users faced with a wide variety of sets of alternatives. From such data, we are beginning to develop more detailed models of how users weigh alternatives and make selections.
   I'll begin this talk by covering basic models for choice, along with some standard assumptions made by these models. I'll then cover some more recent work on understanding choice in modern datasets.
   First, I'll discuss choice in a geographic setting, in which users make selections of restaurants. Features in this setting provide visibility into the actual time cost to travel to one alternative versus another. In addition to this real cost, we may also tease out the implications on likelihood as the set of alternatives at a particular distance becomes larger or smaller based on the local density of restaurants. The marginals of these factors provide poor quality compared to joint models suggested by choice theory.
   From choice of individual restaurants, I'll then move to a setting in which users consume items repeatedly: repeated purchases of a particular brand of product; repeated listens to a song; repeated visits to a favorite coffee shop; and so forth. I'll describe a simple but accurate model for this problem whose behavior moves between two regimes based on parameter choices: in one regime, users settle on particular favorites and maintain them over time; in another regime, even the most favored item will eventually be abandoned after finite reconsumptions.
   Finally, I'll move to a more complex scenario of sequential consumption of a range of items, and will show how the theory of discrete choice can be incorporated into the theory of Markov processes, requiring a new algorithmic approach to learning the optimal solution.
   In all of these instances, the learned models include a per-item quality score that may be viewed as proportional to the residual likelihood of selecting the item after other factors in the choice slate have been accounted for.
   The work described in this talk is partly due to other researchers, and partly joint with various colleagues including Ashton Anderson, Ravi Kumar, Mohammad Mahdian, Bo Pang, Sergei Vassilvitskii and Erik Vee.

Session 7A: Database Optimization

On Gapped Set Intersection Size Estimation BIBAFull-Text 1351-1360
  Chen Chen; Jianbin Qin; Wei Wang
There exists considerable literature on estimating the cardinality of set intersection result. In this paper, we consider a generalized problem for integer sets where, given a gap parameter δ, two elements are deemed as matches if their numeric difference equals δ or is within δ. We call this problem the gapped set intersection size estimation (GSISE/), and it can be used to model applications in database systems, data mining, and information retrieval. We first distinguish two subtypes of the estimation problem: the point gap estimation and range gap estimation. We propose optimized sketches to tackle the two problems efficiently and effectively with theoretical guarantees. We demonstrate the usage of our proposed techniques in mining top-K related keywords efficiently, by integrating with an inverted index. Finally, substantial experiments based on a large subset of the ClueWed09 dataset demonstrate the efficiency and effectiveness of the proposed methods.
Inclusion Dependencies Reloaded BIBAFull-Text 1361-1370
  Henning Köhler; Sebastian Link
Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data cleaning and profiling, entity resolution and schema matching. Surprisingly, the implication problem of inclusion dependencies has not been investigated in the context of SQL, the de-facto industry standard. Codd's relational model of data represents the idealized special case of SQL in which all attributes are declared NOT NULL. Driven by the SQL standard recommendation, we investigate inclusion dependencies and NOT NULL constraints under simple and partial semantics. Partial semantics is not natively supported by any SQL implementation but we show how classical results on the implication problem carry over into this context. Interestingly, simple semantics is natively supported by every SQL implementation, but we show that the implication problem is not finitely axiomatizable in this context. Resolving this conundrum we establish an optimal solution by identifying the desirable class of not-null inclusion dependencies (NNINDs) that subsumes simple and partial semantics as special cases, and whose associated implication problem has the same computational properties as inclusion dependencies in the relational model. That is, NNIND implication is 2-ary axiomatizable and PSPACE-complete to decide. Our proof techniques bring also forward a chase procedure for deciding NNIND implication, the NP-hard subclass of typed acyclic NNINDs, and the tractable subclasses of NNINDs whose arity is bounded.
Comprehensible Models for Reconfiguring Enterprise Relational Databases to Avoid Incidents BIBAFull-Text 1371-1380
  Ioana Giurgiu; Mirela Botezatu; Dorothea Wiesmann
Configuring enterprise database management systems is a notoriously hard problem. The combinatorial parameter space makes it intractable to run and observe the DBMS behavior in all scenarios. Thus, the database administrator has the difficult task of choosing DBMS configurations that potentially lead to critical incidents, thus hindering its availability or performance. We propose using machine learning to understand how configuring a DBMS can lead to such high risk incidents. We collect historical data from three IT environments that run both IBM DB2 and Oracle DBMS. Then, we implement several linear and non-linear multivariate models to identify and learn from high risk configurations. We analyze their performance, in terms of accuracy, cost, generalization and interpretability. Results show that high risk configurations can be identified with extremely high accuracy and that the database administrator can potentially benefit from the rules extracted to reconfigure in order to prevent incidents.
An Optimal Online Algorithm For Retrieving Heavily Perturbed Statistical Databases In The Low-Dimensional Querying Model BIBAFull-Text 1381-1390
  Krzysztof Marcin Choromanski; Afshin Rostamizadeh; Umar Syed
We give the first Õ(1 over √ T)-error online algorithm for reconstructing noisy statistical databases, where T is the number of (online) sample queries received. The algorithm is optimal up to the poly(log(T)) factor in terms of the error and requires only O(log T) memory. It aims to learn a hidden database-vector w* Ε in RD in order to accurately answer a stream of queries regarding the hidden database, which arrive in an online fashion from some unknown distribution D. We assume the distribution D is defined on the neighborhood of a low-dimensional manifold. The presented algorithm runs in O(dD)-time per query, where d is the dimensionality of the query-space. Contrary to the classical setting, there is no separate training set that is used by the algorithm to learn the database -- the stream on which the algorithm will be evaluated must also be used to learn the database-vector. The algorithm only has access to a binary oracle Ο that answers whether a particular linear function of the database-vector plus random noise is larger than a threshold, which is specified by the algorithm. We note that we allow for a significant O(D) amount of noise to be added while other works focused on the low noise o(√D)-setting. For a stream of T queries our algorithm achieves an average error Õ(1 over √T) by filtering out random noise, adapting threshold values given to the oracle based on its previous answers and, as a consequence, recovering with high precision a projection of a database-vector w* onto the manifold defining the query-space. Our algorithm may be also applied in the adversarial machine learning context to compromise machine learning engines by heavily exploiting the vulnerabilities of the systems that output only binary signal and in the presence of significant noise.

Session 7B: Retrieval Enhancements 2

Aggregation of Crowdsourced Ordinal Assessments and Integration with Learning to Rank: A Latent Trait Model BIBAFull-Text 1391-1400
  Pavel Metrikov; Virgil Pavlu; Javed A. Aslam
Existing approaches used for training and evaluating search engines often rely on crowdsourced assessments of document relevance with respect to a user query. To use such assessments for either evaluation or learning, we propose a new framework for the inference of true document relevance from crowdsourced data -- one simpler than previous approaches and achieving better performance. For each assessor, we model assessor quality and bias in the form of Gaussian distributed class conditionals of relevance grades. For each document, we model true relevance and difficulty as continuous variables. We estimate all parameters from crowdsourced data, demonstrating better inference of relevance as well as realistic models for both documents and assessors.
   A document-pair likelihood model works best, and it is extended to pairwise learning to rank. Utilizing more information directly from the input data, it shows better performance as compared to existing state-of-the-art approaches for learning to rank from crowdsourced assessments. Experimental validation is performed on four TREC datasets.
Weakly Supervised Natural Language Processing Framework for Abstractive Multi-Document Summarization: Weakly Supervised Abstractive Multi-Document Summarization BIBAFull-Text 1401-1410
  Peng Li; Weidong Cai; Heng Huang
In this paper, we propose a new weakly supervised abstractive news summarization framework using pattern based approaches. Our system first generates meaningful patterns from sentences. Then, in order to precisely cluster patterns, we propose a novel semisupervised pattern learning algorithm that leverages a hand-crafted list of topic-relevant keywords, which are the only weakly supervised information used by our framework to generate aspect-oriented summarization. After that, our system generates new patterns by fusing existing patterns and selecting top ranked new patterns via the recurrent neural network language model. Finally, we introduce a new pattern based surface realization algorithm to generate abstractive summaries. Automatic and manual evaluations demonstrate the effectiveness and advantages of our new methods. Code is available at: https://github.com/jerryli1981
Short Text Similarity with Word Embeddings BIBAFull-Text 1411-1420
  Tom Kenter; Maarten de Rijke
Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically of one sentence in length.
   We investigate whether determining short text similarity is possible using only semantic features -- where by semantic we mean, pertaining to a representation of meaning -- rather than relying on similarity in lexical or syntactic representations. We use word embeddings, vector representations of terms, computed from unlabelled data, that represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity.
   We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated. We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of short texts
   We show on a publicly available evaluation set commonly used for the task of semantic similarity that our method outperforms baseline methods that work under the same conditions.
Building Representative Composite Items BIBAFull-Text 1421-1430
  Vincent Leroy; Sihem Amer-Yahia; Eric Gaussier; Hamid Mirisaee
The problem of summarizing a large collection of homogeneous items has been addressed extensively in particular in the case of geo-tagged datasets (e.g. Flickr photos and tags). In our work, we study the problem of summarizing large collections of heterogeneous items. For example, a user planning to spend extended periods of time in a given city would be interested in seeing a map of that city with item summaries in different geographic areas, each containing a theater, a gym, a bakery, a few restaurants and a subway station. We propose to solve that problem by building representative Composite Items (CIs).
   To the best of our knowledge, this is the first work that addresses the problem of finding representative CIs for heterogeneous items. Our problem naturally arises when summarizing geo-tagged datasets but also in other datasets such as movie or music summarization. We formalize building representative CIs as an optimization problem and propose KFC, an extended fuzzy clustering algorithm to solve it. We show that KFC converges and run extensive experiments on a variety of real datasets that validate its effectiveness.

Session 7C: Search Mechanisms

More Accurate Question Answering on Freebase BIBAFull-Text 1431-1440
  Hannah Bast; Elmar Haussmann
Real-world factoid or list questions often have a simple structure, yet are hard to match to facts in a given knowledge base due to high representational and linguistic variability. For example, to answer "who is the ceo of apple" on Freebase requires a match to an abstract "leadership" entity with three relations "role", "organization" and "person", and two other entities "apple inc" and "managing director". Recent years have seen a surge of research activity on learning-based solutions for this method. We further advance the state of the art by adopting learning-to-rank methodology and by fully addressing the inherent entity recognition problem, which was neglected in recent works.
   We evaluate our system, called Aqqu, on two standard benchmarks, Free917 and WebQuestions, improving the previous best result for each benchmark considerably. These two benchmarks exhibit quite different challenges, and many of the existing approaches were evaluated (and work well) only for one of them. We also consider efficiency aspects and take care that all questions can be answered interactively (that is, within a second). Materials for full reproducibility are available on our website: http://ad.informatik.uni-freiburg.de/publications.
Improving Ranking Consistency for Web Search by Leveraging a Knowledge Base and Search Logs BIBAFull-Text 1441-1450
  Jyun-Yu Jiang; Jing Liu; Chin-Yew Lin; Pu-Jen Cheng
In this paper, we propose a new idea called ranking consistency in web search. Relevance ranking is one of the biggest problems in creating an effective web search system. Given some queries with similar search intents, conventional approaches typically only optimize ranking models by each query separately. Hence, there are inconsistent rankings in modern search engines. It is expected that the search results of different queries with similar search intents should preserve ranking consistency. The aim of this paper is to learn consistent rankings in search results for improving the relevance ranking in web search. We then propose a re-ranking model aiming to simultaneously improve relevance ranking and ranking consistency by leveraging knowledge bases and search logs. To the best of our knowledge, our work offers the first solution to improving relevance rankings with ranking consistency. Extensive experiments have been conducted using the Freebase knowledge base and the large-scale query-log of a commercial search engine. The experimental results show that our approach significantly improves relevance ranking and ranking consistency. Two user surveys on Amazon Mechanical Turk also show that users are sensitive and prefer the consistent ranking results generated by our model.
Assessing the Impact of Syntactic and Semantic Structures for Answer Passages Reranking BIBAFull-Text 1451-1460
  Kateryna Tymoshenko; Alessandro Moschitti
In this paper, we extensively study the use of syntactic and semantic structures obtained with shallow and deeper syntactic parsers in the answer passage reranking task. We propose several dependency-based structures enriched with Linked Open Data (LD) knowledge for representing pairs of questions and answer passages. We use such tree structures in learning to rank (L2R) algorithms based on tree kernel. The latter can represent questions and passages in a tree fragment space, where each substructure represents a powerful syntactic/semantic feature. Additionally since we define links between structures, tree kernels also generate relational features spanning question and passage structures. We derive very important findings, which can be useful to build state-of-the-art systems: (i) full syntactic dependencies can outperform shallow models also using external knowledge and (ii) the semantic information should be derived by effective and high-coverage resources, e.g., LD, and incorporated in syntactic structures to be effective. We demonstrate our findings by carrying out an extensive comparative experimentation on two different TREC QA corpora and one community question answer dataset, namely Answerbag. Our comparative analysis on well-defined answer selection benchmarks consistently demonstrates that our structural semantic models largely outperform the state of the art in passage reranking.
Ranking Entities for Web Queries Through Text and Knowledge BIBAFull-Text 1461-1470
  Michael Schuhmacher; Laura Dietz; Simone Paolo Ponzetto
When humans explain complex topics, they naturally talk about involved entities, such as people, locations, or events. In this paper, we aim at automating this process by retrieving and ranking entities that are relevant to understand free-text web-style queries like Argentine British relations, which typically demand a set of heterogeneous entities with no specific target type like, for instance, Falklands_-War} or Margaret-_Thatcher, as answer. Standard approaches to entity retrieval rely purely on features from the knowledge base. We approach the problem from the opposite direction, namely by analyzing web documents that are found to be query-relevant. Our approach hinges on entity linking technology that identifies entity mentions and links them to a knowledge base like Wikipedia. We use a learning-to-rank approach and study different features that use documents, entity mentions, and knowledge base entities -- thus bridging document and entity retrieval. Since established benchmarks for this problem do not exist, we use TREC test collections for document ranking and collect custom relevance judgments for entities. Experiments on TREC Robust04 and TREC Web13/14 data show that: i) single entity features, like the frequency of occurrence within the top-ranke documents, or the query retrieval score against a knowledge base, perform generally well; ii) the best overall performance is achieved when combining different features that relate an entity to the query, its document mentions, and its knowledge base representation.

Session 7D: Social Networks 3

What Is a Network Community?: A Novel Quality Function and Detection Algorithms BIBAFull-Text 1471-1480
  Atsushi Miyauchi; Yasushi Kawase
In this study, we introduce a novel quality function for a network community, which we refer to as the communitude. The communitude has a strong statistical background. Specifically, it measures the Z-score of a subset of vertices S with respect to the fraction of the number of edges within the subgraph induced by S. Due to the null model of a random graph used in the definition, our quality function focuses not only on the inside of the subgraph but also on the cut edges, unlike some quality functions for extracting dense subgraphs. To evaluate the detection ability of our quality function, we address the communitude maximization problem and its variants for realistic scenarios. For the problems, we propose a two-phase heuristic algorithm together with some modified versions. In the first phase, it repeatedly removes the vertex with the smallest degree, and then obtains the subgraph with maximum communitude over the iterations. In the second phase, the algorithm improves the obtained solution using a simple local search heuristic. This algorithm runs in linear time when the number of iterations is fixed to a constant; thus, it is applicable to massive graphs. Computational experiments using both synthetic graphs and real-world networks demonstrate the validity and reliability of the proposed quality function and algorithms.
DifRec: A Social-Diffusion-Aware Recommender System BIBAFull-Text 1481-1490
  Hossein Vahabi; Iordanis Koutsopoulos; Francesco Gullo; Maria Halkidi
Recommender systems used in current online social platforms make recommendations by only considering how relevant an item is to a specific user but they ignore the fact that, thanks to mechanisms like sharing or re-posting across the underlying social network, an item recommended to a user i propagates through the network and can reach another user j without needing to be explicitly recommended to j too. Overlooking this fact may lead to an inefficient use of the limited recommendation slots. These slots can instead be exploited more profitably by avoiding unnecessary duplicates and recommending other equally relevant items.
   In this work we take a step towards rethinking recommender systems by exploiting the anticipated social-network information diffusion and withholding recommendation of items that are expected to reach a user through sharing/re-posting. We devise a novel recommender system, DifRec, by formulating the problem of maximizing the total user engagement as an allocation problem in a properly-defined neighborhoodness graph, i.e., a graph that models the conflicts of recommending an item to a user who will receive it anyway by social diffusion. We show that the problem is NP-hard and propose efficient heuristics to solve it.
   We assess the performance of our DifRec by involving real data from Tumblr platform. We obtain substantial improvements in overall user engagement (130%-190%) over the real recommender system embedded in Tumblr and over various existing recommender systems.
Who With Whom And How?: Extracting Large Social Networks Using Search Engines BIBAFull-Text 1491-1500
  Stefan Siersdorfer; Philipp Kemkes; Hanno Ackermann; Sergej Zerr
Social network analysis is leveraged in a variety of applications such as identifying influential entities, detecting communities with special interests, and determining the flow of information and innovations. However, existing approaches for extracting social networks from unstructured Web content do not scale well and are only feasible for small graphs. In this paper, we introduce novel methodologies for query-based search engine mining, enabling efficient extraction of social networks from large amounts of Web data. To this end, we use patterns in phrase queries for retrieving entity connections, and employ a bootstrapping approach for iteratively expanding the pattern set. Our experimental evaluation in different domains demonstrates that our algorithms provide high quality results and allow for scalable and efficient construction of social graphs.
Modeling Individual-Level Infection Dynamics Using Social Network Information BIBAFull-Text 1501-1510
  Suppawong Tuarob; Conrad S. Tucker; Marcel Salathe; Nilam Ram
Epidemic monitoring systems engaged in accurate discovery of infected individuals enable better understanding of the dynamics of epidemics and thus may promote effective disease mitigation or prevention. Currently, infection discovery systems require either physical participation of potential patients or provision of information from hospitals and health-care services. While social media has emerged as an increasingly important knowledge source that reflects multiple real world events, there is only a small literature examining how social media information can be incorporated into computational epidemic models. In this paper, we demonstrate how social media information can be incorporated into and improve upon traditional techniques used to model the dynamics of infectious diseases. Using flu infection histories and social network data collected from 264 students in a college community, we identify social network signals that can aid identification of infected individuals. Extending the traditional SIRS model, we introduce and illustrate the efficacy of an Online-Interaction-Aware Susceptible-Infected-Recovered-Susceptible (OIA-SIRS) model based on four social network signals for modeling infection dynamics. Empirical evaluations of our case study, flu infection within a college community, reveal that the OIA-SIRS model is more accurate than the traditional model, and also closely tracks the real-world infection rates as reported by CDC ILINet and Google Flu Trend.

Session 8A: Query Evaluation

Finding Probabilistic k-Skyline Sets on Uncertain Data BIBAFull-Text 1511-1520
  Jinfei Liu; Haoyu Zhang; Li Xiong; Haoran Li; Jun Luo
Skyline is a set of points that are not dominated by any other point. Given uncertain objects, probabilistic skyline has been studied which computes objects with high probability of being skyline. While useful for selecting individual objects, it is not sufficient for scenarios where we wish to compute a subset of skyline objects, i.e., a skyline set. In this paper, we generalize the notion of probabilistic skyline to probabilistic k-skyline sets (Pk-SkylineSets) which computes k-object sets with high probability of being skyline set. We present an efficient algorithm for computing probabilistic k-skyline sets. It uses two heuristic pruning strategies and a novel data structure based on the classic layered range tree to compute the skyline set probability for each instance set with a worst-case time bound. The experimental results on the real NBA dataset and the synthetic datasets show that Pk-SkylineSets is interesting and useful, and our algorithms are efficient and scalable.
Ordering Selection Operators Under Partial Ignorance BIBAFull-Text 1521-1530
  Khaled H. Alyoubi; Sven Helmer; Peter T. Wood
Optimising queries in real-world situations under imperfect conditions is still a problem that has not been fully solved. We consider finding the optimal order in which to execute a given set of selection operators under partial ignorance of their selectivities. The selectivities are modelled as intervals rather than exact values and we apply a concept from decision theory, the minimisation of the maximum regret, as a measure of optimality. The associated decision problem turns out to be NP-hard, which renders a brute-force approach to solving it impractical. Nevertheless, by investigating properties of the problem and identifying special cases which can be solved in polynomial time, we gain insight that we use to develop a novel heuristic for solving the general problem. We also evaluate minmax regret query optimisation experimentally, showing that it outperforms a currently employed strategy of optimisers that uses mean values for uncertain parameters.
Querying Temporal Drifts at Multiple Granularities BIBAFull-Text 1531-1540
  Sofia Kleisarchaki; Sihem Amer-Yahia; Ahlame Douzal-Chouakria; Vassilis Christophides
There exists a large body of work on online drift detection with the goal of dynamically finding and maintaining changes in data streams. In this paper, we adopt a query-based approach to drift detection. Our approach relies on a drift index, a structure that captures drift at different time granularities and enables flexible drift queries. We formalize different drift queries that represent real-world scenarios and develop query evaluation algorithms that use different materializations of the drift index as well as strategies for online index maintenance. We describe a thorough study of the performance of our algorithms on real-world and synthetic datasets with varying change rates.
Efficient Incremental Evaluation of Succinct Regular Expressions BIBAFull-Text 1541-1550
  Henrik Björklund; Wim Martens; Thomas Timm
Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher.
   We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.

Session 8B: Web Search

Struggling and Success in Web Search BIBAFull-Text 1551-1560
  Daan Odijk; Ryen W. White; Ahmed Hassan Awadallah; Susan T. Dumais
Web searchers sometimes struggle to find relevant information. Struggling leads to frustrating and dissatisfying search experiences, even if searchers ultimately meet their search objectives. Better understanding of search tasks where people struggle is important in improving search systems. We address this important issue using a mixed methods study using large-scale logs, crowd-sourced labeling, and predictive modeling. We analyze anonymized search logs from the Microsoft Bing Web search engine to characterize aspects of struggling searches and better explain the relationship between struggling and search success. To broaden our understanding of the struggling process beyond the behavioral signals in log data, we develop and utilize a crowd-sourced labeling methodology. We collect third-party judgments about why searchers appear to struggle and, if appropriate, where in the search task it became clear to the judges that searches would succeed (i.e., the pivotal query). We use our findings to propose ways in which systems can help searchers reduce struggling. Key components of such support are algorithms that accurately predict the nature of future actions and their anticipated impact on search outcomes. Our findings have implications for the design of search systems that help searchers struggle less and succeed more.
Behavioral Dynamics from the SERP's Perspective: What are Failed SERPs and How to Fix Them? BIBAFull-Text 1561-1570
  Julia Kiseleva; Jaap Kamps; Vadim Nikulin; Nikita Makarov
Web search is always in a state of flux: queries, their intent, and the most relevant content are changing over time, in predictable and unpredictable ways. Modern search technology has made great strides in keeping up to pace with these changes, but there remain cases of failure where the organic search results on the search engine result page (SERP) are outdated, and no relevant result is displayed. Failing SERPs due to temporal drift are one of the greatest frustrations of web searchers, leading to search abandonment or even search engine switch. Detecting failed SERPs timely and providing access to the desired out-of-SERP results has huge potential to improve user satisfaction. Our main findings are threefold: First, we refine the conceptual model of behavioral dynamics on the web by including the SERP and defining (un)successful SERPs in terms of observable behavior. Second, we analyse typical patterns of temporal change and propose models to predict query drift beyond the current SERP, and ways to adapt the SERP to include the desired results. Third, we conduct extensive experiments on real world search engine traffic demonstrating the viability of our approach. Our analysis of behavioral dynamics at the SERP level gives new insight in one of the primary causes of search failure due to temporal query intent drifts. Our overall conclusion is that the most detrimental cases in terms of (lack of) user satisfaction lead to the largest changes in information seeking behavior, and hence to observable changes in behavior we can exploit to detect failure, and moreover not only detect them but also resolve them.
What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries BIBAFull-Text 1571-1580
  Michael Völske; Pavel Braslavski; Matthias Hagen; Galina Lezina; Benno Stein
We analyze the question queries submitted to a large commercial web search engine to get insights about what people ask, and to better tailor the search results to the users' needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers' querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3-4% of the total search traffic.
   Since questions are such a small part of the query stream, and are more likely to be unique than shorter queries, click-through information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering platform (CQA) as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to be different to web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy.
   To show the scalability of our proposed method we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer.
Does Vertical Bring more Satisfaction?: Predicting Search Satisfaction in a Heterogeneous Environment BIBAFull-Text 1581-1590
  Ye Chen; Yiqun Liu; Ke Zhou; Meng Wang; Min Zhang; Shaoping Ma
The study of search satisfaction is one of the prime concerns in search performance evaluation research. Most existing works on search satisfaction primarily rely on the hypothesis that all results on search engine result pages (SERPs) are homogeneous. However, a variety of heterogeneous vertical results such as videos, images and instant answers are aggregated into SERPs by search engines to improve the diversity and quality of search results. In this paper, we carry out a lab-based user study with specifically designed SERPs to determine how verticals with different qualities and presentation styles affect search satisfaction. Users' satisfaction feedback and external assessors' satisfaction annotations are both collected to make a comparison regarding the perception of search satisfaction. Mouse click-through / movement data and eye movement information are also collected such that we can investigate the influence of vertical results from the perspectives of both benefit and cost. Finally, a vertical-aware learning-based prediction method is proposed to predict search satisfaction on aggregated SERPs. To the best of our knowledge, this paper is the first to analyze the effect of verticals on search satisfaction. The results show that verticals with different qualities, presentation styles and positions have different effects on search satisfaction, among which Encyclopedia verticals, as well as Download verticals, will bring the largest improvement. Furthermore, our proposed vertical-aware prediction method outperforms state-of-the-art methods that are designed for search satisfaction prediction in homogeneous environment.

Session 8C: Social Media 2

Characterizing and Predicting Viral-and-Popular Video Content BIBAFull-Text 1591-1600
  David Vallet; Shlomo Berkovsky; Sebastien Ardon; Anirban Mahanti; Mohamed Ali Kafaar
The proliferation of online video content has triggered numerous works on its evolution and popularity, as well as on the effect of social sharing on content propagation. In this paper, we focus on the observable dependencies between the virality of video content on a micro-blogging social network (in this case, Twitter) and the popularity of such content on a video distribution service (YouTube). To this end, we collected and analysed a corpus of Twitter posts containing links to YouTube clips and the corresponding video meta-data from YouTube. Our analysis highlights the unique properties of content that is both popular and viral, which allows such content to attract high number of views on YouTube and achieve fast propagation on Twitter. With this in mind, we proceed to the predictions of popular-and-viral clips and propose a framework that can, with high degree of accuracy and low amount of training data, predict videos that are likely to be popular, viral, and both. The key contribution of our work is the focus on cross-system dynamics between YouTube and Twitter. We conjecture and validate that cross-system prediction of both popularity and virality of videos is feasible, and can be performed with a reasonably high degree of accuracy. One of our key findings is that YouTube features capturing user engagement, have strong virality prediction capabilities. This findings allows to solely rely on data extracted from a video sharing service to predict popularity and virality aspects of videos.
Social Spammer and Spam Message Co-Detection in Microblogging with Social Context Regularization BIBAFull-Text 1601-1610
  Fangzhao Wu; Jinyun Shu; Yongfeng Huang; Zhigang Yuan
The popularity of microblogging platforms, such as Twitter, makes them important for information dissemination and sharing. However, they are also recognized as ideal places by spammers to conduct social spamming. Massive social spammers and spam messages heavily hurt the user experience and hinder the healthy development of microblogging systems. Thus, effectively detecting the social spammers and spam messages in microblogging is of great value. Existing studies mainly regard social spammer detection and spam message detection as two separate tasks. However, social spammers and spam messages have strong connections, since social spammers tend to post more spam messages and spam messages have high probabilities to be posted by social spammers. Combining social spammer detection with spam message detection has the potential to boost the performance of each task. In this paper, we propose a unified framework for social spammer and spam message co-detection in microblogging. Our framework utilizes the posting relations between users and messages to combine social spammer detection with spam message detection. In addition, we extract the social relations between users as well as the connections between messages, and incorporate them into our framework as regularization terms over the prediction results. Besides, we introduce an efficient optimization method to solve our framework. Extensive experiments on a real-world microblog dataset demonstrate that our framework can significantly and consistently improve the performance of both social spammer detection and spam message detection.
Central Topic Model for Event-oriented Topics Mining in Microblog Stream BIBAFull-Text 1611-1620
  Min Peng; Jiahui Zhu; Xuhui Li; Jiajia Huang; Hua Wang; Yanchun Zhang
To date, data generates and arrives in the form of stream to propagate discussions of public events in microblog services. Discovering event-oriented topics from the stream will lead to a better understanding of the change of public concern. However, as the massive scale of the data stream, traditional static topic models, such as LDA, are no longer fit for topic detection and tracking tasks. In this paper, we propose a central topic model (CenTM), where a Multi-view Clustering algorithm with Two-phase Random Walk (MC-TRW) is devised to aggregate the LDA's latent topics into central topics. Furthermore, we leverage the aggregation of central topics alternately with MC-TRW and sequential topic inference to improve the scalability in the stream fashion, so as to derive the dynamic central topic model (DCenTM). Specifically, our model is able to uncover the intrinsic characteristics of the central topics and predict the trend of their intensity along a life cycle. Experimental results demonstrate that the proposed central topic model is event-oriented and of high generalization, it therefore can dispose the topic trend prediction effectively and precisely in massive data stream.
Video Popularity Prediction by Sentiment Propagation via Implicit Network BIBAFull-Text 1621-1630
  Wanying Ding; Yue Shang; Lifan Guo; Xiaohua Hu; Rui Yan; Tingting He
Video popularity prediction plays a foundational role in many aspects of life, such as recommendation systems and investment consulting. Because of its technological and economic importance, this problem has been extensively studied for years. However, four constraints have limited most related works' usability. First, most feature oriented models are inadequate in the social media environment, because many videos are published with no specific content features, such as a strong cast or a famous script. Second, many studies assume that there is a linear correlation existing between view counts from early and later days, but this is not the case in every scenario. Third, numerous works just take view counts into consideration, but discount associated sentiments. Nevertheless, it is the public opinions that directly drive a video's final success/failure. Also, many related approaches rely on a network topology, but such topologies are unavailable in many situations. Here, we propose a Dual Sentimental Hawkes Process (DSHP) to cope with all the problems above. DSHP's innovations are reflected in three ways: (1) it breaks the "Linear Correlation" assumption, and implements Hawkes Process; (2) it reveals deeper factors that affect a video's popularity; and (3) it is topology free. We evaluate DSHP on four types of videos: Movies, TV Episodes, Music Videos, and Online News, and compare its performance against 6 widely used models, including Translation Model, Multiple Linear Regression, KNN Regression, ARMA, Reinforced Poisson Process, and Univariate Hawkes Process. Our model outperforms all of the others, which indicates a promising application prospect.

Session 8D: Recommendation

Joint Modeling of User Check-in Behaviors for Point-of-Interest Recommendation BIBAFull-Text 1631-1640
  Hongzhi Yin; Xiaofang Zhou; Yingxia Shao; Hao Wang; Shazia Sadiq
Point-of-Interest (POI) recommendation has become an important means to help people discover attractive and interesting locations, especially when users travel out of town. However, extreme sparsity of user-POI matrix creates a severe challenge. To cope with this challenge, a growing line of research has exploited the temporal effect, geographical-social influence, content effect and word-of-mouth effect. However, current research lacks an integrated analysis of the joint effect of the above factors to deal with the issue of data-sparsity, especially in the out-of-town recommendation scenario which has been ignored by most existing work.
   In light of the above, we propose a joint probabilistic generative model to mimic user check-in behaviors in a process of decision making, which strategically integrates the above factors to effectively overcome the data sparsity, especially for out-of-town users. To demonstrate the applicability and flexibility of our model, we investigate how it supports two recommendation scenarios in a unified way, i.e., home-town recommendation and out-of-town recommendation. We conduct extensive experiments to evaluate the performance of our model on two real large-scale datasets in terms of both recommendation effectiveness and efficiency, and the experimental results show its superiority over other competitors.
ORec: An Opinion-Based Point-of-Interest Recommendation Framework BIBAFull-Text 1641-1650
  Jia-Dong Zhang; Chi-Yin Chow; Yu Zheng
As location-based social networks (LBSNs) rapidly grow, it is a timely topic to study how to recommend users with interesting locations, known as points-of-interest (POIs). Most existing POI recommendation techniques only employ the check-in data of users in LBSNs to learn their preferences on POIs by assuming a user's check-in frequency to a POI explicitly reflects the level of her preference on the POI. However, in reality users usually visit POIs only once, so the users' check-ins may not be sufficient to derive their preferences using their check-in frequencies only. Actually, the preferences of users are exactly implied in their opinions in text-based tips commenting on POIs. In this paper, we propose an opinion-based POI recommendation framework called ORec to take full advantage of the user opinions on POIs expressed as tips. In ORec, there are two main challenges: (i) detecting the polarities of tips (positive, neutral or negative), and (ii) integrating them with check-in data including social links between users and geographical information of POIs. To address these two challenges, (1) we develop a supervised aspect-dependent approach to detect the polarity of a tip, and (2) we devise a method to fuse tip polarities with social links and geographical information into a unified POI recommendation framework. Finally, we conduct a comprehensive performance evaluation for ORec using two large-scale real data sets collected from Foursquare and Yelp. Experimental results show that ORec achieves significantly superior polarity detection and POI recommendation accuracy compared to other state-of-the-art polarity detection and POI recommendation techniques.
Toward Dual Roles of Users in Recommender Systems BIBAFull-Text 1651-1660
  Suhang Wang; Jiliang Tang; Huan Liu
Users usually play dual roles in real-world recommender systems. One is as a reviewer who writes reviews for items with rating scores, and the other is as a rater who rates the helpfulness scores of reviews. Traditional recommender systems mainly consider the reviewer role while not taking into account the rater role. However, the rater role allows users to express their opinions toward reviews about items; hence it may indirectly indicate their opinions about items, which could be complementary to the reviewer role. Since most real-world recommender systems provide convenient mechanisms for the rater role, recent studies show that typically there are much more helpfulness ratings from the rater role than item ratings from the reviewer role. Therefore, incorporating the rater role of users may have the potentials to mitigate the data sparsity and cold-start problems in traditional recommender systems. In this paper, we investigate how to exploit dual roles of users in recommender systems. In particular, we provide a principled way to exploit the rater role mathematically and propose a novel recommender system DualRec, which captures both the reviewer role and the rater role of users simultaneously for recommendation. Experimental results on two real world datasets demonstrate the effectiveness of the proposed framework, and further experiments are conducted to understand the importance of the rater role of users in recommendation.
TriRank: Review-aware Explainable Recommendation by Modeling Aspects BIBAFull-Text 1661-1670
  Xiangnan He; Tao Chen; Min-Yen Kan; Xiao Chen
Most existing collaborative filtering techniques have focused on modeling the binary relation of users to items by extracting from user ratings. Aside from users' ratings, their affiliated reviews often provide the rationale for their ratings and identify what aspects of the item they cared most about. We explore the rich evidence source of aspects in user reviews to improve top-N recommendation. By extracting aspects (i.e., the specific properties of items) from textual reviews, we enrich the user -- item binary relation to a user -- item -- aspect ternary relation. We model the ternary relation as a heterogeneous tripartite graph, casting the recommendation task as one of vertex ranking. We devise a generic algorithm for ranking on tripartite graphs -- TriRank -- and specialize it for personalized recommendation. Experiments on two public review datasets show that it consistently outperforms state-of-the-art methods. Most importantly, TriRank endows the recommender system with a higher degree of explainability and transparency by modeling aspects in reviews. It allows users to interact with the system through their aspect preferences, assisting users in making informed decisions.

Short Papers: Databases

RoadRank: Traffic Diffusion and Influence Estimation in Dynamic Urban Road Networks BIBAFull-Text 1671-1674
  Tarique Anwar; Chengfei Liu; Hai L. Vu; Md. Saiful Islam
With the rapidly growing population in urban areas, these days the urban road networks are expanding at a faster rate. The frequent movement of people on them leads to traffic congestions. These congestions originate from some crowded road segments, and diffuse towards other parts of the urban road networks creating further congestions. This behavior of road networks motivates the need to understand the influence of individual road segments on others in terms of congestion. In this work, we propose RoadRank, an algorithm to compute the influence scores of each road segment in an urban road network, and rank them based on their overall influence. It is an incremental algorithm that keeps on updating the influence scores with time, by feeding with the latest traffic data at each time point. The method starts with constructing a directed graph called influence graph, which is then used to iteratively compute the influence scores using probabilistic diffusion theory. We show promising preliminary experimental results on real SCATS traffic data of Melbourne.
On Query-Update Independence for SPARQL BIBAFull-Text 1675-1678
  Nicola Guido; Pierre Genevès; Nabil Layaïda; Cécile Roisin
This paper investigates techniques for detecting independence of SPARQL queries from updates. A query is independent of an update when the execution of the update does not affect the result of the query. Determining independence is especially useful in the context of huge RDF repositories, where it permits to avoid expensive yet useless re-evaluation of queries. While this problem has been intensively studied for fragments of relational calculus, very few works exist for the standard query language for the semantic web. We report on our investigations on how a notion of independence can be defined in the SPARQL context.
A Structured Query Model for the Deep Relational Web BIBAFull-Text 1679-1682
  Hasan M. Jamil; Hosagrahar V. Jagadish
The deep web is very large and diverse and queries evaluated against the deep web can provide great value. While there have been attempts at accessing the data in the deep web, these are clever "one-of'' systems and techniques. In this paper, we describe an ongoing research of a generic structured query model that can be used against the deep web. Using this query model, the contributions of a community of researchers can be combined freely, leading to a system that can be improved incrementally each time someone develops a specific novel technique to improve a particular operator.
A Flash-aware Buffering Scheme using On-the-fly Redo BIBAFull-Text 1683-1686
  Kyosung Jeong; Sang-Wook Kim; Sungchae Lim
In this paper, we address how to reduce the amount of page updates in flash-based DBMS equipped with SSD (Solid State Drive). We propose a novel buffering scheme that evicts a dirty page X without flushing it into SSD, and restores the right image of X when X is requested for later access. The restoration of X having previous flushing-less eviction is performed through our online redo actions on X. We call this page-restoring online redo the on-the-fly redo. Although our on-the-fly redo mechanism has some overhead of increasing the number of page reads, this can be compensated by infrequent page updates. Additionally, since the proposed buffering scheme with the on-the-fly redo can easily support the no-steal policy in buffer management, we can enjoy the advantages of smaller logging overhead and faster recovery. Through the TPC-C benchmarks using a Berkeley DB, we show that our scheme shortens the transaction processing times by up to 53%.
Defragging Subgraph Features for Graph Classification BIBAFull-Text 1687-1690
  Haishuai Wang; Peng Zhang; Ivor Tsang; Ling Chen; Chengqi Zhang
Graph classification is an important tool for analysing structured and semi-structured data, where subgraphs are commonly used as the feature representation. However, the number and size of subgraph features crucially depend on the threshold parameters of frequent subgraph mining algorithms. Any improper setting of the parameters will generate many trivial short-pattern subgraph fragments which dominate the feature space, distort graph classifiers and bury interesting long-pattern subgraphs. In this paper, we propose a new Subgraph Join Feature Selection (SJFS) algorithm. The SJFS algorithm, by forcing graph classifiers to join short-pattern subgraph fragments, can defrag trivial subgraph features and deliver long-pattern interesting subgraphs. Experimental results on both synthetic and real-world social network graph data demonstrate the performance of the proposed method.
Structural Constraints for Multipartite Entity Resolution with Markov Logic Network BIBAFull-Text 1691-1694
  Tengyuan Ye; Hady W. Lauw
Multipartite entity resolution seeks to match entity mentions across several collections. An entity mention is presumed unique within a collection, and thus could match at most one entity mention in each of the other collections. In addition to domain-specific features considered in entity resolution, there are a number of domain-invariant structural constraints that apply in this scenario, including one-to-one assignment as well as cross-collection transitivity. We propose a principled solution to the multipartite entity resolution problem, building on the foundation of Markov Logic Network (MLN) that combines probabilistic graphical model and first-order logic. We describe how the domain-invariant structural constraints could be expressed appropriately in terms of Markov logic, flexibly allowing joint modeling with domain-specific features. Experiments on two real-life datasets, each spanning four collections, show the utility of this approach and validate the contributions of various MLN components.

Short Papers: Information Retrieval

Know Your Onions: Understanding the User Experience with the Knowledge Module in Web Search BIBAFull-Text 1695-1698
  Ioannis Arapakis; Luis A. Leiva; B. Barla Cambazoglu
The increasing availability of large volumes of human-curated content is shifting web search towards a paradigm that introduces seamlessly more semantic information to search engine result pages. This trend has resulted in the design of a new element known as the knowledge module (KM) where certain facts about named entities, obtained from various knowledge bases, are shown to users. So far, little has been done to uncover the role that this module plays on user experience in web search and whether it is perceived by users as a useful aid for their search tasks. Our work is an early attempt to bridge this gap. To this end, we conducted a crowdsourcing study aimed at understanding the effect of the KM on users' search experience and its overall utility. In particular, our study is the first to provide insights about the noticeability and usefulness of the KM in web search, together with comprehensive analyses of usability and workload.
Personalized Federated Search at LinkedIn BIBAFull-Text 1699-1702
  Dhruv Arya; Viet Ha-Thuc; Shakti Sinha
LinkedIn has grown to become a platform hosting diverse sources of information ranging from member profiles, jobs, professional groups, slideshows etc. Given the existence of multiple sources, when a member issues a query like "software engineer", the member could look for software engineer profiles, jobs or professional groups. To tackle this problem, we exploit a data-driven approach that extracts searcher intents from their profile data and recent activities at a large scale. The intents such as job seeking, hiring, content consuming are used to construct features to personalize federated search experience. We tested the approach on the LinkedIn homepage and A/B tests show significant improvements in member engagement. As of writing this paper, the approach powers all of federated search on LinkedIn homepage.
Balancing Exploration and Exploitation: Empirical Parameterization of Exploratory Search Systems BIBAFull-Text 1703-1706
  Kumaripaba Ahukorala; Alan Medlar; Kalle Ilves; Dorota Glowacka
Exploratory searches are where a user has insufficient knowledge to define exact search criteria or does not otherwise know what they are looking for. Reinforcement learning techniques have demonstrated great potential for supporting exploratory search in information retrieval systems as they allow the system to trade-off exploration (presenting the user with alternatives topics) and exploitation (moving toward more specific topics). Users of such systems, however, often feel that the system is not responsive to user needs. This problem is not an inherent feature of such systems, but is caused by the exploration rate parameter being inappropriately tuned for a given system, dataset or user.
   We present a user study to analyze how different exploration rates affect search performance, user satisfaction, and the number of documents selected. We show that the tradeoff between exploration and exploitation can be modelled as a direct relationship between the exploration rate parameter from the reinforcement learning algorithm and the number of relevant documents returned to the user over the course of a search session. We define the optimal exploration/exploitation trade-off as where this relationship is maximised and show this point to be broadly concordant with user satisfaction and performance.
On Predicting Deletions of Microblog Posts BIBAFull-Text 1707-1710
  Mossaab Bagdouri; Douglas W. Oard
Among the many classification tasks on Twitter content, predicting whether a tweet will be deleted has to date received relatively little attention. Deletions occur for a variety of reasons, which can make the classification task challenging. Moreover, deletion prediction might serve different goals, the characteristics of which should be reflected in the evaluation design. This paper addresses the problem of deletion prediction by analyzing the distribution of deleted tweets, presenting a new evaluation framework, exploring tweet-based and user-based features, and reporting prediction scores.
Semi-Automated Text Classification for Sensitivity Identification BIBAFull-Text 1711-1714
  Giacomo Berardi; Andrea Esuli; Craig Macdonald; Iadh Ounis; Fabrizio Sebastiani
Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
Identification of Microblogs Prominent Users during Events by Learning Temporal Sequences of Features BIBAFull-Text 1715-1718
  Imen Bizid; Nibal Nayef; Patrice Boursier; Sami Faiz; Antoine Doucet
During specific real-world events, some users of microblogging platforms could provide exclusive information about those events. The identification of such prominent users depends on several factors such as the freshness and the relevance of their shared information. This work proposes a probabilistic model for the identification of prominent users in microblogs during specific events. The model is based on learning and classifying user behavior over time using Mixture of Gaussians Hidden Markov Models. A user is characterized by a temporal sequence of feature vectors describing his activities. The features computed at each time-stamp are designed to reflect both the on- and off-topic activities of users, and they are computationally feasible in real-time.
   To validate the efficacy of our proposed model, we have conducted experiments on data collected from Twitter during the Herault floods that have occurred in France. The achieved results show that learning the time-series of users' actions is better than learning just those actions without temporal information.
A Real-Time Eye Tracking Based Query Expansion Approach via Latent Topic Modeling BIBAFull-Text 1719-1722
  Yongqiang Chen; Peng Zhang; Dawei Song; Benyou Wang
Formulating and reformulating reliable textual queries have been recognized as a challenging task in Information Retrieval (IR), even for experienced users. Most existing query expansion methods, especially those based on implicit relevance feedback, utilize the user's historical interaction data, such as clicks, scrolling and viewing time on documents, to derive a refined query model. It is further expected that the user's search experience would be largely improved if we could dig out user's latent query intention, in real-time, by capturing the user's current interaction at the term level directly. In this paper, we propose a real-time eye tracking based query expansion method, which is able to: (1) automatically capture the terms that the user is viewing by utilizing eye tracking techniques; (2) derive the user's latent intent based on the eye tracking terms and by using the Latent Dirichlet Allocation (LDA) approach. A systematic user study has been carried out and the experimental results demonstrate the effectiveness of our proposed methods.
Clustered Semi-Supervised Relevance Feedback BIBAFull-Text 1723-1726
  Kripabandhu Ghosh; Swapan Kumar Parui
In relevance feedback, first-round search results are used to boost second-round search results. Two forms have been traditionally considered: exhaustively labelled feedback, where all first-round results to depth k are annotated for relevance by the user; and blind feedback, where the top-k results are all assumed to be relevant. In this paper, we consider an intermediate, semi-supervised scheme, in which only a subset of results is selected for annotation, and then their labels are propagated to their nearest neighbours. Specifically, we use clustering to determine the nearest-neighbour groups, and seed selection to choose documents for annotation. We find that the effectiveness of this method is indistinguishable from the exhaustive relevance feedback, and is significantly higher than both blind feedback and the use of the annotated subset alone. We show that this approach works well in environments in which some but limited amounts of human feedback are available, such as early case assessment in e-discovery.
On the Effect of "Stupid" Search Components on User Interaction with Search Engines BIBAFull-Text 1727-1730
  Lidia Grauer; Aleksandra Lomakina
Using eye-tracking, we investigate how searchers interact with Web search engines which get affected by nonsensical results. We conduct a user survey to choose "stupid" components for our laboratory experiment and explore the most conspicuous ones.
   This research provides insights about searchers' interactions with different kinds of "stupid" search components, such as organic search results, vertical results, ads and automatic misspell correction. We investigate the influence of each class of "stupid" components on users' attitude to a search engine. We found that sticking in memory of the impression about the "stupidity" of the search engine depended on whether the users were finally satisfied with their searches, or did not find the answer. Experimental results show that classes of "stupid" components can be differentiated by their influence on users' attitude. The most negative impression is caused by word losses, word collocation breaks and inappropriate misspell corrections.
Social-Relational Topic Model for Social Networks BIBAFull-Text 1731-1734
  Weiyu Guo; Shu Wu; Liang Wang; Tieniu Tan
Social networking services, such as Twitter and Sina Weibo, have tremendous popularity in recent years. Mass of short texts and social links are aggregated into these service platforms. To realize personalized services on social network, topic inference from both short texts and social links plays more and more important role. Most conventional topic modeling methods focus on analyzing formal texts, e.g., papers, news and blogs, and usually assume that the links are only generated by topical factors. As a result, on social network, the learned topics of these methods are usually affected by topic-irrelevant links. Recently, a few approaches use artificial priors to recognize the links generated by the popularity factor in topic modeling. However, employing global priors, these methods can not well capture the distinct properties of each link and still suffer from the effect of topic-irrelevant links. To address the above limitations, we propose a novel Social-Relational Topic Model (SRTM), which can alleviate the effect of topic-irrelevant links by analyzing relational users' topics of each link. SRTM jointly models texts and social links for learning the topic distribution and topical influence of each user. The experimental results show that, our model outperforms the state-of-the-arts in topic modeling and social link prediction.
Building Effective Query Classifiers: A Case Study in Self-harm Intent Detection BIBAFull-Text 1735-1738
  Ashiqur R. KhudaBukhsh; Paul N. Bennett; Ryen W. White
Query-based triggers play a crucial role in modern search systems, e.g., in deciding when to display direct answers on result pages. We address a common scenario in designing such triggers for real-world settings where positives are rare and search providers possess only a small seed set of positive examples to learn query classification models. We choose the critical domain of self-harm intent detection to demonstrate how such small seed sets can be expanded to create meaningful training data with a sizable fraction of positive examples. Our results show that with our method, substantially more positive queries can be found compared to plain random sampling. Additionally, we explored the effectiveness of traditional active learning approaches on classification performance and found that maximum uncertainty performs the best among several other techniques that we considered.
Modelling the Usefulness of Document Collections for Query Expansion in Patient Search BIBAFull-Text 1739-1742
  Nut Limsopatham; Craig Macdonald; Iadh Ounis
Dealing with the medical terminology is a challenge when searching for patients based on the relevance of their medical records towards a given query. Existing work used query expansion (QE) to extract expansion terms from different document collections to improve query representation. However, the usefulness of particular document collections for QE was not measured and taken into account during retrieval. In this work, we investigate two automatic approaches that measure and leverage the usefulness of document collections when exploiting multiple document collections to improve query representation. These two approaches are based on resource selection and learning to rank techniques, respectively. We evaluate our approaches using the TREC Medical Records track's test collection. Our results show the potential of the proposed approaches, since they can effectively exploit 14 different document collections, including both domain-specific (e.g. MEDLINE abstracts) and generic (e.g. blogs and webpages) collections, and significantly outperform existing effective baselines, including the best systems participating at the TREC Medical Records track. Our analysis shows that the different collections are not equally useful for QE, while our two approaches can automatically weight the usefulness of expansion terms extracted from different document collections effectively.
A Convolutional Click Prediction Model BIBAFull-Text 1743-1746
  Qiang Liu; Feng Yu; Shu Wu; Liang Wang
The explosion in online advertisement urges to better estimate the click prediction of ads. For click prediction on single ad impression, we have access to pairwise relevance among elements in an impression, but not to global interaction among key features of elements. Moreover, the existing method on sequential click prediction treats propagation unchangeable for different time intervals. In this work, we propose a novel model, Convolutional Click Prediction Model (CCPM), based on convolution neural network. CCPM can extract local-global key features from an input instance with varied elements, which can be implemented for not only single ad impression but also sequential ad impression. Experiment results on two public large-scale datasets indicate that CCPM is effective on click prediction.
A Study of Query Length Heuristics in Information Retrieval BIBAFull-Text 1747-1750
  Yuanhua Lv
Query length has generally been regarded as a query-specific constant that does not affect document ranking. In this paper, we reveal that query length actually interacts with term frequency (TF) normalization, a key component of all effective retrieval models. Specifically, the longer the query is, the smaller the TF decay speed should be. In order to study the impact of query length, we present a desirable formal constraint to capture the heuristic of query length for retrieval. Our constraint analysis shows that current state-of-the-art retrieval functions, including BM25 and language models, fail to satisfy the constraint, and that, in order to solve this problem, the TF normalization component in a retrieval function should be adapted to query length. As an application, we develop a simple regression algorithm to adapt BM25 to query length, and demonstrate its effectiveness on several representative TREC collections.
Detect Rumors Using Time Series of Social Context Information on Microblogging Websites BIBAFull-Text 1751-1754
  Jing Ma; Wei Gao; Zhongyu Wei; Yueming Lu; Kam-Fai Wong
Automatically identifying rumors from online social media especially microblogging websites is an important research issue. Most of existing work for rumor detection focuses on modeling features related to microblog contents, users and propagation patterns, but ignore the importance of the variation of these social context features during the message propagation over time. In this study, we propose a novel approach to capture the temporal characteristics of these features based on the time series of rumor's lifecycle, for which time series modeling technique is applied to incorporate various social context information. Our experiments using the events in two microblog datasets confirm that the method outperforms state-of-the-art rumor detection approaches by large margins. Moreover, our model demonstrates strong performance on detecting rumors at early stage after their initial broadcast.
Query Auto-Completion for Rare Prefixes BIBAFull-Text 1755-1758
  Bhaskar Mitra; Nick Craswell
Query auto-completion (QAC) systems typically suggest queries that have previously been observed in search logs. Given a partial user query, the system looks up this query prefix against a precomputed set of candidates, then orders them using ranking signals such as popularity. Such systems can only recommend queries for prefixes that have been previously seen by the search engine with adequate frequency. They fail to recommend if the prefix is sufficiently rare such that it has no matches in the precomputed candidate set.
   We propose a design of a QAC system that can suggest completions for rare query prefixes. In particular, we describe a candidate generation approach using frequently observed query suffixes mined from historical search logs. We then describe a supervised model for ranking these synthetic suggestions alongside the traditional full-query candidates. We further explore ranking signals that are appropriate for both types of candidates based on n-gram statistics and a convolutional latent semantic model (CLSM). Within our supervised framework the new features demonstrate significant improvements in performance over the popularity-based baseline. The synthetic query suggestions complement the existing popularity-based approach, helping users formulate rare queries.
Pooled Evaluation Over Query Variations: Users are as Diverse as Systems BIBAFull-Text 1759-1762
  Alistair Moffat; Falk Scholer; Paul Thomas; Peter Bailey
Evaluation of information retrieval systems with test collections makes use of a suite of fixed resources: a document corpus; a set of topics; and associated judgments of the relevance of each document to each topic. With large modern collections, exhaustive judging is not feasible. Therefore an approach called pooling is typically used where, for example, the documents to be judged can be determined by taking the union of all documents returned in the top positions of the answer lists returned by a range of systems. Conventionally, pooling uses system variations to provide diverse documents to be judged for a topic; different user queries are not considered. We explore the ramifications of user query variability on pooling, and demonstrate that conventional test collections do not cover this source of variation. The effect of user query variation on the size of the judging pool is just as strong as the effect of retrieval system variation. We conclude that user query variation should be incorporated early in test collection construction, and cannot be considered effectively post hoc.
The Influence of Pre-processing on the Estimation of Readability of Web Documents BIBAFull-Text 1763-1766
  João Rafael de Moura Palotti; Guido Zuccon; Allan Hanbury
This paper investigates the effect that text pre-processing approaches have on the estimation of the readability of web pages. Readability has been highlighted as an important aspect of web search result personalisation in previous work. The most widely used text readability measures rely on surface level characteristics of text, such as the length of words and sentences. We demonstrate that different tools for extracting text from web pages lead to very different estimations of readability. This has an important implication for search engines because search result personalisation strategies that consider users reading ability may fail if incorrect text readability estimations are computed.
Atypical Queries in eCommerce BIBAFull-Text 1767-1770
  Neeraj Pradhan; Vinay Deolalikar; Kang Li
Understanding how specific, ambiguous, or broad the intent of a search query is, across all users of the system, is important in improving search relevance in eCommerce. There is scant literature on such a structural characterization of queries in eCommerce. In this paper, we use query-click log data to address the problem of identifying "atypical queries": these are queries that are extremal in terms of specificity, ambiguity, or breadth of intent. We isolate three components of atypicality: geometric, statistical, and topological. We demonstrate, using query-click logs at Groupon, that certain combinations of these properties render a query atypical, and discuss how search analysts treat such queries differently. Our work is being used to improve search relevance at Groupon.
Bottom-up Faceted Search: Creating Search Neighbourhoods with Datacube Cells BIBAFull-Text 1771-1774
  Mark Sifer
Browsing a collection can start with a keyword search. A user visits a library, performs a keyword search to find a few books of interest; finding their location in the library. Then they go to these locations; the corresponding bookshelves, where they do not just retrieve the found books, but rather they start browsing the nearby books; the books which have a similar Dewey classification. This paper extends this approach to curated corpora that contain items or documents that have been classified in multiple dimensions (facets), where each dimension classification may be a hierarchy. In particular (i) a technique for determining near items based on OLAP datacube cells and (ii) user interfaces that support browsing of near items are presented.
Personalized Recommendation Meets Your Next Favorite BIBAFull-Text 1775-1778
  Qiang Song; Jian Cheng; Ting Yuan; Hanqing Lu
A comprehensive understanding of user's item selection behavior is not only essential to many scientific disciplines, but also has a profound business impact on online recommendation. Recent researches have discovered that user's favorites can be divided into 2 categories: long-term and short-term. User's item selection behavior is a mixed decision of her long and short-term favorites. In this paper, we propose a unified model, namely States Transition pAir-wise Ranking Model (STAR), to address users' favorites mining for sequential-set recommendation. Our method utilizes a transition graph for collaborative filtering that accounts for mining user's short-term favorites, jointed with a generative topic model for expressing user's long-term favorites. Furthermore, a user's specific prior is introduced into our unified model for better modeling personalization. Technically, we develop a pair-wise ranking loss function for parameters learning. Empirically, we measure the effectiveness of our method using two real-world datasets and the results show that our method outperforms state-of-the-art methods.
Recommending Short-lived Dynamic Packages for Golf Booking Services BIBAFull-Text 1779-1782
  Robin Swezey; Young-joo Chung
We introduce an approach to recommending short-lived dynamic packages for golf booking services. Two challenges are addressed in this work. The first is the short life of the items, which puts the system in a state of a permanent cold start. The second is the uninformative nature of the package attributes, which makes clustering or figuring latent packages challenging. Although such settings are fairly pervasive, they have not been studied in traditional recommendation research, and there is thus a call for original approaches for recommender systems. In this paper, we introduce a hybrid method that leverages user analysis and its relation to the packages, as well as package pricing and environmental analysis, and traditional collaborative filtering. The proposed approach achieved appreciable improvement in precision compared with baselines.
Large-Scale Question Answering with Joint Embedding and Proof Tree Decoding BIBAFull-Text 1783-1786
  Zhenghao Wang; Shengquan Yan; Huaming Wang; Xuedong Huang
Question answering (QA) over a large-scale knowledge base (KB) such as Freebase is an important natural language processing application. There are linguistically oriented semantic parsing techniques and machine learning motivated statistical methods. Both of these approaches face a key challenge on how to handle diverse ways natural questions can be expressed about predicates and entities in the KB. This paper is to investigate how to combine these two approaches. We frame the problem from a proof-theoretic perspective, and formulate it as a proof tree search problem that seamlessly unifies semantic parsing, logic reasoning, and answer ranking. We combine our word entity joint embedding learned from web-scale data with other surface-form features to further boost accuracy improvements. Our real-time system on the Freebase QA task achieved a very high F1 score (47.2) on the standard Stanford WebQuestions benchmark test data.
Query Length, Retrievability Bias and Performance BIBAFull-Text 1787-1790
  Colin Wilkie; Leif Azzopardi
Past work has shown that longer queries tend to lead to better retrieval performance. However, this comes at the cost of increased user effort effort and additional system processing. In this paper, we examine whether there are benefits of longer queries beyond performance. We posit that increasing the query length will also lead to a reduction in the retrievability bias. Additionally, we speculate that to minimise retrievability bias as queries become longer, more length normalisation must be applied to account for the increase in the length of documents retrieved. To this end, we perform a retrievability analysis on two TREC collections using three standard retrieval models and various lengths of queries (one to five terms). From this investigation we find that increasing the length of queries reduces the overall retrievability bias but at a decreasing rate. Moreover, once the query length exceeds three terms the bias can begin to increase (and the performance can start to drop). We also observe that more document length normalisation is typically required as query length increases, in order to minimise bias. Finally, we show that there is a strong correlation between performance and retrieval bias. This work raises some interesting questions regarding query length and its affect on performance and bias. Further work will be directed towards examining longer and more verbose queries, including those generated via query expansion methods, to obtain a more comprehensive understanding of the relationship between query length, performance and retrievability bias.
Gauging Correct Relative Rankings For Similarity Search BIBAFull-Text 1791-1794
  Weiren Yu; Julie McCann
One of the important tasks in link analysis is to quantify the similarity between two objects based on hyperlink structure. SimRank is an attractive similarity measure of this type. Existing work mainly focuses on absolute SimRank scores, and often harnesses an iterative paradigm to compute them. While these iterative scores converge to exact ones with the increasing number of iterations, it is still notoriously difficult to determine how well the relative orders of these iterative scores can be preserved for a given iteration. In this paper, we propose efficient ranking criteria that can secure correct relative orders of node-pairs with respect to SimRank scores when they are computed in an iterative fashion. Moreover, we show the superiority of our criteria in harvesting top-K SimRank scores and bucket orders from a full ranking list. Finally, viable empirical studies verify the usefulness of our techniques for SimRank top-K ranking and bucket ordering.
Learning User Preferences for Topically Similar Documents BIBAFull-Text 1795-1798
  Mustafa Zengin; Ben Carterette
Similarity measures have been used widely in information retrieval research. Most research has been done on query-document or document-document similarity without much attention to the user's perception of similarity in the context of the information need. In this study, we collect user preference judgements of web document similarity in order to investigate: (1) the correlation between similarity measures and users' perception of similarity, (2) the correlation between the web document features plus document-query features and users' similarity judgements. We analyze the performance of various similarity methods at predicting user preferences, in both unsupervised and supervised settings. We show that a supervised approach using many features is able to predict user preferences close to the level of agreement between users, and moreover achieve a 15% improvement in AUC over an unsupervised approach.
Modeling Parameter Interactions in Ranking SVM BIBAFull-Text 1799-1802
  Yaogong Zhang; Jun Xu; Yanyan Lan; Jiafeng Guo; Maoqiang Xie; Yalou Huang; Xueqi Cheng
Ranking SVM, which formalizes the problem of learning a ranking model as that of learning a binary SVM on preference pairs of documents, is a state-of-the-art ranking model in information retrieval. The dual form solution of Ranking SVM model can be written as a linear combination of the preference pairs, i.e., w = Σ(i,j) αij (xi -- xj), where αij denotes the Lagrange parameters associated with each pair (i,j). It is obvious that there exist significant interactions over the document pairs because two preference pairs could share a same document as their items. Thus it is natural to ask if there also exist interactions over the model parameters αij, which we may leverage to propose better ranking model. This paper aims to answer the question. Firstly, we found that there exists a low-rank structure over the Ranking SVM model parameters αij, which indicates that the interactions do exist. Then, based on the discovery, we made a modification on the original Ranking SVM model by explicitly applying a low-rank constraint to the parameters. Specifically, each parameter αij is decomposed as a product of two low-dimensional vectors, i.e., αij = vi, vj, where vectors vi and vj correspond to document i and j, respectively. The learning process, thus, becomes to optimize the modified dual form objective function with respect to the low-dimensional vectors. Experimental results on three LETOR datasets show that our method, referred to as Factorized Ranking SVM, can outperform state-of-the-art baselines including the conventional Ranking SVM.

Short Papers: Knowledge Management

Best First Over-Sampling for Multilabel Classification BIBAFull-Text 1803-1806
  Xusheng Ai; Jian Wu; Victor S. Sheng; Yufeng Yao; Pengpeng Zhao; Zhiming Cui
Learning from imbalanced multilabel data is a challenging task. It has attracted considerable attention recently. In this paper we propose a MultiLabel Best First Over-sampling (ML-BFO) to improve the performance of multilabel classification algorithms, based on imbalance minimization and Wilson's ENN rule. Our experimental results show that ML-BFO not only duplicates fewer samples but also reduces the imbalance level much more than two state-of-the-art multilabel sampling methods, i.e., an over-sampling method LP-ROS and an under-sampling method MLeNN. Besides, ML-BFO significantly improves the performance of multilabel classification algorithms, and performs much better than LP-ROS and MLeNN.
Co-clustering Document-term Matrices by Direct Maximization of Graph Modularity BIBAFull-Text 1807-1810
  Melissa Ailem; François Role; Mohamed Nadif
We present Coclus, a novel diagonal co-clustering algorithm which is able to effectively co-cluster binary or contingency matrices by directly maximizing an adapted version of the modularity measure traditionally used for networks. While some effective co-clustering algorithms already exist that use network-related measures (normalized cut, modularity), they do so by using spectral relaxations of the discrete optimization problems. In contrast, Coclus allows to get even better co-clusters by directly maximizing modularity using an iterative alternating optimization procedure. Extensive comparative experiments performed on various document-term datasets demonstrate that our algorithm is very effective, stable and outperforms other co-clustering algorithms.
A Data-Driven Approach to Distinguish Cyber-Attacks from Physical Faults in a Smart Grid BIBAFull-Text 1811-1814
  Adnan Anwar; Abdun Naser Mahmood; Zubair Shah
Recently, there has been significant increase in interest on Smart Grid security. Researchers have proposed various techniques to detect cyber-attacks using sensor data. However, there has been little work to distinguish a cyber-attack from a power system physical fault. A serious operational failure in physical power grid may occur from the mitigation strategies if fault is wrongly classified as a cyber-attack or vice-versa. In this paper, we utilize a data-driven approach to accurately differentiate the physical faults from cyber-attacks. First, we create a realistic dataset by generating different types of faults and cyber-attacks on the IEEE 30 bus benchmark test system. With extensive experiments, we observe that most of the established supervised methods perform poorly for the classification of faults and cyber-attacks specially for the practical datasets. Hence, we provide a data-driven approach where labelled data are projected in a new low-dimensional subspace using Principal Component Analysis (PCA). Next, Sequential Minimal Optimization (SMO) based Support Vectors are trained using the new projection of the original dataset. With both simulated and practical datasets, we have observed that the proposed classification method outperforms other existing popular supervised classification approaches considering the cyber-attack and fault datasets.
Improving Event Detection by Automatically Assessing Validity of Event Occurrence in Text BIBAFull-Text 1815-1818
  Andrea Ceroni; Ujwal Kumar Gadiraju; Marco Fisichella
Manually inspecting text to assess whether an event occurs in a document collection is an onerous and time consuming task. Although a manual inspection to discard the false events would increase the precision of automatically detected sets of events, it is not a scalable approach. In this paper, we automatize event validation, defined as the task of determining whether a given event occurs in a given document or corpus. The introduction of automatic event validation as a post-processing step of event detection can boost the precision of the detected event set, discarding false events and preserving the true ones. We propose a novel automatic method for event validation, which relies on a supervised model to predict the occurrence of events in a non-annotated corpus. The data for training the model is gathered by exploiting the crowdsourcing paradigm. Experiments on real-world events and documents show that our proposed method (i) outperforms the state-of-the-art event validation approach and (ii) increases the precision of event detection while preserving recall.
DAAV: Dynamic API Authority Vectors for Detecting Software Theft BIBAFull-Text 1819-1822
  Dong-Kyu Chae; Sang-Wook Kim; Seong-Je Cho; Yesol Kim
This paper proposes a novel birthmark, a dynamic API authority vector (DAAV), for detecting software theft. DAAV satisfies four essential requirements for good birthmarks -- credibility, resiliency, scalability, and packing-free -- while existing birthmarks fail to satisfy all of them together. In particular, existing static birthmarks are unable to handle the packed programs and existing dynamic birthmarks do not satisfy credibility and resiliency. Our experimental results demonstrate that DAAV provides satisfying credibility and resiliency compared with existing dynamic birthmarks and also can cover packed programs.
Towards Multi-level Provenance Reconstruction of Information Diffusion on Social Media BIBAFull-Text 1823-1826
  Tom De Nies; Io Taxidou; Anastasia Dimou; Ruben Verborgh; Peter M. Fischer; Erik Mannens; Rik Van de Walle
In order to assess the trustworthiness of information on social media, a consumer needs to understand where this information comes from, and which processes were involved in its creation. The entities, agents and activities involved in the creation of a piece of information are referred to as its provenance, which was standardized by W3C PROV. However, current social media APIs cannot always capture the full lineage of every message, leaving the consumer with incomplete or missing provenance, which is crucial for judging the trust it carries. Therefore in this paper, we propose an approach to reconstruct the provenance of messages on social media on multiple levels. To obtain a fine-grained level of provenance, we use an approach from prior work to reconstruct information cascades with high certainty, and map them to PROV using the PROV-SAID extension for social media. To obtain a coarse-grained level of provenance, we adapt our similarity-based, fuzzy provenance reconstruction approach -- previously applied on news. We illustrate the power of the combination by providing the reconstructed provenance of a limited social media dataset gathered during the 2012 Olympics, for which we were able to reconstruct a significant amount of previously unidentified connections.
Profiling Pedestrian Distribution and Anomaly Detection in a Dynamic Environment BIBAFull-Text 1827-1830
  Minh Tuan Doan; Sutharshan Rajasegarar; Mahsa Salehi; Masud Moshtaghi; Christopher Leckie
Pedestrians movements have a major impact on the dynamics of cities and provide valuable guidance to city planners. In this paper we model the normal behaviours of pedestrian flows and detect anomalous events from pedestrian counting data of the City of Melbourne. Since the data spans an extended period, and pedestrian activities can change intermittently (e.g., activities in winter vs. summer), we applied an Ensemble Switching Model, which is a dynamic anomaly detection technique that can accommodate systems that switch between different states. The results are compared with those produced by a static clustering model (HyCARCE) and also cross-validated with known events. We found that the results from the Ensemble Switching Model are valid and more accurate than HyCARCE.
A Clustering-based Approach to Detect Probable Outcomes of Lawsuits BIBAFull-Text 1831-1834
  Daniel Lemes Gribel; Maira Gatti de Bayser; Leonardo Guerreiro Azevedo
The numerous lawsuits in progress or already judged by the Brazilian Supreme Court consists of a large amount of non-structured data. This leads to a large number of hidden or unknown information, since some relationships between lawsuits are not explicit in the available data; and contributes to generate non-intuitive influences between variables, which in addition increases the degree of uncertainty on judicial outcomes. This work proposes an approach to identify possible judgment outcomes that considers the use of similarity calculations and clustering mechanisms based on lawsuits patterns. The similarity problem was tackled by analysing metadata manually extracted from lawsuits; and this work also presents an approach to detect clusters and to compile past votes. From the results, it is possible to verify lawsuits most likely outcomes and to detect their degree of uncertainty.
Detecting Check-worthy Factual Claims in Presidential Debates BIBAFull-Text 1835-1838
  Naeemul Hassan; Chengkai Li; Mark Tremayne
Public figures such as politicians make claims about "facts" all the time. Journalists and citizens spend a good amount of time checking the veracity of such claims. Toward automatic fact checking, we developed tools to find check-worthy factual claims from natural language sentences. Specifically, we prepared a U.S. presidential debate dataset and built classification models to distinguish check-worthy factual claims from non-factual claims and unimportant factual claims. We also identified the most-effective features based on their impact on the classification models' accuracy.
Where You Go Reveals Who You Know: Analyzing Social Ties from Millions of Footprints BIBAFull-Text 1839-1842
  Hsun-Ping Hsieh; Rui Yan; Cheng-Te Li
This paper aims to investigate how the geographical footprints of users correlate to their social ties. While conventional wisdom told us that the more frequently two users co-locate in geography, the higher probability they are friends, we find that in real geo-social data, Gowalla and Meetup, almost all of the user pairs with friendships had never met geographically. In this sense, can we discover social ties among users purely using their geographical footprints even if they never met? To study this question, we develop a two-stage feature engineering framework. The first stage is to characterize the direct linkages between users through their spatial co-locations while the second is to capture the indirect linkages between them via a co-location graph. Experiments conducted on Gowalla check-in data and Meetup meeting events exhibit not only the superiority of our feature model, but also validate the predictability (with 70% accuracy) of detecting social ties solely from user footprints.
Message Clustering based Matrix Factorization Model for Retweeting Behavior Prediction BIBAFull-Text 1843-1846
  Bo Jiang; Jiguang Liang; Ying Sha; Lihong Wang
Retweeting is an important mechanism for information diffusion in social networks. Through retweeting, message is reshared from one user to another user, forming large cascades of message forwarding. Most existing researches of predicting retweeting utilize user social relationships for modeling which leads to vast calculating amount. In this paper, we propose two message clustering based matrix factorization models for retweeting prediction. Unlike previous approaches, our models exploit the clustering relationships of messages instead of social relationships. Our models are quite general because we do not need any auxiliary information except for message content. Several experiments on real datasets show that our models are effective and outperform the state-of-the-art methods.
Heterogeneous Multi-task Semantic Feature Learning for Classification BIBAFull-Text 1847-1850
  Xin Jin; Fuzhen Zhuang; Sinno Jialin Pan; Changying Du; Ping Luo; Qing He
Multi-task Learning (MTL) aims to learn multiple related tasks simultaneously instead of separately to improve generalization performance of each task. Most existing MTL methods assumed that the multiple tasks to be learned have the same feature representation. However, this assumption may not hold for many real-world applications. In this paper, we study the problem of MTL with heterogeneous features for each task. To address this problem, we first construct an integrated graph of a set of bipartite graphs to build a connection among different tasks. We then propose a multi-task nonnegative matrix factorization (MTNMF) method to learn a common semantic feature space underlying different heterogeneous feature spaces of each task. Finally, based on the common semantic features and original heterogeneous features, we model the heterogenous MTL problem as a multi-task multi-view learning (MTMVL) problem. In this way, a number of existing MTMVL methods can be applied to solve the problem effectively. Extensive experiments on three real-world problems demonstrate the effectiveness of our proposed method.
Top-k Reliable Edge Colors in Uncertain Graphs BIBAFull-Text 1851-1854
  Arijit Khan; Francesco Gullo; Thomas Wohler; Francesco Bonchi
We study the fundamental problem of finding the set of top-k edge colors that maximizes the reliability between a source node and a destination node in an uncertain and edge-colored graph. Our top-k reliable color set problem naturally arises in a variety of real-world applications including pathway finding in biological networks, topic-aware influence maximization, and team formation in social networks, among many others. In addition to the #P-completeness of the classical reliability finding problem between a source and a destination node over an uncertain graph, we prove that our problem is also NP-hard, and neither sub-modular, nor super-modular. To this end, we aim at designing effective and scalable solutions for the top-k reliable color set problem. We first introduce two baselines following the idea of repetitive inclusion of the next best edge colors, and we later develop a more efficient and effective algorithm that directly finds the highly-reliable paths while maintaining the budget on the number of edge-colors. An extensive empirical evaluation on various large-scale and real-world graph datasets illustrates that our proposed techniques are both scalable and highly accurate.
Probabilistic Non-negative Inconsistent-resolution Matrices Factorization BIBAFull-Text 1855-1858
  Masahiro Kohjima; Tatsushi Matsubayashi; Hiroshi Sawada
In this paper, we tackle with the problem of analyzing datasets with different resolution such as a pair of user's individual data and user group's data, for example "userA visited shopA 5 times" and "users whose attributes are men purchased itemA 80 times in total". In order to establish a basic approach to this problem, we focus on the simplified scenario and propose a new probabilistic model called probabilistic non-negative inconsistent-resolution matrices factorization (pNimf). pNimf is rigorously derived from the data generative process using latent high-resolution data which underlie low-resolution data. We conduct experiments on real purchase log data and confirm that the proposed model provides superior performance, and that the performance improves as the number of low-resolution data increases. These results imply that our way of modeling using latent high-resolution data can become the basic approach to the problem of analyzing dataset with different resolution.
Identifying Attractive News Headlines for Social Media BIBAFull-Text 1859-1862
  Sawa Kourogi; Hiroyuki Fujishiro; Akisato Kimura; Hitoshi Nishikawa
In the past, leading newspaper companies and broadcasters were the sole distributors of news articles, and thus news consumers simply received news articles from those outlets at regular intervals. However, the growth of social media and smart devices led to a considerable change in this traditional relationship between news providers and consumers. Hundreds of thousands of news articles are now distributed on social media, and consumers can access those articles at any time via smart devices. This has meant that news providers are under pressure to find ways of engaging the attention of consumers. This paper provides a novel solution to this problem by identifying attractive headlines as a gateway to news articles. We first perform one of the first investigations of news headlines on a major viral medium. Using our investigation as a basis, we also propose a learning-to-rank method that suggests promising news headlines. Our experiments with 2,000 news articles demonstrate that our proposed method can accurately identify attractive news headlines from the candidates and reveals several promising factors of making news articles go viral.
A Probabilistic Rating Auto-encoder for Personalized Recommender Systems BIBAFull-Text 1863-1866
  Huizhi Liang; Timothy Baldwin
User profiling is a key component of personalized recommender systems, and is used to generate user profiles that describe individual user interests and preferences. The increasing availability of big data is driving the urgent need for user profiling algorithms that are able to generate accurate user profiles from large-scale user behavior data. In this paper, we propose a probabilistic rating auto-encoder to perform unsupervised feature learning and generate latent user feature profiles from large-scale user rating data. Based on the generated user profiles, neighbourhood based collaborative filtering approaches have been adopted to make personalized rating predictions. The effectiveness of the proposed approach is demonstrated in experiments conducted on a real-world rating dataset from yelp.com.
Real-time Rumor Debunking on Twitter BIBAFull-Text 1867-1870
  Xiaomo Liu; Armineh Nourbakhsh; Quanzhi Li; Rui Fang; Sameena Shah
In this paper, we propose the first real time rumor debunking algorithm for Twitter. We use cues from 'wisdom of the crowds', that is, the aggregate 'common sense' and investigative journalism of Twitter users. We concentrate on identification of a rumor as an event that may comprise of one or more conflicting microblogs. We continue monitoring the rumor event and generate real time updates dynamically based on any additional information received. We show using real streaming data that it is possible, using our approach, to debunk rumors accurately and efficiently, often much faster than manual verification by professionals.
Fraud Transaction Recognition: A Money Flow Network Approach BIBAFull-Text 1871-1874
  Renxin Mao; Zhao Li; Jinhua Fu
In this paper, we provide some insights into analysis of fraud transaction recognition on Alipay's Money Flow Network. We first show that the Money Flow Network follows a power-law distribution on daily, monthly or yearly basis, based on which we propose a new approach of fraud transaction recognition on the Money Flow Network from two perspectives. First, the Collapse Network is identified by the discovery that fraud transaction requires a huge amount of active controlled 'zombie' accounts, which are always intentionally manipulated by fraudulent online sellers, and the collapse of the Money Flow Network emerges due to their economic inactivity; Second, we define the Activation Forest that leads to the recognition of the controlled 'zombies' even no sooner than they enter into Alipay's ecosphere. These two networks are fully explored from the perspective of detecting 'zombies', and several key features have been adopted into anti-fraud recognition. Experimental results show that our strategy is capable of effectively identifying fraud transactions on the Money Flow Network with the accuracy as high as 99.88%.
Identifying Top-k Consistent News-Casters on Twitter BIBAFull-Text 1875-1878
  Sahisnu Mazumder; Sameep Mehta; Dhaval Patel
News-casters are Twitter users who periodically pick up interesting news from online news media and spread it to their followers' network. Existing works on Twitter user analysis have only analysed a pre-defined set of users for user modeling, influence analysis and news recommendation. The problem of identifying prominent, trustworthy and consistent news-casters is unaddressed so far. In this paper, we present a framework, NCFinder, to discover top-k consistent news-casters directly from Twitter. NCFinder uses news headlines published in online news sources to periodically collect authentic news-tweets and processes them to discover news-casters, news sources and news concepts. Next, NCFinder builds a tripartite graph among news-casters, news source and news concepts and employs HITS algorithm on it to score the news-casters on daily basis. The daily score profiles of the news-casters collected over a time-period are then used to infer top-$k$ consistent news-casters. We run NCFinder from 11th Nov. to 24th Nov., 2014 and discover top-100 consistent news-casters and their profile information.
Mining the Minds of Customers from Online Chat Logs BIBAFull-Text 1879-1882
  Kunwoo Park; Jaewoo Kim; Jaram Park; Meeyoung Cha; Jiin Nam; Seunghyun Yoon; Eunhee Rhim
This study investigates factors that may determine satisfaction in customer service operations. We utilized more than 170,000 online chat sessions between customers and agents to identify characteristics of chat sessions that incurred dissatisfying experience. Quantitative data analysis suggests that sentiments or moods conveyed in online conversation are the most predictive factor of perceived satisfaction. Conversely, other session related meta data (such as that length, time of day, and response time) has a weaker correlation with user satisfaction. Knowing in advance what can predict satisfaction allows customer service staffs to identify potential weaknesses and improve the quality of service for better customer experience.
A Fast k-Nearest Neighbor Search Using Query-Specific Signature Selection BIBAFull-Text 1883-1886
  Youngki Park; Heasoo Hwang; Sang-goo Lee
k-nearest neighbor (k-NN) search aims at finding k points nearest to a query point in a given dataset. k-NN search is important in various applications, but it becomes extremely expensive in a high-dimensional large dataset. To address this performance issue, locality-sensitive hashing (LSH) is suggested as a method of probabilistic dimension reduction while preserving the relative distances between points. However, the performance of existing LSH schemes is still inconsistent, requiring a large amount of search time in some datasets while the k-NN approximation accuracy is low. In this paper, we target on improving the performance of k-NN search and achieving a consistent k-NN search that performs well in various datasets. In this regard, we propose a novel LSH scheme called Signature Selection LSH (S2LSH). First, we generate a highly diversified signature pool containing signature regions of various sizes and shapes. Then, for a given query point, we rank signature regions of the query and select points in the highly ranked signature regions as k-NN candidates of the query. Extensive experiments show that our approach consistently outperforms the state-of-the-art LSH schemes.
Core-Sets For Canonical Correlation Analysis BIBAFull-Text 1887-1890
  Saurabh Paul
Canonical Correlation Analysis (CCA) is a technique that finds how "similar" are the subspaces that are spanned by the columns of two different matrices A έℜ(of size m-x-n) and B έℜ(of size m-x-l). CCA measures similarity by means of the cosines of the so-called principal angles between the two subspaces. Those values are also known as canonical correlations of the matrix pair (A,B). In this work, we consider the over-constrained case where the number of rows is greater than the number of columns (m > max(n,l)). We study the problem of constructing "core-sets" for CCA. A core-set is a subset of rows from A and the corresponding subset of rows from B -- denoted by  and B, respectively. A "good" core-set is a subset of rows such that the canonical correlations of the core-set (Â, B) are "close" to the canonical correlations of the original matrix pair (A, B). There is a natural tradeoff between the core-set size and the approximation accuracy of a core-set. We present two algorithms namely, single-set spectral sparsification and leverage-score sampling, which find core-sets with additive-error guarantees to canonical correlations.
DeepCamera: A Unified Framework for Recognizing Places-of-Interest based on Deep ConvNets BIBAFull-Text 1891-1894
  Pai Peng; Hongxiang Chen; Lidan Shou; Ke Chen; Gang Chen; Chang Xu
In this work, we present a novel project called DeepCamera (DC) for recognizing places-of-interest (POI) with smartphones. Our framework is based on deep convolutional neural networks (ConvNets) which are currently state-of-the-art solutions to vision recognition tasks such as our mission. We propose a novel ConvNet by introducing a new layer called "spatial layer" which captures spatial knowledge from a geographic view. As a result, both spatial and visual knowledge contribute to generating a hybrid probability distribution over all possible POI candidates. Furthermore, we compress multiple trained deep ConvNets into one single shallow net called "shNet" which achieves competitive performance with ensemble methods. Our preliminary experiments conducted on real-world dataset have shown promising POI recognition results.
Structured Sparse Regression for Recommender Systems BIBAFull-Text 1895-1898
  Mingjie Qian; Liangjie Hong; Yue Shi; Suju Rajan
Feature-based collaborative filtering models, such as state-of-the-art factorization machines and regression-based latent factor models, rarely consider features' structural information, ignoring the heterogeneity of inter-type and intra-type relationships. Naïvely treating all feature pairs equally would potentially deteriorate the overall recommendation performance. In addition, human prior knowledge and other hierarchical or graphical structures are often available for some features, e.g., the country-state-city hierarchy for geographic features and the topical taxonomy for article features. It is a challenge to utilize the prior knowledge to further boost performance of state-of-the-art models. In this paper we employ rich features from both user and item sides to enhance latent factors learnt from interaction data, uncovering hidden structures from features' relationships and learning sparse pairwise and tree structural connections among features. Our framework borrows the modeling strengh from both structural sparsity modeling and latent factor models. Experiments on a real-world large-scale recommendation data set demonstrated that the proposed model outperforms several strong state-of-the-art baselines.
Analyzing Document Intensive Business Processes using Ontology BIBAFull-Text 1899-1902
  Suman Roychoudhury; Vinay Kulkarni; Nikhil Bellarykar
Knowledge is manifested in an enterprise in various forms ranging from unstructured operational data, to structured information like programs, as well as relational data stored in databases to semi-structured information stored in XML files. This information embodies the core of an enterprise knowledge base and analyzing the knowledge base can result in intelligent decision making. In order to realize this goal we begin with representing and analyzing unstructured knowledge present in an enterprise. In particular, this paper presents a real life example of a document intensive business process (International Trade) and attempts to model and analyze the process in a formal way. Typically, the information contained in a document intensive business process is of operational nature and requires extensive manual verification, which is both time consuming and error prone. Therefore, this research aims to eliminate such exhaustive manual verification by constructing a knowledge base in the form of ontology and apply suitable rule based reasoners to automate the verification process.
Transductive Domain Adaptation with Affinity Learning BIBAFull-Text 1903-1906
  Le Shu; Longin Jan Latecki
We study the problem of domain adaptation, which aims to adapt the classifiers trained on a labeled source domain to an unlabeled target domain. We propose a novel method to solve domain adaptation task in a transductive setting. The proposed method bridges the distribution gap between source domain and target domain through affinity learning. It exploits the existence of a subset of data points in target domain which distribute similarly to the data points in the source domain. These data points act as the bridge that facilitates the data similarities propagation across domains. We also propose to control the relative importance of intra- and inter-domain similarities to boost the similarity propagation. In our approach, we first construct the similarity matrix which encodes both the intra- and inter-domain similarities. We then learn the true similarities among data points in joint manifold using graph diffusion.
   We demonstrate that with improved similarities between source and target data, spectral embedding provides a better data representation, which boosts the prediction accuracy. The effectiveness of our method is validated on standard benchmark datasets for visual object recognition (multi-category).
Update Summarization using Semi-Supervised Learning Based on Hellinger Distance BIBAFull-Text 1907-1910
  Dingding Wang; Sahar Sohangir; Tao Li
Update summarization aims to generate brief summaries of recent documents to capture new information different from earlier documents. In this paper, we propose a new method to generate the sentence similarity graph using a novel similarity measure based on Helliger distance and apply semi-supervised learning on the sentence graph to select the sentences with maximum consistency and minimum redundancy to form the summaries. We use TAC 2011 data to evaluate our proposed method and compare it with existing baselines. The experimental results show the effectiveness of our proposed method.
Multi-view Clustering via Structured Low-rank Representation BIBAFull-Text 1911-1914
  Dong Wang; Qiyue Yin; Ran He; Liang Wang; Tieniu Tan
In this paper, we present a novel solution to multi-view clustering through a structured low-rank representation. When assuming similar samples can be linearly reconstructed by each other, the resulting representational matrix reflects the cluster structure and should ideally be block diagonal. We first impose low-rank constraint on the representational matrix to encourage better grouping effect. Then representational matrices under different views are allowed to communicate with each other and share their mutual cluster structure information. We develop an effective algorithm inspired by iterative re-weighted least squares for solving our formulation. During the optimization process, the intermediate representational matrix from one view serves as a cluster structure constraint for that from another view. Such mutual structural constraint fine-tunes the cluster structures from both views and makes them more and more agreeable. Extensive empirical study manifests the superiority and efficacy of the proposed method.
Partially Labeled Data Tuple Can Optimize Multivariate Performance Measures BIBAFull-Text 1915-1918
  Jim Jing-Yan Wang; Xin Gao
Multivariate performance measure optimization refers to learning predictive models such that a desired complex performance measure can be optimized over a training set, such as the F1 score. Up to now, all the existing multivariate performance measure optimization methods are limited to a completely labeled data tuple, i.e., the label tuple is complete. However, in real-world applications, sometimes it is difficult to obtain a complete label tuple. In this paper, we show that the multivariate performance measures can also be optimized by learning from partially labeled data tuple, when the label tuple is incomplete. We introduce a slack label tuple to represent the sought complete true label tuple, and learn it jointly with a hyper predictor, so that it can be consistent to the known labels, prediction results, and is smooth in the neighborhood. We develop an iterative learning algorithm to learn the slack label tuple and the hyper predictor. Its advantage over state-of-the-art multivariate performance measure optimization methods is shown by experiments on benchmark data sets.
Modeling Infinite Topics on Social Behavior Data with Spatio-temporal Dependence BIBAFull-Text 1919-1922
  Peng Wang; Peng Zhang; Chuan Zhou; Zhao Li; Guo Li
The problem of modeling topics on user behavior data in social networks has been widely studied in social marketing and social emotion analysis, where latent topic models are commonly used as the solutions. The user behavior data are highly related in time and space, which demands new latent topic models that consider both temporal and spatial distances. However, existing topic models either fail to model these two factors simultaneously, or cannot handle the high order dependence among user behaviors. In this paper we present a new nonparametric Bayesian model Time and Space Dependent Chinese Restaurant Processes (TSD-CRP for short). TSD-CRP can auto-select the number of topics and model high-order temporal and spatial dependence behind user behavior data. Empirical results on real-world data sets demonstrate the effectiveness of the proposed method.
ASEM: Mining Aspects and Sentiment of Events from Microblog BIBAFull-Text 1923-1926
  Ruhui Wang; Weijing Huang; Wei Chen; Tengjiao Wang; Kai Lei
Microblogs contain the most up-to-date and abundant opinion information on current events. Aspect-based opinion mining is a good way to get a comprehensive summarization of events. The most popular aspect based opinion mining models are used in the field of product and service. However, existing models are not suitable for event mining. In this paper we propose a novel probabilistic generative model (ASEM) to simultaneously discover aspects and the specified opinions. ASEM incorporate a sequence labeling model (CRF) into a generative topic model. Additionally, we adopt a set of features for separating aspects and sentiments. Moreover, we novelly present a continuously learning model. It can utilize the knowledge of one event to learn another, and get a better performance. We use five real world events to do experiment. The experimental results show that ASEM extracts aspects and sentiments well, and ASEM outperforms other state-of-art models and the intuitive two-step method.
Enhanced Word Embeddings from a Hierarchical Neural Language Model BIBAFull-Text 1927-1930
  Xun Wang; Katsuhoto Sudoh; Masaaki Nagata
This paper proposes a neural language model to capture the interaction of text units of different levels, i.e., documents, paragraphs, sentences, words in an hierarchical structure. At each paralleled level, the model incorporates Markov property while each higher-level unit hierarchically influences its containing units. Such an architecture enables the learned word embeddings to encode both global and local information. We evaluate the learned word embeddings and experiments demonstrate the effectiveness of our model.
Improving Label Quality in Crowdsourcing Using Noise Correction BIBAFull-Text 1931-1934
  Jing Zhang; Victor S. Sheng; Jian Wu; Xiaoqin Fu; Xindong Wu
This paper proposes a novel framework that introduces noise correction techniques to further improve label quality after ground truth inference in crowdsourcing. In the framework, an adaptive voting noise correction algorithm (AVNC) is proposed to identify and correct the most likely noises with the help of estimated qualities of labelers provided by the ground truth inference. The experimental results on two real-world datasets show that (1) the framework can improve label quality regardless of inference algorithms, especially under the circumstance that each example has a few noisy labels; and (2) since the algorithm AVNC considers both the number of and the probability of potential noises, it outperforms a baseline noise correction algorithm.
Improving Collaborative Filtering via Hidden Structured Constraint BIBAFull-Text 1935-1938
  Qing Zhang; Houfeng Wang
Matrix factorization models, as one of the most powerful Collaborative Filtering approaches, have greatly advanced the recommendation tasks. However, few of them are able to explicitly consider structured constraint for modeling user interests. To solve this problem, we propose a novel matrix factorization model with adaptive graph regularization framework, which can automatically discover latent user communities jointly with learning latent user representations, to enhance the discriminative power for recommendation. Experiments on real-world datasets demonstrate the effectiveness of the proposed method.

Workshop Reports

DOLAP 2015 Workshop Summary BIBAFull-Text 1939-1940
  Carlos Garcia-Alvarado; Carlos Ordonez; Il-Yeol Song
The ACM DOLAP workshop presents research that bridges data warehousing, On-Line Analytical Processing (OLAP), and other large-scale data processing platforms. The program has four interesting sessions on data warehouse design, database modeling, query processing, and text processing, as well as an invited paper on Big Data Database Design.
DTMBIO 2015: International Workshop on Data and Text Mining in Biomedical Informatics BIBAFull-Text 1941-1942
  Min Song; Doheon Lee; Karin Verspoor
Held each year in conjunction with one of the largest data management conferences, CIKM, the Ninth ACM International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO'15) is organized to bring together researchers interested in development and application of cutting-edge data management and analysis methods with a specific focus on applications in biology and medicine. The purpose of DTMBIO is to foster discussions regarding the state-of-the-art applications of data and text mining on biomedical research problems. DTMBIO'15 will help scientists understand emerging trends and opportunities in the evolving area of informatics related techniques and problems in the context of biomedical research.
ECol 2015: First international workshop on the Evaluation on Collaborative Information Seeking and Retrieval BIBAFull-Text 1943-1944
  Leif Azzopardi; Jeremy Pickens; Tetsuya Sakai; Laure Soulier; Lynda Tamine
Collaborative Information Seeking/Retrieval (CIS/CIR) has given rise to several challenges in terms of search behavior analysis, retrieval model formalization as well as interface design. However, the major issue of evaluation in CIS/CIR is still underexplored. The goal of this workshop is to investigate the evaluation challenges in CIS/CIR with the hope of building standardized evaluation frameworks, methodologies, and task specifications that would foster and grow the research area (in a collaborative fashion).
Eighth Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR'15) BIBAFull-Text 1945-1946
  Krisztian Balog; Jeffrey Dalton; Antoine Doucet; Yusra Ibrahim
The amount of structured content published on the Web has been growing rapidly, making it possible to address increasingly complex information access tasks. Recent years have witnessed the emergence of large scale human-curated knowledge bases as well as a growing array of techniques that identify or extract information automatically from unstructured and semi-structured sources. The ESAIR workshop series aims to advance the general research agenda on the problem of creating and exploiting semantic annotations. The eighth edition of ESAIR sets its focus on applications. We dedicate a special "annotations in action" track to demonstrations that showcase innovative prototype systems, in addition to the regular research and position paper contributions. The workshop also features invited talks from leaders in the field. The desired outcome of ESAIR'15 is a roadmap and research agenda that guides academic efforts and aligns them with industrial directions and developments.
LSDS-IR'15: 2015 Workshop on Large-Scale and Distributed Systems for Information Retrieval BIBAFull-Text 1947-1948
  Ismail Sengor Altingovde; B. Barla Cambazoglu; Nicola Tonellotto
The growth of the Web and other Big Data sources lead to important performance problems for large-scale and distributed information retrieval systems. The scalability and efficiency of such information retrieval systems have an impact on their effectiveness, eventually affecting the experience of their users and monetization as well. The LSDS-IR'15 workshop will provide space for researchers to discuss the existing performance problems in the context of large-scale and distributed information retrieval systems and define new research directions in the modern Big Data era. The workshop expects to bring together information retrieval practitioners from the industry, as well as academic researchers concerned with any aspect of large-scale and distributed information retrieval systems.
NWSearch 2015: International Workshop on Novel Web Search Interfaces and Systems BIBAFull-Text 1949-1950
  Davood Rafiei; Katsumi Tanaka
Held for the first time in conjunction with the ACM International Conference on Information and Knowledge Management (CIKM), NWSearch 2015 aims to bring together researchers, developers and practitioners who are interested in pushing the search boundary on the Web and exploring more novel forms of searches, interfaces, task formulations, and result organizations and presentations. In particular, the workshop seeks to identify some of the problems and challenges facing the development of such tools and interfaces and to flourish new ideas and findings that can shape or influence future research directions and developments. The workshop organizers solicited contributions that would fall within the large spectrum of human-computer interaction in one extreme and system production and development in the other extreme.
PIKM 2015: The 8th ACM Workshop for Ph.D. Students in Information and Knowledge Management BIBAFull-Text 1951-1952
  Mouna Kacimi; Nicoleta Preda; Maya Ramanath
The PIKM workshop offers Ph.D. students the opportunity to bring their work to an international and interdisciplinary research community, and create a network of young researchers to exchange and develop new and promising ideas. Similar to the CIKM, the PIKM workshop covers a wide range of topics in the areas of databases, information retrieval and knowledge management.
TM 2015 -- Topic Models: Post-Processing and Applications Workshop BIBAFull-Text 1953-1954
  Nikolaos Aletras; Jey Han Lau; Timothy Baldwin; Mark Stevenson
The main objective of the workshop is to bring together researchers who are interested in applications of topic models and improving their output. Our goal is to create a broad platform for researchers to share ideas that could improve the usability and interpretation of topic models. We expect this will promote topic model applications in other research areas, making their use more effective.
UCUI'15: The 1st International Workshop on Understanding the City with Urban Informatics BIBAFull-Text 1955-1956
  Yashar Moshfeghi; Iadh Ounis; Craig Macdonald; Joemon M. Jose; Peter Triantafillou; Mark Livingston; Piyushimita Thakuriah
Urban Informatics aims to exploit the large quantities of information produced by modern cities in order to gain insights into how they function. These insights lay the foundation for improving the lives of citizens, by improving the efficacy and efficiency of public services, and satisfying complex information needs arising within this context. The goal of the workshop is to provide a multidisciplinary forum which brings together researchers in Big Data (BD), Information Retrieval (IR), Data Mining, and Urban Studies, to explore novel solutions to the numerous theoretical, practical and ethical challenges arising in this context. These include difficulties in collecting city data, creating data management infrastructures, and providing new effective and efficient information access techniques to as many users as possible in the context of a smart city. To foster the development of new BD and IR approaches in Urban Informatics, the workshop makes available a representative dataset of city data, including Internet-based visual (Flickr) and textual (Tweets and News) media collections. The workshop provides enormous opportunities for data scientists who wish to understand the complexities of working with city data, conduct innovative research within Urban Informatics, and build a long-term community in this emerging research area.