
Proceedings of the 2015 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname:Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors:Ricardo Baeza-Yates; Mounia Lalmas; Alistair Moffat; Berthier Ribeiro-Neto
Location:Santiago, Chile
Dates:2015-Aug-09 to 2015-Aug-13
Standard No:ISBN: 978-1-4503-3621-5; hcibib: IR15
  1. Salton Award
  2. Session 1A: Assisting the Search
  3. Session 1B: Multimedia
  4. Session 1C: Efficient Algorithms
  5. Session 2A: Diversity and Bias
  6. Session 2B: Queries
  7. Session 2C: Graphs
  8. Session 3A: Search Experience
  9. Session 3B: Social Media
  10. Session 3C: Entities
  11. Session 4A: User Models
  12. Session 4B: Recommending
  13. Session 4C: Classifying & Ranking
  14. Session 5A: Deep Learning
  15. Session 5B: Products
  16. Session 5C: Locations
  17. Session 6A: Experiment Design
  18. Session 6B: Predicting
  19. Session 6C: Tasks and Devices
  20. Keynote
  21. Session 7A: Assessing
  22. Session 7B: Terms
  23. Session 8A: Variability in test collections
  24. Session 8B: Citations
  25. Session 9A: Streams
  26. Session 9B: Cards
  27. Short Papers
  28. Demonstrations
  29. Doctoral Consortium
  30. Industry Track Invited Talks
  31. Industry Track Refereed Papers
  32. Tutorials
  33. Workshops

Salton Award

Salton Award Lecture: People, Interacting with Information 1-2
  Nicholas J. Belkin
Colleagues, friends, let me begin by expressing how pleased, and humbly honored I am to be a recipient of the Gerard Salton Award. Gerry was a great man, and to receive the award named for him is very special. For me personally, it is especially meaningful, given the sometime disputatious nature of our professional interactions, and what seemed, on the surface, to be quite different ideas about information retrieval. I say, on the surface, because in the end, I believe that he and I both shared the same goal for the field, although we approached it from quite different positions. In this presentation, I will speak at some length on that goal, and on how I think it might be best addressed. I am humbled also by the honor of having joined the ranks of the previous recipients of this award; the founders, leaders and innovators in information retrieval (IR), from the earliest beginnings of the field to today. It has been my distinct good fortune to have known all of the previous recipients, to have collaborated with many of them, to have argued with all of them, to have learned from them, and, I hope, to have been able to appropriately incorporate their insights into my own work. Today, following the example of many of my predecessors, I'd like to take this opportunity to, in Sue Dumais's words of 2009, "present a personal reflection on information retrieval." This will include an overview of my history in IR, a discussion of my personal take on its proper goals, and on how those might be best achieved, and saying something about the challenges that IR theory, experiment and practice face both now, and in the future, and how we might best address those challenges. I came to the field of IR from a starting point in information science; specifically, with the concern of addressing general problems of information in society. I initially thought that the best way to do this would be to establish a firm framework for a science of information. 
Acting on my then understanding of what a science constituted, I began the project of defining information. Fortunately for me, I did this at University College, London, with B.C. Brookes as my Ph.D. supervisor, and Steve Robertson my office mate. It didn't take long for me to be disabused of a) the idea that information could be defined; and b) that defining its phenomena of interest was a necessary precondition to a "real" science. Instead, perhaps under the influence of the great pragmatist, Jeremy Bentham, founder of University College, I turned to attempting to develop a concept of information which would lead to being able to predict its effect on a person's state of knowledge, which I took at that time as what IR systems should be attempting to do.

Session 1A: Assisting the Search

Exploring Session Context using Distributed Representations of Queries and Reformulations 3-12
  Bhaskar Mitra
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "San Francisco" -- "San Francisco 49ers" is semantically similar to "Detroit" -- "Detroit Lions". Likewise, "London" -- "things to do in London" and "New York" -- "New York tourist attractions" can also be considered similar transitions in intent. The reformulations "movies" -- "new movies" and "york" -- "New York", however, are clearly different despite their lexical similarities. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations. Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as for query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that using features based on both representations together achieves better performance than using either of them individually.
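The offset-vector intuition in this abstract can be illustrated with a minimal sketch. The three-dimensional embeddings below are made-up values chosen for the example, not output of the Convolutional Latent Semantic Model, and the names `EMBEDDINGS`, `reformulation_vector` and `cosine` are hypothetical helpers rather than anything from the paper.

```python
import math

# Toy query embeddings (made-up values, for illustration only).
EMBEDDINGS = {
    "san francisco": [0.9, 0.1, 0.0],
    "san francisco 49ers": [0.9, 0.1, 0.8],
    "detroit": [0.1, 0.9, 0.0],
    "detroit lions": [0.1, 0.9, 0.7],
    "movies": [0.5, 0.5, 0.1],
    "new movies": [0.6, 0.2, 0.1],
}


def reformulation_vector(source, target):
    """Represent a reformulation as the offset target - source."""
    return [t - s for s, t in zip(EMBEDDINGS[source], EMBEDDINGS[target])]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


team_sf = reformulation_vector("san francisco", "san francisco 49ers")
team_det = reformulation_vector("detroit", "detroit lions")
recency = reformulation_vector("movies", "new movies")

# The two "city -> city's team" offsets point the same way;
# the "movies -> new movies" offset does not.
print(cosine(team_sf, team_det))
print(cosine(team_sf, recency))
```

With these toy values the two team-related reformulations are maximally similar while the third is orthogonal to them; real embeddings would give less clean numbers.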
An Eye-Tracking Study of Query Reformulation 13-22
  Carsten Eickhoff; Sebastian Dungs; Vu Tran
Information about a user's domain knowledge and interest can be important signals for many information retrieval tasks such as query suggestion or result ranking. State-of-the-art user models rely on coarse-grained representations of the user's previous knowledge about a topic or domain. In this paper, we study query refinement using eye-tracking in order to gain precise and detailed insight into which terms the user was exposed to in a search session and which ones they showed a particular interest in. We measure fixations on the term level, allowing for a detailed model of user attention. To allow for a wide-spread exploitation of our findings, we generalize from the restrictive eye-gaze tracking to using more accessible signals: mouse cursor traces. Based on the public API of a popular search engine, we demonstrate how query suggestion candidates can be ranked according to traces of user attention and interest, resulting in significantly better performance than achieved by an attention-oblivious industry solution. Our experiments suggest that modelling term-level user attention can be achieved with great reliability and holds significant potential for supporting a range of traditional IR tasks.
Differences in the Use of Search Assistance for Tasks of Varying Complexity 23-32
  Robert Capra; Jaime Arguello; Anita Crescenzi; Emily Vardell
In this paper, we study how users interact with a search assistance tool while completing tasks of varying complexity. We designed a novel tool referred to as the search guide (SG) that displays the search trails (queries issued, results clicked, pages bookmarked) from three previous users who completed the task. We report on a laboratory study with 48 participants that investigates different factors that may influence user interaction with the SG and the effects of the SG on different outcome measures. Participants were asked to find and bookmark pages for four tasks of varying complexity and the SG was made available to half the participants. We collected log data and conducted retrospective stimulated recall interviews to learn about participants' use of the SG. Our results suggest the following trends. First, interaction with the SG was greater for more complex tasks. Second, the a priori determinability of the task (i.e., whether the task was perceived to be well-defined) helped predict whether participants gained a bookmark from the SG. Third, participants who interacted with the SG, but did not gain a bookmark, felt less system support than those who gained a bookmark and those who did not interact. Finally, a qualitative analysis of our interviews suggests differences in motivation and benefits from SG use for different levels of task complexity. Our findings extend prior research on search assistance tools and provide insights for the design of systems to help users with complex search tasks.

Session 1B: Multimedia

Dynamic Query Modeling for Related Content Finding 33-42
  Daan Odijk; Edgar Meij; Isaac Sijaranamual; Maarten de Rijke
While watching television, people increasingly consume additional content related to what they are watching. We consider the task of finding video content related to a live television broadcast for which we leverage the textual stream of subtitles associated with the broadcast. We model this task as a Markov decision process and propose a method that uses reinforcement learning to directly optimize the retrieval effectiveness of queries generated from the stream of subtitles. Our dynamic query modeling approach significantly outperforms state-of-the-art baselines for stationary query modeling and for text-based retrieval in a television setting. In particular we find that carefully weighting terms and decaying these weights based on recency significantly improves effectiveness. Moreover, our method is highly efficient and can be used in a live television setting, i.e., in near real time.
Image-Based Recommendations on Styles and Substitutes 43-52
  Julian McAuley; Christopher Targett; Qinfeng Shi; Anton van den Hengel
Humans inevitably develop a sense of the relationships between objects, some of which are based on their appearance. Some pairs of objects might be seen as being alternatives to each other (such as two pairs of jeans), while others may be seen as being complementary (such as a pair of jeans and a matching shirt). This information guides many of the choices that people make, from buying clothes to their interactions with each other. We seek here to model this human sense of the relationships between objects based on their appearance. Our approach is not based on fine-grained modeling of user annotations but rather on capturing the largest dataset possible and developing a scalable method for uncovering human notions of the visual relationships within. We cast this as a network inference problem defined on graphs of related images, and provide a large-scale dataset for the training and evaluation of the same. The system we develop is capable of recommending which clothes and accessories will go well together (and which will not), amongst a host of other applications.
Semi-supervised Hashing with Semantic Confidence for Large Scale Visual Search 53-62
  Yingwei Pan; Ting Yao; Houqiang Li; Chong-Wah Ngo; Tao Mei
Similarity search is one of the fundamental problems for large scale multimedia applications. Hashing techniques, as one popular strategy, have been intensively investigated owing to their speed and memory efficiency. Recent research has shown that leveraging supervised information can lead to high quality hashing. However, most existing supervised methods learn hash functions by treating each training example equally while ignoring the different degrees to which examples are semantically related to their labels, i.e. the semantic confidence of different examples. In this paper, we propose a novel semi-supervised hashing framework by leveraging semantic confidence. Specifically, a confidence factor is first assigned to each example by neighbor voting and click count in the scenarios with label and click-through data, respectively. Then, the factor is incorporated into the pairwise and triplet relationship learning for hashing. Furthermore, the two learnt relationships are seamlessly encoded into semi-supervised hashing methods with pairwise and listwise supervision respectively, which are formulated as minimizing empirical error on the labeled data while maximizing the variance of hash bits or minimizing quantization loss over both the labeled and unlabeled data. In addition, the kernelized variant of semi-supervised hashing is also presented. We have conducted experiments on both CIFAR-10 (with label) and Clickture (with click data) image benchmarks (up to one million image examples), demonstrating that our approaches outperform the state-of-the-art hashing techniques.

Session 1C: Efficient Algorithms

Optimal Aggregation Policy for Reducing Tail Latency of Web Search 63-72
  Jeong-Min Yun; Yuxiong He; Sameh Elnikety; Shaolei Ren
A web search engine often employs a partition-aggregate architecture, where an aggregator propagates a user query to all index serving nodes (ISNs) and collects the responses from them. An aggregation policy determines how long the aggregators wait for the ISNs before returning aggregated results to users, crucially affecting both query latency and quality. Designing an aggregation policy is, however, challenging: response latency among queries and among ISNs varies significantly, and aggregators lack knowledge about when ISNs will respond. In this paper, we propose aggregation policies that minimize tail latency of search queries subject to search quality service level agreements (SLAs), combining data-driven offline analysis with online processing. Beginning with a single aggregator, we formally prove the optimality of our policy: it achieves the offline optimal result without knowing future responses of ISNs. We extend our policy for commonly-used hierarchical levels of aggregators and prove its optimality when messaging times between aggregators are known. We also present an empirically-effective policy to address unknown messaging time. We use production traces from a commercial search engine, a commercial advertisement engine, and synthetic workloads to evaluate the aggregation policy. The results show that compared to prior work, the policy reduces tail latency by up to 40% while satisfying the same quality SLAs.
QuickScorer: A Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees 73-82
  Claudio Lucchese; Franco Maria Nardini; Salvatore Orlando; Raffaele Perego; Nicola Tonellotto; Rossano Venturini
Learning-to-Rank models based on additive ensembles of regression trees have proven to be very effective for ranking query results returned by Web search engines, a scenario where quality and efficiency requirements are very demanding. Unfortunately, the computational cost of these ranking models is high. Thus, several works have already proposed solutions aiming at improving the efficiency of the scoring process by dealing with features and peculiarities of modern CPUs and memory hierarchies. In this paper, we present QuickScorer, a new algorithm that adopts a novel bitvector representation of the tree-based ranking model, and performs an interleaved traversal of the ensemble by means of simple logical bitwise operations. The performance of the proposed algorithm is unprecedented, due to its cache-aware approach, both in terms of data layout and access patterns, and to a control flow that entails very low branch mis-prediction rates. The experiments on real Learning-to-Rank datasets show that QuickScorer is able to achieve speedups over the best state-of-the-art baseline ranging from 2x to 6.5x.
High Quality Graph-Based Similarity Search 83-92
  Weiren Yu; Julie Ann McCann
SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.
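For readers unfamiliar with the baseline measure, the following is a minimal sketch of the classic iterative SimRank recurrence: s(a,a) = 1, and for a != b the score is the decay factor times the average similarity over all pairs of in-neighbors. It is not the SR# scheme, the "varied-D" method, or the kernel-based model from the paper; the function name `simrank`, the decay value 0.6 and the toy graph are assumptions made for illustration.

```python
def simrank(in_neighbors, decay=0.6, iterations=10):
    """Naive iterative SimRank over a dict mapping node -> list of in-neighbors."""
    nodes = list(in_neighbors)
    # Start from the identity: each node is only similar to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # No in-neighbors on one side: similarity is defined as 0.
                    new[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph (hypothetical): a university links to two professors,
# and each professor links to one student.
IN_NEIGHBORS = {
    "univ": [],
    "prof_a": ["univ"],
    "prof_b": ["univ"],
    "student_a": ["prof_a"],
    "student_b": ["prof_b"],
}

scores = simrank(IN_NEIGHBORS)
print(scores["prof_a"]["prof_b"])
print(scores["student_a"]["student_b"])
```

On this graph the two professors share their only in-neighbor and score 0.6, and the two students inherit 0.6 times that, roughly 0.36. The naive double loop is quadratic in the number of nodes per iteration, which is the kind of cost the methods discussed in the abstract aim to avoid.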

Session 2A: Diversity and Bias

Summarizing Contrastive Themes via Hierarchical Non-Parametric Processes 93-102
  Zhaochun Ren; Maarten de Rijke
Given a topic of interest, a contrastive theme is a group of opposing pairs of viewpoints. We address the task of summarizing contrastive themes: given a set of opinionated documents, select meaningful sentences to represent contrastive themes present in those documents. Several factors make this a challenging problem: unknown numbers of topics, unknown relationships among topics, and the extraction of comparative sentences. Our approach has three core ingredients: contrastive theme modeling, diverse theme extraction, and contrastive theme summarization. Specifically, we present a hierarchical non-parametric model to describe hierarchical relations among topics; this model is used to infer threads of topics as themes from the nested Chinese restaurant process. We enhance the diversity of themes by using structured determinantal point processes for selecting a set of diverse themes with high quality. Finally, we pair contrastive themes and employ an iterative optimization algorithm to select sentences, explicitly considering contrast, relevance, and diversity. Experiments on three datasets demonstrate the effectiveness of our method.
Splitting Water: Precision and Anti-Precision to Reduce Pool Bias 103-112
  Aldo Lipani; Mihai Lupu; Allan Hanbury
For many tasks in evaluation campaigns, especially those modeling narrow domain-specific challenges, lack of participation leads to a potential pooling bias due to the scarce number of pooled runs. It is well known that the reliability of a test collection is proportional to the number of topics and relevance assessments provided for each topic, but also to some extent to the diversity in participation in the challenges. Hence, in this paper we present a new perspective in reducing the pool bias by studying the effect of merging an unpooled run with the pooled runs. We also introduce an indicator used by the bias correction method to decide whether the correction needs to be applied or not. This indicator gives strong clues about the potential of a "good" run tested on an "unfriendly" test collection (i.e. a collection where the pool was contributed to by runs very different from the one at hand). We demonstrate the correctness of our method on a set of fifteen test collections from the Text REtrieval Conference (TREC). We observe a reduction in system ranking error and absolute score difference error.
Learning Maximal Marginal Relevance Model via Directly Optimizing Diversity Evaluation Measures 113-122
  Long Xia; Jun Xu; Yanyan Lan; Jiafeng Guo; Xueqi Cheng
In this paper we address the issue of learning a ranking model for search result diversification. In this task, a model concerned with both query-document relevance and document diversity is automatically created from training data. Ideally a diverse ranking model would be designed to meet the criterion of maximal marginal relevance, for selecting documents that have the least similarity to previously selected documents. Also, an ideal learning algorithm for diverse ranking would train a ranking model that could directly optimize the diversity evaluation measures with respect to the training data. Existing methods, however, either fail to model the marginal relevance, or train ranking models by minimizing loss functions that are only loosely related to the evaluation measures. To deal with the problem, we propose a novel learning algorithm under the framework of Perceptron, which adopts the ranking model that maximizes marginal relevance at ranking and can optimize any diversity evaluation measure in training. The algorithm, referred to as PAMM (Perceptron Algorithm using Measures as Margins), first constructs positive and negative diverse rankings for each training query, and then repeatedly adjusts the model parameters so that the margins between the positive and negative rankings are maximized. Experimental results on three benchmark datasets show that PAMM significantly outperforms the state-of-the-art baseline methods.
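The maximal marginal relevance criterion mentioned in this abstract can be sketched as a greedy selection loop. This is the classic hand-tuned formulation, not the learned PAMM model; the trade-off weight, the function name `mmr_rank`, and the toy relevance and similarity values are assumptions made for the example.

```python
def mmr_rank(relevance, similarity, trade_off=0.5):
    """Greedy MMR: repeatedly pick the document with the best balance of
    relevance and dissimilarity to the documents already selected."""
    remaining = set(relevance)
    ranking = []
    while remaining:
        def marginal(doc):
            # Redundancy is the highest similarity to any selected document.
            redundancy = max((similarity(doc, s) for s in ranking), default=0.0)
            return trade_off * relevance[doc] - (1 - trade_off) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=marginal)
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Toy data (hypothetical): d1 and d2 are near-duplicates, d3 covers another aspect.
RELEVANCE = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
PAIR_SIM = {
    frozenset(("d1", "d2")): 0.95,
    frozenset(("d1", "d3")): 0.1,
    frozenset(("d2", "d3")): 0.1,
}


def similarity(a, b):
    return PAIR_SIM[frozenset((a, b))]


print(mmr_rank(RELEVANCE, similarity))
print(mmr_rank(RELEVANCE, similarity, trade_off=1.0))
```

With an even trade-off the less relevant but novel d3 is promoted above the near-duplicate d2; with `trade_off=1.0` the loop degenerates to a plain relevance ordering.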

Session 2B: Queries

Analyzing User's Sequential Behavior in Query Auto-Completion via Markov Processes 123-132
  Liangda Li; Hongbo Deng; Anlei Dong; Yi Chang; Hongyuan Zha; Ricardo Baeza-Yates
Query auto-completion (QAC) plays an important role in helping users type less while submitting a query. The QAC engine generally offers a list of suggested queries that start with a user's input as a prefix, and the list of suggestions is changed to match the updated input after the user types each keystroke. Therefore rich user interactions can be observed along with each keystroke until a user clicks a suggestion or types the entire query manually. It becomes increasingly important to analyze and understand users' interactions with the QAC engine, to improve its performance. Existing works on QAC either ignored users' interaction data, or assumed that their interactions at each keystroke are independent from others. Our paper focuses on users' sequential interactions with a QAC engine in and across QAC sessions, rather than users' interactions at each keystroke of each QAC session separately. Analyzing the dependencies in users' sequential interactions improves our understanding of the following three questions: 1) how is a user's skipping/viewing move at the current keystroke influenced by that at the previous keystroke? 2) how to improve search engines' query suggestions at short keystrokes based on those at later long keystrokes? and 3) facing a targeted query shown in the suggestion list, why does a user decide to continue typing rather than click the intended suggestion? We propose a probabilistic model that addresses those three questions in a unified way, and illustrate how the model determines users' final click decisions. Compared with state-of-the-art methods, our proposed model suggests queries that better satisfy users' intents.
Learning by Example: Training Users with High-quality Query Suggestions 133-142
  Morgan Harvey; Claudia Hauff; David Elsweiler
The queries submitted by users to search engines often poorly describe their information needs and represent a potential bottleneck in the system. In this paper we investigate to what extent it is possible to aid users in learning how to formulate better queries by providing examples of high-quality queries interactively during a number of search sessions. By means of several controlled user studies we collect quantitative and qualitative evidence that shows: (1) study participants are able to identify and abstract qualities of queries that make them highly effective, (2) after seeing high-quality example queries participants are able to themselves create queries that are highly effective, and, (3) those queries look similar to expert queries as defined in the literature. We conclude by discussing what the findings mean in the context of the design of interactive search systems.
adaQAC: Adaptive Query Auto-Completion via Implicit Negative Feedback 143-152
  Aston Zhang; Amit Goyal; Weize Kong; Hongbo Deng; Anlei Dong; Yi Chang; Carl A. Gunter; Jiawei Han
Query auto-completion (QAC) facilitates user query composition by suggesting queries given query prefix inputs. In 2014, global users of Yahoo! Search saved more than 50% keystrokes when submitting English queries by selecting suggestions of QAC. Users' preference of queries can be inferred during user-QAC interactions, such as dwelling on suggestion lists for a long time without selecting query suggestions ranked at the top. However, the wealth of such implicit negative feedback has not been exploited for designing QAC models. Most existing QAC models rank suggested queries for given prefixes based on certain relevance scores.
   We take the initiative towards studying implicit negative feedback during user-QAC interactions. This motivates re-designing QAC in the more general "(static) relevance–(adaptive) implicit negative feedback" framework. We propose a novel adaptive model adaQAC that adapts query auto-completion to users' implicit negative feedback towards unselected query suggestions. We collect user-QAC interaction data and perform large-scale experiments. Empirical results show that implicit negative feedback significantly and consistently boosts the accuracy of the investigated static QAC models that only rely on relevance scores. Our work compellingly makes a key point: QAC should be designed in a more general framework for adapting to implicit negative feedback.

Session 2C: Graphs

A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking 153-162
  Giang Tran; Ata Turk; B. Barla Cambazoglu; Wolfgang Nejdl
Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.
A Similarity Measure for Weaving Patterns in Textiles 163-172
  Sven Helmer; Vuong Minh Ngo
We propose a novel approach for measuring the similarity between weaving patterns that can provide similarity-based search functionality for textile archives. We represent textile structures using hypergraphs and extract multisets of k-neighborhoods from these graphs. The resulting multisets are then compared using Jaccard coefficients, Hamming distances, and cosine measures. We evaluate the different variants of our similarity measure experimentally, showing that it can be implemented efficiently and illustrating its quality by using it to cluster and query a data set containing more than a thousand textile samples.
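One of the comparisons named in this abstract, the Jaccard coefficient on multisets, can be sketched as follows. The neighborhood labels ("plain", "twill", "satin") and their counts are hypothetical stand-ins for whatever k-neighborhood encoding the paper extracts from its hypergraphs, and `multiset_jaccard` is an illustrative helper name.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Jaccard coefficient for multisets: sum of element-wise minimum counts
    divided by sum of element-wise maximum counts."""
    intersection = sum((a & b).values())  # Counter & keeps minimum counts
    union = sum((a | b).values())         # Counter | keeps maximum counts
    # Two empty multisets are treated as identical by convention.
    return intersection / union if union else 1.0


# Hypothetical multisets of neighborhood labels for two textile samples.
sample_a = Counter({"plain": 3, "twill": 1})
sample_b = Counter({"plain": 1, "twill": 1, "satin": 2})

print(multiset_jaccard(sample_a, sample_b))
```

Here the shared mass is 2 (one "plain", one "twill") and the combined mass is 6, giving a coefficient of one third. Unlike the set-based Jaccard coefficient, repeated neighborhoods contribute according to how often they occur.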
Local Ranking Problem on the BrowseGraph 173-182
  Michele Trevisiol; Luca Maria Aiello; Paolo Boldi; Roi Blanco
The "Local Ranking Problem" (LRP) is related to the computation of a centrality-like rank on a local graph, where the scores of the nodes could significantly differ from the ones computed on the global graph. Previous work has studied LRP on the hyperlink graph but never on the BrowseGraph, namely a graph where nodes are webpages and edges are browsing transitions. Recently, this graph has received more and more attention in many different tasks such as ranking, prediction and recommendation. However, a web-server has only the browsing traffic performed on its pages (local BrowseGraph) and, as a consequence, the local computation can lead to estimation errors, which hinders the increasing number of applications in the state of the art. Also, although the divergence between the local and global ranks has been measured, the possibility of estimating such divergence using only local knowledge has been mainly overlooked. These aspects are of great interest for online service providers who want to: (i) gauge their ability to correctly assess the importance of their resources only based on their local knowledge, and (ii) take into account real user browsing fluxes that better capture the actual user interest than the static hyperlink network. We study the LRP problem on a BrowseGraph from a large news provider, considering as subgraphs the aggregations of browsing traces of users coming from different domains. We show that the distance between rankings can be accurately predicted based only on structural information of the local graph, being able to achieve an average rank correlation as high as 0.8.
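To make the notion of distance between a local and a global ranking concrete, here is a minimal sketch that compares two score assignments with Kendall's tau. The abstract does not say which rank correlation coefficient the authors report, so the choice of Kendall's tau is an assumption, as are the function name `kendall_tau` and the page scores.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over the items present in both score dictionaries."""
    items = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        agreement = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Hypothetical centrality scores for four pages, computed on the global
# graph and on a local subgraph.
global_scores = {"p1": 0.4, "p2": 0.3, "p3": 0.2, "p4": 0.1}
local_scores = {"p1": 0.35, "p2": 0.15, "p3": 0.3, "p4": 0.2}

print(kendall_tau(global_scores, local_scores))
```

In this toy case four of the six page pairs are ordered the same way in both rankings and two are swapped, so the coefficient is one third; identical orderings give 1.0.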

Session 3A: Search Experience

How many results per page?: A Study of SERP Size, Search Behavior and User Experience 183-192
  Diane Kelly; Leif Azzopardi
The provision of "ten blue links" has emerged as the standard for the design of search engine result pages (SERPs). While numerous aspects of SERPs have been examined, little attention has been paid to the number of results displayed per page. This paper investigates the relationships among the number of results shown on a SERP, search behavior and user experience. We performed a laboratory experiment with 36 subjects, who were randomly assigned to use one of three search interfaces that varied according to the number of results per SERP (three, six or ten). We found subjects' click distributions differed significantly depending on SERP size. We also found those who interacted with three results per page viewed significantly more SERPs per query; interestingly, the number of SERPs they viewed per query corresponded to about 10 search results. Subjects who interacted with ten results per page viewed and saved significantly more documents. They also reported the greatest difficulty finding relevant documents, rated their skills the lowest and reported greater workload, even though these differences were not significant. This work shows that behavior changes with SERP size, such that more time is spent focused on earlier results when SERP size decreases.
Influence of Vertical Result in Web Search Examination 193-202
  Zeyang Liu; Yiqun Liu; Ke Zhou; Min Zhang; Shaoping Ma
Research on how users examine results on search engine result pages (SERPs) helps improve result ranking, advertisement placement, performance evaluation and search UI design. Although examination behavior on organic search results (also known as "ten blue links") has been well studied in existing works, a thorough investigation of how users examine SERPs with verticals is lacking. Considering the fact that a large fraction of SERPs are served with one or more verticals in the practical Web search scenario, it is of vital importance to understand the influence of vertical results on search examination behaviors. In this paper, we focus on five popular vertical types and study their influence on users' examination processes both when they are relevant and when they are irrelevant to the search queries. With examination behavior data collected with an eye-tracking device, we show the existence of vertical-aware user behavior effects including vertical attraction effect, examination cut-off effect in the presence of a relevant vertical, and examination spill-over effect in the presence of an irrelevant vertical. Furthermore, we are also among the first to systematically investigate the internal examination behavior within the vertical results. We believe that this work will promote our understanding of user interactions with federated search engines and bring benefit to the construction of search performance evaluations.
Unconscious Physiological Effects of Search Latency on Users and Their Click Behaviour BIBAFull-Text 203-212
  Miguel Barreda-Ángeles; Ioannis Arapakis; Xiao Bai; B. Barla Cambazoglu; Alexandre Pereda-Baños
Understanding the impact of a search system's response latency on its users' searching behaviour has recently been an active research topic in the information retrieval and human-computer interaction areas. Along the same line, this paper focuses on the user impact of search latency and makes the following two contributions. First, through a controlled experiment, we reveal the physiological effects of response latency on users and show that these effects are present even at small increases in response latency. We compare these effects with the information gathered from self-reports and show that they capture the nuanced attentional and emotional reactions to latency much better. Second, we carry out a large-scale analysis using a web search query log obtained from Yahoo to understand the change in the way users engage with a web search engine under varying levels of increasing response latency. In particular, we analyse the change in the click behaviour of users when they are subject to increasing response latency and reveal significant behavioural differences.

Session 3B: Social Media

Multiple Social Network Learning and Its Application in Volunteerism Tendency Prediction BIBAFull-Text 213-222
  Xuemeng Song; Liqiang Nie; Luming Zhang; Mohammad Akbari; Tat-Seng Chua
We are living in the era of social networks, where people throughout the world are connected and organized by multiple social networks. The views revealed by different social networks may vary according to the different services they offer. They are complementary to each other and comprehensively characterize a specific user from different perspectives. As compared to the scarce knowledge conveyed by a single source, appropriate aggregation of multiple social networks offers us a better opportunity for deep user understanding. The challenges, however, co-exist with opportunities. The first challenge lies in the existence of block-wise missing data, caused by the fact that some users may be very active in certain social networks while inactive in others. The second challenge is how to collaboratively integrate multiple social networks. Towards this end, we first proposed a novel model for missing data completion by seamlessly exploring the knowledge from multiple sources. We then developed a robust multiple social network learning model, and applied it to volunteerism tendency prediction. Extensive experiments on a real-world dataset verify the effectiveness of our scheme. The proposed scheme is applicable to many other domains, such as demographic inference and interest prediction.
HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research BIBAFull-Text 223-232
  Surendra Sedhai; Aixin Sun
Hashtags facilitate information diffusion in Twitter by creating dynamic and virtual communities for information aggregation from all Twitter users. Because hashtags serve as additional channels through which one's tweets can be accessed by users other than one's own followers, hashtags are targeted for spamming purposes (e.g., hashtag hijacking), particularly the popular and trending hashtags. Although much effort has been devoted to fighting against email/web spam, few studies address hashtag-oriented spam in tweets. In this paper, we collected 14 million tweets that matched some trending hashtags in two months' time and then conducted systematic annotation of the tweets as spam or ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process includes four major steps: (i) heuristic-based selection to search for tweets that are more likely to be spam, (ii) near-duplicate cluster based annotation to first group similar tweets into clusters and then label the clusters, (iii) reliable ham tweets detection to label tweets that are non-spam, and (iv) Expectation-Maximization (EM)-based label prediction to predict the labels of remaining unlabeled tweets. One major contribution of this work is the creation of the HSpam14 dataset, which can be used for hashtag-oriented spam research in tweets. Another contribution is the observations made from the preliminary analysis of the HSpam14 dataset.
Uncovering Crowdsourced Manipulation of Online Reviews BIBAFull-Text 233-242
  Amir Fayazi; Kyumin Lee; James Caverlee; Anna Squicciarini
Online reviews are a cornerstone of consumer decision making. However, their authenticity and quality have proven hard to control, especially as polluters target these reviews to promote products or degrade competitors. In a troubling direction, the widespread growth of crowdsourcing platforms like Mechanical Turk has created a large-scale, potentially difficult-to-detect workforce of malicious review writers. Hence, this paper tackles the challenge of uncovering crowdsourced manipulation of online reviews through a three-part effort: (i) First, we propose a novel sampling method for identifying products that have been targeted for manipulation and a seed set of deceptive reviewers who have been enlisted through crowdsourcing platforms. (ii) Second, we augment this base set of deceptive reviewers through a reviewer-reviewer graph clustering approach based on a Markov Random Field where we define individual potentials (of single reviewers) and pair potentials (between two reviewers). (iii) Finally, we embed the results of this probabilistic model into a classification framework for detecting crowd-manipulated reviews. We find that the proposed approach achieves up to 0.96 AUC, outperforming both traditional detection methods and a SimRank-based alternative clustering approach.

Session 3C: Entities

Relevance Scores for Triples from Type-Like Relations BIBAFull-Text 243-252
  Hannah Bast; Björn Buchhold; Elmar Haussmann
We compute and evaluate relevance scores for knowledge-base triples from type-like relations. Such a score measures the degree to which an entity "belongs" to a type. For example, Quentin Tarantino has various professions, including Film Director, Screenwriter, and Actor. The first two would get a high score in our setting, because those are his main professions. The third would get a low score, because he mostly had cameo appearances in his own movies. Such scores are essential in the ranking for entity queries, e.g. "American actors" or "Quentin Tarantino professions". These scores are different from scores for "correctness" or "accuracy" (all three professions above are correct and accurate). We propose a variety of algorithms to compute these scores. For our evaluation we designed a new benchmark, which includes a ground truth based on about 14K human judgments obtained via crowdsourcing. Inter-judge agreement is slightly over 90%. Existing approaches from the literature give results far from the optimum. Our best algorithms achieve an agreement of about 80% with the ground truth.
Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data BIBAFull-Text 253-262
  Nikita Zhiltsov; Alexander Kotov; Fedor Nikolaev
Previously proposed approaches to ad-hoc entity retrieval in the Web of Data (ERWD) used multi-fielded representation of entities and relied on standard unigram bag-of-words retrieval models. Although retrieval models incorporating term dependencies have been shown to be significantly more effective than the unigram bag-of-words ones for ad hoc document retrieval, it is not known whether accounting for term dependencies can improve retrieval from the Web of Data. In this work, we propose a novel retrieval model that incorporates term dependencies into structured document retrieval and apply it to the task of ERWD. In the proposed model, the document field weights and the relative importance of unigrams and bigrams are optimized with respect to the target retrieval metric using a learning-to-rank method. Experiments on a publicly available benchmark indicate significant improvement of the accuracy of retrieval results by the proposed model over state-of-the-art retrieval models for ERWD.
Mining, Ranking and Recommending Entity Aspects BIBAFull-Text 263-272
  Ridho Reinanda; Edgar Meij; Maarten de Rijke
Entity queries constitute a large fraction of web search queries and most of these queries are in the form of an entity mention plus some context terms that represent an intent in the context of that entity. We refer to these entity-oriented search intents as entity aspects. Recognizing entity aspects in a query can improve various search applications such as providing direct answers, diversifying search results, and recommending queries. In this paper we focus on the tasks of identifying, ranking, and recommending entity aspects, and propose an approach that mines, clusters, and ranks such aspects from query logs. We perform large-scale experiments based on users' search sessions from actual query logs to evaluate the aspect ranking and recommendation tasks. In the aspect ranking task, we aim to satisfy most users' entity queries, and evaluate this task in a query-independent fashion. We find that entropy-based methods achieve the best performance compared to maximum likelihood and language modeling approaches. In the aspect recommendation task, we recommend other aspects related to the aspect currently being queried. We propose two approaches based on semantic relatedness and aspect transitions within user sessions and find that a combined approach gives the best performance. As an additional experiment, we utilize entity aspects for actual query recommendation and find that our approach improves the effectiveness of query recommendations built on top of the query-flow graph.

Session 4A: User Models

Bayesian Ranker Comparison Based on Historical User Interactions BIBAFull-Text 273-282
  Artem Grotov; Shimon Whiteson; Maarten de Rijke
We address the problem of how to safely compare rankers for information retrieval. In particular, we consider how to control the risks associated with switching from an existing production ranker to a new candidate ranker. Whereas existing online comparison methods require showing potentially suboptimal result lists to users during the comparison process, which can lead to user frustration and abandonment, our approach only requires user interaction data generated through the natural use of the production ranker. Specifically, we propose a Bayesian approach for (1) comparing the production ranker to candidate rankers and (2) estimating the confidence of this comparison. The comparison of rankers is performed using click model-based information retrieval metrics, while the confidence of the comparison is derived from Bayesian estimates of uncertainty in the underlying click model. These confidence estimates are then used to determine whether a risk-averse decision criterion for switching to the candidate ranker has been satisfied. Experimental results on several learning to rank datasets and on a click log show that the proposed approach outperforms an existing ranker comparison method that does not take uncertainty into account.
Incorporating Non-sequential Behavior into Click Models BIBAFull-Text 283-292
  Chao Wang; Yiqun Liu; Meng Wang; Ke Zhou; Jian-yun Nie; Shaoping Ma
Click-through information is considered a valuable source of users' implicit relevance feedback. As user behavior is usually influenced by a number of factors such as position, presentation style and site reputation, researchers have proposed a variety of assumptions (i.e. click models) to generate a reasonable estimation of result relevance. The construction of click models usually follows some hypotheses. For example, most existing click models follow the sequential examination hypothesis in which users examine results from top to bottom in a linear fashion. While these click models have been successful, many recent studies showed that there is a large proportion of non-sequential browsing (both examination and click) behaviors in Web search, which the previous models fail to cope with. In this paper, we investigate the problem of properly incorporating non-sequential behavior into click models. We first carry out a laboratory eye-tracking study to analyze users' non-sequential examination behavior and then propose a novel click model named Partially Sequential Click Model (PSCM) that captures the practical behavior of users. We compare PSCM with a number of existing click models using two real-world search engine logs. Experimental results show that PSCM outperforms other click models in terms of both predicting click behavior (perplexity) and estimating result relevance (NDCG and user preference test). We also publicize the implementations of PSCM and related datasets for possible future comparison studies.
Untangling Result List Refinement and Ranking Quality: a Framework for Evaluation and Prediction BIBAFull-Text 293-302
  Jiyin He; Marc Bron; Arjen de Vries; Leif Azzopardi; Maarten de Rijke
Traditional batch evaluation metrics assume that user interaction with search results is limited to scanning down a ranked list. However, modern search interfaces come with additional elements supporting result list refinement (RLR) through facets and filters, making user search behavior increasingly dynamic. We develop an evaluation framework that takes a step beyond the interaction assumption of traditional evaluation metrics and allows for batch evaluation of systems with and without RLR elements. In our framework we model user interaction as switching between different sublists. This provides a measure of user effort based on the joint effect of user interaction with RLR elements and result quality. We validate our framework by conducting a user study and comparing model predictions with real user performance. Our model predictions show significant positive correlation with real user effort. Further, in contrast to traditional evaluation metrics, the predictions using our framework, of when users stand to benefit from RLR elements, reflect findings from our user study.
   Finally, we use the framework to investigate under what conditions systems with and without RLR elements are likely to be effective. We simulate varying conditions concerning ranking quality, users, task and interface properties demonstrating a cost-effective way to study whole system performance.

Session 4B: Recommending

WEMAREC: Accurate and Scalable Recommendation through Weighted and Ensemble Matrix Approximation BIBAFull-Text 303-312
  Chao Chen; Dongsheng Li; Yingying Zhao; Qin Lv; Li Shang
Matrix approximation is one of the most effective methods for collaborative filtering-based recommender systems. However, the high computation complexity of matrix factorization on large datasets limits its scalability. Prior solutions have adopted co-clustering methods to partition a large matrix into a set of smaller submatrices, which can then be processed in parallel to improve scalability. The drawback is that the recommendation accuracy is lower as the submatrices only contain subsets of the user-item rating information. This paper presents WEMAREC, a weighted and ensemble matrix approximation method for accurate and scalable recommendation. It builds upon the intuition that (sub)matrices containing more frequent samples of certain user/item/rating tend to make more reliable rating predictions for these specific user/item/rating. WEMAREC consists of two important components: (1) a weighting strategy that is computed based on the rating distribution in each submatrix and applied to approximate a single matrix containing those submatrices; and (2) an ensemble strategy that leverages user-specific and item-specific rating distributions to combine the approximation matrices of multiple sets of co-clustering results. Evaluations using real-world datasets demonstrate that WEMAREC outperforms state-of-the-art matrix approximation methods in recommendation accuracy (0.5-11.9% on the MovieLens dataset and 2.2-13.1% on the Netflix dataset) with 3-10X improvement on scalability.
Effective Latent Models for Binary Feedback in Recommender Systems BIBAFull-Text 313-322
  Maksims Volkovs; Guang Wei Yu
In many collaborative filtering (CF) applications, latent approaches are the preferred model choice due to their ability to generate real-time recommendations efficiently. However, the majority of existing latent models are not designed for implicit binary feedback (views, clicks, plays etc.) and perform poorly on data of this type. Developing accurate models from implicit feedback is becoming increasingly important in CF since implicit feedback can often be collected at lower cost and in much larger quantities than explicit preferences. The need for accurate latent models for implicit data was further emphasized by the recently conducted Million Song Dataset Challenge organized by Kaggle. In this challenge, the results for the best latent model were orders of magnitude worse than neighbor-based approaches, and all the top performing teams exclusively used neighbor-based models. We address this problem and propose a new latent approach for binary feedback in CF. In our model, neighborhood similarity information is used to guide latent factorization and derive accurate latent representations. We show that even with simple factorization methods like SVD, our approach outperforms existing models and produces state-of-the-art results.
Personalized Recommendation via Parameter-Free Contextual Bandits BIBAFull-Text 323-332
  Liang Tang; Yexi Jiang; Lei Li; Chunqiu Zeng; Tao Li
Personalized recommendation services have gained increasing popularity and attention in recent years as most useful information can be accessed online in real-time. Most online recommender systems try to address the information needs of users by virtue of both user and content information. Despite extensive recent advances, the problem of personalized recommendation remains challenging for at least two reasons. First, the user and item repositories undergo frequent changes, which makes traditional recommendation algorithms ineffective. Second, the so-called cold-start problem is difficult to address, as the information for learning a recommendation model is limited for new items or new users. Both challenges are formed by the dilemma of exploration and exploitation. In this paper, we formulate personalized recommendation as a contextual bandit problem to solve the exploration/exploitation dilemma. Specifically in our work, we propose a parameter-free bandit strategy, which employs a principled resampling approach called online bootstrap, to derive the distribution of estimated models in an online manner. Under the paradigm of probability matching, the proposed algorithm randomly samples a model from the derived distribution for every recommendation. Extensive empirical experiments on two real-world collections of web data (including online advertising and news recommendation) demonstrate the effectiveness of the proposed algorithm in terms of the click-through rate. The experimental results also show that this proposed algorithm is robust in the cold-start situation, in which there is no sufficient data or knowledge to tune the hyper-parameters.

Session 4C: Classifying & Ranking

An Efficient and Scalable MetaFeature-based Document Classification Approach based on Massively Parallel Computing BIBAFull-Text 333-342
  Sérgio Canuto; Marcos Gonçalves; Wisllay Santos; Thierson Rosa; Wellington Martins
The unprecedented growth of available data nowadays has stimulated the development of new methods for organizing and extracting useful knowledge from this immense amount of data. Automatic Document Classification (ADC) is one such method, which uses machine learning techniques to build models capable of automatically associating documents to well-defined semantic classes. ADC is the basis of many important applications such as language identification, sentiment analysis, recommender systems and spam filtering, among others. Recently, the use of meta-features has been shown to substantially improve the effectiveness of ADC algorithms. In particular, the use of meta-features that make a combined use of local information (through kNN-based features) and global information (through category centroids) has produced promising results. However, the generation of these meta-features is very costly in terms of both memory consumption and runtime, since there is the need to constantly call the kNN algorithm. We take advantage of the current manycore GPU architecture and present a massively parallel version of the kNN algorithm for highly dimensional and sparse datasets (which is the case for ADC). Our experimental results show that we can obtain speedup gains of up to 15x while reducing memory consumption by more than 5000x when compared to a state-of-the-art parallel baseline. This opens up the possibility of applying meta-feature-based classification in large collections of documents that would otherwise take too much time or require the use of an expensive computational platform.
Listwise Collaborative Filtering BIBAFull-Text 343-352
  Shanshan Huang; Shuaiqiang Wang; Tie-Yan Liu; Jun Ma; Zhumin Chen; Jari Veijalainen
Recently, ranking-oriented collaborative filtering (CF) algorithms have achieved great success in recommender systems. They obtained state-of-the-art performances by estimating a preference ranking of items for each user rather than estimating the absolute ratings on unrated items (as conventional rating-oriented CF algorithms do). In this paper, we propose a new ranking-oriented CF algorithm, called ListCF. Following the memory-based CF framework, ListCF directly predicts a total order of items for each user based on similar users' probability distributions over permutations of the items, and thus differs from previous ranking-oriented memory-based CF algorithms that focus on predicting the pairwise preferences between items. One important advantage of ListCF lies in its ability to reduce the computational complexity of the training and prediction procedures while achieving the same or better ranking performances as compared to previous ranking-oriented memory-based CF algorithms. Extensive experiments on three benchmark datasets against several state-of-the-art baselines demonstrate the effectiveness of our proposal.
BROOF: Exploiting Out-of-Bag Errors, Boosting and Random Forests for Effective Automated Classification BIBAFull-Text 353-362
  Thiago Salles; Marcos Gonçalves; Victor Rodrigues; Leonardo Rocha
Random Forests (RF) and Boosting are two of the most successful supervised learning paradigms for automatic classification. In this work we propose to combine both strategies in order to exploit their strengths while simultaneously solving some of their drawbacks, especially when applied to high-dimensional and noisy classification tasks. More specifically, we propose a boosted version of the RF classifier (BROOF), which fits an additive model composed of several random forests (as weak learners). Unlike traditional boosting methods, which exploit the training error estimate, we here use the stronger out-of-bag (OOB) error estimate, which is an out-of-the-box estimate naturally produced by the bagging method used in RFs. The influence of each weak learner in the fitted additive model is inversely proportional to its OOB error. Moreover, the probability of selecting an out-of-bag training example is increased if misclassified by the simpler weak learners, in order to enable the boosted model to focus on complex regions of the input space. We also adopt a selective weight updating procedure, whereby only the out-of-bag examples are updated as the boosting iterations go by. This serves the purpose of slowing down the tendency to focus on just a few hard-to-classify examples. By mitigating this undesired bias known to affect boosting algorithms under high dimensional and noisy scenarios -- due to both the selective weighting schema and a proper weak-learner effectiveness assessment -- we greatly improve classification effectiveness.
Our experiments with several datasets in three representative high-dimensional and noisy domains -- topic, sentiment and microarray data classification -- and up to ten state-of-the-art classifiers (covering almost 500 results), show that BROOF is the only classifier to be among the top performers in all tested datasets from the topic classification domain, and in the vast majority of cases in sentiment and microarray domains, a surprising result given the knowledge that there is no single top-notch classifier for all datasets.

Session 5A: Deep Learning

Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings BIBAFull-Text 363-372
  Ivan Vulic; Marie-Francine Moens
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors known as word embeddings (WE) from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be presented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR using benchmarking CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by the combination of the WE-based approach and a unigram language model. We also report on significant improvements in ad-hoc IR tasks of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA).
Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks BIBAFull-Text 373-382
  Aliaksei Severyn; Alessandro Moschitti
Learning a similarity function between pairs of objects is at the core of learning to rank approaches. In information retrieval tasks we typically deal with query-document pairs, in question answering -- question-answer pairs. However, before learning can take place, such pairs need to be mapped from the original space of symbolic words into some feature space encoding various aspects of their relatedness, e.g. lexical, syntactic and semantic. Feature engineering is often a laborious task and may require external knowledge sources that are not always available or are difficult to obtain. Recently, deep learning approaches have gained a lot of attention from the research community and industry for their ability to automatically learn optimal feature representation for a given task, while claiming state-of-the-art performance in many tasks in computer vision, speech recognition and natural language processing. In this paper, we present a convolutional neural network architecture for reranking pairs of short texts, where we learn the optimal representation of text pairs and a similarity function to relate them in a supervised way from the available training data. Our network takes only words in the input, thus requiring minimal preprocessing. In particular, we consider the task of reranking short text pairs where elements of the pair are sentences. We test our deep learning system on two popular retrieval tasks from TREC: Question Answering and Microblog Retrieval. Our model demonstrates strong performance on the first task beating previous state-of-the-art systems by about 3% absolute points in both MAP and MRR and shows comparable results on tweet reranking, while enjoying the benefits of no manual feature engineering and no additional syntactic parsers.
Context- and Content-aware Embeddings for Query Rewriting in Sponsored Search BIBAFull-Text 383-392
  Mihajlo Grbovic; Nemanja Djuric; Vladan Radosavljevic; Fabrizio Silvestri; Narayan Bhamidipati
Search engines represent one of the most popular web services, visited by more than 85% of internet users on a daily basis. Advertisers are interested in making use of this vast business potential, as the very clear intent signal communicated through the issued query allows effective targeting of users. This idea is embodied in a sponsored search model, where each advertiser maintains a list of keywords they deem indicative of increased user response rate with regard to their business. According to this targeting model, when a query is issued all advertisers with a matching keyword are entered into an auction according to the amount they bid for the query, and the winner gets to show their ad. One of the main challenges is the fact that a query may not match many keywords, resulting in lower auction value, lower ad quality, and lost revenue for advertisers and publishers. A possible solution, called query rewriting, is to expand a query into a set of related queries and use them to increase the number of matched ads. To this end, we propose a rewriting method based on a novel query embedding algorithm, which jointly models query content as well as its context within a search session. As a result, queries with similar content and context are mapped into vectors close in the embedding space, which allows expansion of a query via simple K-nearest neighbor search in the projected space. The method was trained on more than 12 billion sessions, one of the largest corpuses reported thus far, and evaluated on both a public TREC data set and an in-house sponsored search data set. The results show the proposed approach significantly outperformed the existing state of the art, strongly indicating its benefits and the monetization potential.

Session 5B: Products

Retrieval of Relevant Opinion Sentences for New Products BIBAFull-Text 393-402
  Dae Hoon Park; Hyun Duk Kim; ChengXiang Zhai; Lifan Guo
With the rapid development of Internet and E-commerce, abundant product reviews have been written by consumers who bought the products. These reviews are very useful for consumers to optimize their purchasing decisions. However, since the reviews are all written by consumers who have bought and used a product, there are generally very few or even no reviews available for a new product or an unpopular product. We study the novel problem of retrieving relevant opinion sentences from the reviews of other products using specifications of a new or unpopular product as query. Our key idea is to leverage product specifications to assess product similarity between the query product and other products and extract relevant opinion sentences from the similar products where a consumer may find useful discussions. Then, we provide ranked opinion sentences for the query product that has no user-generated reviews. We first propose a popular summarization method and its modified version to solve the problem. Then, we propose our novel probabilistic methods. Experiment results show that the proposed methods can effectively retrieve useful opinion sentences for products that have no reviews.
Learning Hierarchical Representation Model for NextBasket Recommendation BIBAFull-Text 403-412
  Pengfei Wang; Jiafeng Guo; Yanyan Lan; Jun Xu; Shengxian Wan; Xueqi Cheng
Next basket recommendation is a crucial task in market basket analysis. Given a user's purchase history, usually a sequence of transaction data, one attempts to build a recommender that can predict the next few items that the user most probably would like. Ideally, a good recommender should be able to explore the sequential behavior (i.e., buying one item leads to buying another next), as well as account for users' general taste (i.e., what items a user is typically interested in) for recommendation. Moreover, these two factors may interact with each other to influence users' next purchase. To tackle the above problems, in this paper, we introduce a novel recommendation approach, namely hierarchical representation model (HRM). HRM can well capture both sequential behavior and users' general taste by involving transaction and user representations in prediction. Meanwhile, the flexibility of applying different aggregation operations, especially nonlinear operations, on representations allows us to model complicated interactions among different factors. Theoretically, we show that our model subsumes several existing methods when choosing proper aggregation operations. Empirically, we demonstrate that our model can consistently outperform the state-of-the-art baselines under different evaluation metrics on real-world transaction data.
Parametric and Non-parametric User-aware Sentiment Topic Models BIBAFull-Text 413-422
  Zaihan Yang; Alexander Kotov; Aravind Mohan; Shiyong Lu
The popularity of Web 2.0 has resulted in a large number of publicly available online consumer reviews created by a demographically diverse user base. Information about the authors of these reviews, such as age, gender and location, provided by many online consumer review platforms may allow companies to better understand the preferences of different market segments and improve their product design, manufacturing processes and marketing campaigns accordingly. However, previous work in sentiment analysis has largely ignored these additional user meta-data. To address this deficiency, in this paper, we propose parametric and non-parametric User-aware Sentiment Topic Models (USTM) that incorporate demographic information of review authors into the topic modeling process in order to discover associations between market segments, topical aspects and sentiments. Qualitative examination of the topics discovered using the USTM framework in the two datasets collected from popular online consumer review platforms as well as quantitative evaluation of the methods utilizing those topics for the tasks of review sentiment classification and user attribute prediction both indicate the utility of accounting for demographic information of review authors in opinion mining.

Session 5C: Locations

Learning to Extract Local Events from the Web BIBAFull-Text 423-432
  John Foley; Michael Bendersky; Vanja Josifovski
The goal of this work is extraction and retrieval of local events from web pages. Examples of local events include small venue concerts, theater performances, garage sales, movie screenings, etc. We collect these events in the form of retrievable calendar entries that include structured information about event name, date, time and location. Between existing information extraction techniques and the availability of information on social media and semantic web technologies, there are numerous ways to collect commercial, high-profile events. However, most extraction techniques require domain-level supervision, which is not attainable at web scale. Similarly, while the adoption of the semantic web has grown, there will always be organizations without the resources or the expertise to add machine-readable annotations to their pages. Therefore, our approach bootstraps these explicit annotations to massively scale up local event extraction.
   We propose a novel event extraction model that uses distant supervision to assign scores to individual event fields (event name, date, time and location) and a structural algorithm to optimally group these fields into event records. Our model integrates information from both the entire source document and its relevant sub-regions, and is highly scalable.
   We evaluate our extraction model on all 700 million documents in a large publicly available web corpus, ClueWeb12. Using the 217,000 unique explicitly annotated events as distant supervision, we are able to double recall with 85% precision and quadruple it with 65% precision, with no additional human supervision. We also show that our model can be bootstrapped for a fully supervised approach, which can further improve the precision by 30%.
   In addition, we evaluate the geographic coverage of the extracted events. We find that there is a significant increase in the geo-diversity of extracted events compared to existing explicit annotations, while maintaining high precision levels.
Rank-GeoFM: A Ranking based Geographical Factorization Method for Point of Interest Recommendation BIBAFull-Text 433-442
  Xutao Li; Gao Cong; Xiao-Li Li; Tuan-Anh Nguyen Pham; Shonali Krishnaswamy
With the rapid growth of location-based social networks, Point of Interest (POI) recommendation has become an important research problem. However, the scarcity of the check-in data, a type of implicit feedback data, poses a severe challenge for existing POI recommendation methods. Moreover, different types of context information about POIs are available and how to leverage them becomes another challenge. In this paper, we propose a ranking based geographical factorization method, called Rank-GeoFM, for POI recommendation, which addresses the two challenges. In the proposed model, we consider that the check-in frequency characterizes users' visiting preference and learn the factorization by ranking the POIs correctly. In our model, POIs both with and without check-ins will contribute to learning the ranking and thus the data sparsity problem can be alleviated. In addition, our model can easily incorporate different types of context information, such as the geographical influence and temporal influence. We propose a stochastic gradient descent based algorithm to learn the factorization. Experiments on publicly available datasets under both user-POI setting and user-time-POI setting have been conducted to test the effectiveness of the proposed method. Experimental results under both settings show that the proposed method outperforms the state-of-the-art methods significantly in terms of recommendation accuracy.
GeoSoCa: Exploiting Geographical, Social and Categorical Correlations for Point-of-Interest Recommendations BIBAFull-Text 443-452
  Jia-Dong Zhang; Chi-Yin Chow
Recommending preferred points-of-interest (POIs), e.g., museums and restaurants, to users has become an important feature for location-based social networks (LBSNs), which helps people explore new places and helps businesses discover potential customers. However, because users only check in at a few POIs in an LBSN, the user-POI check-in interaction is highly sparse, which poses a big challenge for POI recommendations. To tackle this challenge, in this study we propose a new POI recommendation approach called GeoSoCa through exploiting geographical correlations, social correlations and categorical correlations among users and POIs. The geographical, social and categorical correlations can be learned from the historical check-in data of users on POIs and utilized to predict the relevance score of a user to an unvisited POI so as to make recommendations for users. First, in GeoSoCa we propose a kernel estimation method with an adaptive bandwidth to determine a personalized check-in distribution of POIs for each user that naturally models the geographical correlations between POIs. Then, GeoSoCa aggregates the check-in frequency or rating of a user's friends on a POI and models the social check-in frequency or rating as a power-law distribution to employ the social correlations between users. Further, GeoSoCa applies the bias of a user on a POI category to weigh the popularity of a POI in the corresponding category and models the weighed popularity as a power-law distribution to leverage the categorical correlations between POIs. Finally, we conduct a comprehensive performance evaluation for GeoSoCa using two large-scale real-world check-in data sets collected from Foursquare and Yelp. Experimental results show that GeoSoCa achieves significantly superior recommendation quality compared to other state-of-the-art POI recommendation techniques.

Session 6A: Experiment Design

Optimised Scheduling of Online Experiments BIBAFull-Text 453-462
  Eugene Kharitonov; Craig Macdonald; Pavel Serdyukov; Iadh Ounis
Modern search engines increasingly rely on online evaluation methods such as A/B tests and interleaving. These online evaluation methods make use of interactions by the search engine's users to test various changes in the search engine. However, since the number of user sessions per unit of time is limited, the number of simultaneously running online evaluation experiments is bounded. In an extreme case, it might be impossible to deploy all experiments since they arrive faster than they can be processed. Consequently, it is very important to efficiently use the limited resource of user interactions. In this paper, we formulate the novel problem of schedule optimisation for the queue of online experiments: given a limited number of user interactions available for experimentation, we want to re-order the queue so that the number of successful experiments is maximised. In order to build a schedule optimisation algorithm, we start by formulating a model of an online experimentation pipeline. Next, we propose to reduce the task of finding the optimal schedule to a learning-to-rank problem, where we require the most promising experiments to be ranked first in the schedule. To evaluate the proposed approach, we perform an evaluation study using two datasets containing 82 interleaving and 35 A/B test experiments, performed by a commercial search engine. We measure the quality of a schedule as the number of successful experiments executed under the limited resource of user interactions. Our proposed schedulers obtain improvements of up to 342% compared to the un-optimised baseline schedule on the dataset of interleaving experiments and up to 43% on the dataset of A/B tests.
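The scheduling idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Experiment` fields, the `predicted_success` score (standing in for the output of a learned ranker) and the rule of stopping at the first experiment that no longer fits the budget are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    predicted_success: float  # hypothetical score from a learned ranker
    cost: int                 # hypothetical number of user interactions needed


def schedule(queue, budget):
    """Run experiments in order of predicted success until the budget runs out."""
    ranked = sorted(queue, key=lambda e: e.predicted_success, reverse=True)
    executed = []
    remaining = budget
    for experiment in ranked:
        if experiment.cost > remaining:
            break  # assumption: the queue is processed strictly in order
        executed.append(experiment)
        remaining -= experiment.cost
    return executed


queue = [
    Experiment("A", 0.2, 50),
    Experiment("B", 0.9, 40),
    Experiment("C", 0.6, 40),
    Experiment("D", 0.8, 30),
]
executed_names = [e.name for e in schedule(queue, budget=100)]
```

With a budget of 100 interactions, the most promising experiments are placed first, so the limited budget is spent on the experiments judged most likely to succeed rather than on whatever happened to arrive first.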
Predicting Search Satisfaction Metrics with Interleaved Comparisons BIBAFull-Text 463-472
  Anne Schuth; Katja Hofmann; Filip Radlinski
The gold standard for online retrieval evaluation is AB testing. Rooted in the idea of a controlled experiment, AB tests compare the performance of an experimental system (treatment) on one sample of the user population, to that of a baseline system (control) on another sample. Given an online evaluation metric that accurately reflects user satisfaction, these tests enjoy high validity. However, due to the high variance across users, these comparisons often have low sensitivity, requiring millions of queries to detect statistically significant differences between systems. Interleaving is an alternative online evaluation approach, where each user is presented with a combination of results from both the control and treatment systems. Compared to AB tests, interleaving has been shown to be substantially more sensitive. However, interleaving methods have so far focused on user clicks only, and lack support for more sophisticated user satisfaction metrics as used in AB testing. In this paper we present the first method for integrating user satisfaction metrics with interleaving. We show how interleaving can be extended to (1) directly match user signals and parameters of AB metrics, and (2) how parameterized interleaving credit functions can be automatically calibrated to predict AB outcomes. We also develop a new method for estimating the relative sensitivity of interleaving and AB metrics, and show that our interleaving credit functions improve agreement with AB metrics without sacrificing sensitivity. Our results, using 38 large-scale online experiments encompassing over 3 billion clicks in a web search setting, demonstrate up to a 22% improvement in agreement with AB metrics (constituting over a 50% error reduction), while maintaining sensitivity of one to two orders of magnitude above the AB tests. This paves the way towards more sensitive and accurate online evaluation.
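For readers unfamiliar with interleaving, the sketch below shows team-draft interleaving with simple click-count credit. This is one well-known interleaving scheme, chosen here as an assumption; the abstract does not state which interleaving method or credit function the authors use, and the function names are hypothetical.

```python
import random


def team_draft_interleave(control, treatment, length, rng):
    """Build one interleaved list; return it with each document's team."""
    rankings = {"control": control, "treatment": treatment}
    picks = {"control": 0, "treatment": 0}
    interleaved, teams = [], {}
    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["control"] < picks["treatment"]:
            team = "control"
        elif picks["treatment"] < picks["control"]:
            team = "treatment"
        else:
            team = rng.choice(["control", "treatment"])
        candidate = next((d for d in rankings[team] if d not in teams), None)
        if candidate is None:
            # This team has no unused documents left; let the other team pick.
            team = "treatment" if team == "control" else "control"
            candidate = next((d for d in rankings[team] if d not in teams), None)
            if candidate is None:
                break  # both rankings are exhausted
        interleaved.append(candidate)
        teams[candidate] = team
        picks[team] += 1
    return interleaved, teams


def click_credit(clicked_docs, teams):
    """Credit each click to the team that contributed the clicked document."""
    wins = {"control": 0, "treatment": 0}
    for doc in clicked_docs:
        if doc in teams:
            wins[teams[doc]] += 1
    return wins


interleaved, teams = team_draft_interleave(
    ["a", "b", "c", "d"], ["b", "e", "a", "f"], 4, random.Random(0)
)
wins = click_credit([interleaved[0]], teams)
```

A parameterized credit function, as described in the abstract, would replace the fixed `+= 1` with a weight computed from richer user signals; that extension is not shown here.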
Sequential Testing for Early Stopping of Online Experiments BIBAFull-Text 473-482
  Eugene Kharitonov; Aleksandr Vorobev; Craig Macdonald; Pavel Serdyukov; Iadh Ounis
Online evaluation methods, such as A/B and interleaving experiments, are widely used for search engine evaluation. Since they rely on noisy implicit user feedback, running each experiment takes a considerable time. Recently, the problem of reducing the duration of online experiments has received substantial attention from the research community. However, the possibility of using sequential statistical testing procedures for reducing the time required for the evaluation experiments remains less studied. Such sequential testing procedures allow an experiment to stop early, once the data collected is sufficient to make a conclusion. In this work, we study the usefulness of sequential testing procedures for both interleaving and A/B testing. We propose modified versions of the O'Brien & Fleming and MaxSPRT sequential tests that are applicable for testing in the interleaving scenario. Similarly, for A/B experiments, we assess the usefulness of the O'Brien & Fleming test, as well as that of our proposed MaxSPRT-based sequential testing procedure. In our experiments on datasets containing 115 interleaving and 41 A/B testing experiments, we observe that considerable reductions in the average experiment duration can be achieved by using our proposed tests. In particular, for A/B experiments, the average experiment durations can be reduced by up to 66% in comparison with a single step test procedure, and by up to 44% in comparison with the O'Brien & Fleming test. Similarly, a marked relative reduction of 63% in the duration of the interleaving experiments can be achieved.

Session 6B: Predicting

Inferring Searcher Attention by Jointly Modeling User Interactions and Content Salience BIBAFull-Text 483-492
  Dmitry Lagun; Eugene Agichtein
Modeling and predicting user attention is crucial for interpreting search behavior. The numerous applications include quantifying web search satisfaction, estimating search quality, and measuring and predicting online user engagement. While prior research has demonstrated the value of mouse cursor data and other interactions as a rough proxy of user attention, precisely predicting where a user is looking on a page remains a challenge, exacerbated in Web pages beyond the traditional search results. To improve attention prediction on a wider variety of Web pages, we propose a new way of modeling searcher behavior data by connecting the user interactions to the underlying Web page content. Specifically, we propose a principled model for predicting a searcher's gaze position on a page, which we call Mixture of Interactions and Content Salience (MICS). To our knowledge, our model is the first to effectively combine user interaction data, such as mouse cursor and scrolling positions, with the visual prominence, or salience, of the page content elements. Extensive experiments on multiple popular types of Web content demonstrate that the proposed MICS model significantly outperforms previous approaches to searcher gaze prediction that use only the interaction information. Grounding the observed interactions to the underlying page content provides a general and robust approach to user attention modeling, enabling more powerful tools for search behavior interpretation and ultimately search quality improvements.
Different Users, Different Opinions: Predicting Search Satisfaction with Mouse Movement Information BIBAFull-Text 493-502
  Yiqun Liu; Ye Chen; Jinhui Tang; Jiashen Sun; Min Zhang; Shaoping Ma; Xuan Zhu
Satisfaction prediction is one of the prime concerns in search performance evaluation. It is a non-trivial task for two major reasons: (1) The definition of satisfaction is rather subjective and different users may have different opinions in satisfaction judgement. (2) Most existing studies on satisfaction prediction mainly rely on users' click-through or query reformulation behaviors but there are many sessions without such interactions. To shed light on these research questions, we construct an experimental search engine that could collect users' satisfaction feedback as well as mouse click-through/movement data. Different from existing studies, we compare for the first time search users' and external assessors' opinions on satisfaction. We find that search users pay more attention to the utility of results while external assessors emphasize the effort spent in search sessions. Inspired by recent studies in predicting result relevance based on mouse movement patterns (namely motifs), we propose to estimate the utilities of search results and the efforts in search sessions with motifs extracted from mouse movement data on search result pages (SERPs). Besides the existing frequency-based motif selection method, two novel selection strategies (distance-based and distribution-based) are also adopted to extract high quality motifs for satisfaction prediction. Experimental results on over 1,000 user sessions show that the proposed strategies outperform existing methods and also have promising generalization capability for different users and queries.
Predicting Search Intent Based on Pre-Search Context BIBAFull-Text 503-512
  Weize Kong; Rui Li; Jie Luo; Aston Zhang; Yi Chang; James Allan
While many studies have been conducted on query understanding, there is limited understanding of why users start searches and how to predict search intent. In this paper, we propose to study this important but less explored problem. Our key intuition is that searches are triggered by different pre-search contexts, but the triggering relations are often hidden. For example, a user may search "bitcoin" because of a news article or an email the user just read, but the system does not know which of the pre-search contexts (the news article or the email) is the triggering source. Following this intuition, we conduct an in-depth analysis of pre-search context on a large-scale user log, which not only verifies the hidden triggering relations in the real world but also identifies a set of important characteristics of pre-search context and their triggered queries. Since the hidden triggering relations make it challenging to directly use pre-search context for intent prediction, we develop a mixture generative model to learn without any supervision how queries are triggered by different types of pre-search context. Further, we discuss how to apply our model to improve query prediction and query auto-completion. Our experiments on large-scale real-world data show that our model could accurately predict user search intent with pre-search context and improve upon the state-of-the-art methods significantly.

Session 6C: Tasks and Devices

Leveraging Procedural Knowledge for Task-oriented Search BIBAFull-Text 513-522
  Zi Yang; Eric Nyberg
Many search engine users attempt to satisfy an information need by issuing multiple queries, with the expectation that each result will contribute some portion of the required information. Previous research has shown that structured or semi-structured descriptive knowledge bases (such as Wikipedia) can be used to improve search quality and experience for general or entity-centric queries. However, such resources do not have sufficient coverage of procedural knowledge, i.e. what actions should be performed and what factors should be considered to achieve some goal; such procedural knowledge is crucial when responding to task-oriented search queries. This paper provides a first attempt to bridge the gap between two evolving research areas: development of procedural knowledge bases (such as wikiHow) and task-oriented search. We investigate whether task-oriented search can benefit from existing procedural knowledge (search task suggestion) and whether automatic procedural knowledge construction can benefit from users' search activities (automatic procedural knowledge base construction). We propose to create a three-way parallel corpus of queries, query contexts, and task descriptions, and reduce both problems to sequence labeling tasks. We propose a set of textual features and structural features to identify key search phrases from task descriptions, and then adapt similar features to extract wikiHow-style procedural knowledge descriptions from search queries and relevant text snippets. We compare our proposed solution with baseline algorithms, commercial search engines, and the (manually-curated) wikiHow procedural knowledge; experimental results show an improvement of +0.28 to +0.41 in terms of Precision@8 and mean average precision (MAP).
Personalizing Search on Shared Devices BIBAFull-Text 523-532
  Ryen W. White; Ahmed Hassan Awadallah
Search personalization tailors the search experience to individual searchers. To do this, search engines construct interest models comprising signals from observed behavior associated with machines, often via Web browser cookies or other user identifiers. However, shared device usage is common, meaning that the activities of multiple searchers may be interwoven in the interest models generated. Recent research on activity attribution has led to methods to automatically disentangle the histories of multiple searchers and correctly ascribe newly-observed search activity to the correct person. Building on this, we introduce attribution-based personalization (ABP), a procedure that extends traditional personalization to target individual searchers on shared devices. Activity attribution may improve personalization, but its benefits are not yet fully understood. We present an oracle study (with perfect knowledge of which searchers perform each action on each machine) to understand the effectiveness of ABP in predicting searchers' future interests. We utilize a large Web search log dataset containing both person identifiers and machine identifiers to quantify the gain in personalization performance from ABP, identify the circumstances under which ABP is most effective, and develop a classifier to determine when to apply it that yields sizable gains in personalization performance. ABP allows search providers to personalize experiences for individuals rather than targeting all users of a device collectively.
Leveraging User Reviews to Improve Accuracy for Mobile App Retrieval BIBAFull-Text 533-542
  Dae Hoon Park; Mengwen Liu; ChengXiang Zhai; Haohong Wang
Smartphones and tablets with their apps have pervaded our everyday life, leading to a new demand for search tools to help users find the right apps to satisfy their immediate needs. While there are a few commercial mobile app search engines available, the new task of mobile app retrieval has not yet been rigorously studied. Indeed, there does not yet exist a test collection for quantitatively evaluating this new retrieval task. In this paper, we first study the effectiveness of the state-of-the-art retrieval models for the app retrieval task using a new app retrieval test data set we created. We then propose and study a novel approach that generates a new representation for each app. Our key idea is to leverage user reviews to find out important features of apps and bridge the vocabulary gap between app developers and users. Specifically, we jointly model app descriptions and user reviews using a topic model in order to generate app representations while excluding noise in reviews. Experiment results indicate that the proposed approach is effective and outperforms the state-of-the-art retrieval models for app retrieval.

Keynote

Towards a Game-Theoretic Framework for Information Retrieval BIBAFull-Text 543
  ChengXiang Zhai
The task of information retrieval (IR) has traditionally been defined as to rank a collection of documents in response to a query. While this definition has enabled most research progress in IR so far, it does not model accurately the actual retrieval task in a real IR application, where users tend to be engaged in an interactive process with multiple queries, and optimizing the overall performance of an IR system on an entire search session is far more important than its performance on an individual query. In this talk, I will present a new game-theoretic formulation of the IR problem where the key idea is to model information retrieval as a process of a search engine and a user playing a cooperative game, with a shared goal of satisfying the user's information need (or more generally helping the user complete a task) while minimizing the user's effort and the resource overhead on the retrieval system. Such a game-theoretic framework offers several benefits. First, it naturally suggests optimization of the overall utility of an interactive retrieval system over a whole search session, thus breaking the limitation of the traditional formulation that optimizes ranking of documents for a single query. Second, it models the interactions between users and a search engine, and thus can optimize the collaboration of a search engine and its users, maximizing the "combined intelligence" of a system and users. Finally, it can serve as a unified framework for optimizing both interactive information retrieval and active relevance judgment acquisition through crowdsourcing. I will discuss how the new framework can not only cover several emerging directions in current IR research as special cases, but also open up many interesting new research directions in IR.

Session 7A: Assessing

Representative & Informative Query Selection for Learning to Rank using Submodular Functions BIBAFull-Text 545-554
  Rishabh Mehrotra; Emine Yilmaz
The performance of Learning to Rank algorithms strongly depends on the number of labelled queries in the training set, while the cost incurred in annotating a large number of queries with relevance judgements is prohibitively high. As a result, constructing such a training dataset involves selecting a set of candidate queries for labelling. In this work, we investigate query selection strategies for learning to rank aimed at actively selecting unlabelled queries to be labelled so as to minimize the data annotation cost without degrading the ranking performance. In particular, we characterize query selection based on two aspects of informativeness and representativeness and propose two novel query selection strategies (i) Permutation Probability based query selection and (ii) Topic Model based query selection which capture the two aspects, respectively. We further argue that an ideal query selection strategy should take into account both these aspects and as our final contribution, we present a submodular objective that couples both these aspects while selecting query subsets. We evaluate the quality of the proposed strategies on three real world learning to rank datasets and show that the proposed query selection methods result in significant performance gains compared to the existing state-of-the-art approaches.
Impact of Surrogate Assessments on High-Recall Retrieval BIBAFull-Text 555-564
  Adam Roegiest; Gordon V. Cormack; Charles L. A. Clarke; Maura R. Grossman
We are concerned with the effect of using a surrogate assessor to train a passive (i.e., batch) supervised-learning method to rank documents for subsequent review, where the effectiveness of the ranking will be evaluated using a different assessor deemed to be authoritative. Previous studies suggest that surrogate assessments may be a reasonable proxy for authoritative assessments for this task. Nonetheless, concern persists in some application domains -- such as electronic discovery -- that errors in surrogate training assessments will be amplified by the learning method, materially degrading performance. We demonstrate, through a re-analysis of data used in previous studies, that, with passive supervised-learning methods, using surrogate assessments for training can substantially impair classifier performance, relative to using the same deemed-authoritative assessor for both training and assessment. In particular, using a single surrogate to replace the authoritative assessor for training often yields a ranking that must be traversed much lower to achieve the same level of recall as the ranking that would have resulted had the authoritative assessor been used for training. We also show that steps can be taken to mitigate, and sometimes overcome, the impact of surrogate assessments for training: relevance assessments may be diversified through the use of multiple surrogates; and, a more liberal view of relevance can be adopted by having the surrogate label borderline documents as relevant. By taking these steps, rankings derived from surrogate assessments can match, and sometimes exceed, the performance of the ranking that would have been achieved, had the authority been used for training. 
   Finally, we show that our results still hold when the role of surrogate and authority are interchanged, indicating that the results may simply reflect differing conceptions of relevance between surrogate and authority, as opposed to the authority having special skill or knowledge lacked by the surrogate.
The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation BIBAFull-Text 565-574
  Andrew Turpin; Falk Scholer; Stefano Mizzaro; Eddy Maddalena
Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents in the context of information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting more than 50,000 magnitude estimation judgments. Our analysis shows that on average magnitude estimation judgments are rank-aligned with ordinal judgments made by expert relevance assessors. An advantage of magnitude estimation is that users can choose their own scale for judgments, allowing deeper investigations of user perceptions than when categorical scales are used.
   We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance, in terms of how impactful relative differences in document relevance are perceived to be. We further use magnitude estimation to investigate gain profiles, comparing the currently assumed linear and exponential approaches with actual user-reported relevance perceptions. This indicates that the currently used exponential gain profiles in nDCG and ERR are mismatched with an average user, but perhaps more importantly that individual perceptions are highly variable. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold. Finally, we demonstrate that magnitude estimation judgments can be reliably collected using crowdsourcing, and are competitive in terms of assessor cost.
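To make the notion of a gain profile concrete, the sketch below computes nDCG under the two standard gain functions, linear (gain = relevance grade) and exponential (gain = 2^grade - 1), with the usual log2 rank discount. It is an illustrative sketch only: the example ranking is invented, and it does not reproduce the user-calibrated gains derived in the paper.

```python
import math


def dcg(relevances, gain):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(gain(r) / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg(relevances, gain):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), gain)
    return dcg(relevances, gain) / ideal if ideal > 0 else 0.0


def linear_gain(grade):
    return float(grade)


def exponential_gain(grade):
    return 2.0 ** grade - 1.0


# Invented example: relevance grades (0-3) of the documents at ranks 1..4.
ranking = [1, 3, 0, 2]
ndcg_linear = ndcg(ranking, linear_gain)
ndcg_exponential = ndcg(ranking, exponential_gain)
```

For this ranking the exponential profile yields the lower score, because it penalises placing the grade-3 document below a grade-1 document more heavily than the linear profile does. Swapping in a different `gain` callable is all that is needed to test another profile.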

Session 7B: Terms

Learning to Reweight Terms with Distributed Representations BIBAFull-Text 575-584
  Guoqing Zheng; Jamie Callan
Term weighting is a fundamental problem in IR research and numerous weighting models have been proposed. Proper term weighting can greatly improve retrieval accuracies, which essentially involves two types of query understanding: interpreting the query and judging the relative contribution of the terms to the query. These two steps are often dealt with separately, and complicated yet not so effective weighting strategies are proposed. In this paper, we propose to address query interpretation and term weighting in a unified framework built upon distributed representations of words from recent advances in neural network language modeling. Specifically, we represent terms and queries as vectors in the same latent space, construct features for terms using their word vectors and learn a model to map the features onto the defined target term weights. The proposed method is simple yet effective. Experiments using four collections and two retrieval models demonstrate significantly higher retrieval accuracies than baseline models.
A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution BIBAFull-Text 585-594
  Jiaul H. Paik
The main goal of a retrieval model is to measure the degree of relevance of a document with respect to the given query. Probabilistic models are widely used to measure the likelihood of relevance of a document by combining within-document term frequency and term specificity in a formal way. Recent research shows that tf normalization that factors in multiple aspects of term salience is an effective scheme. However, existing models do not fully utilize these tf normalization components in a principled way. Moreover, most state-of-the-art models ignore the distribution of a term in the part of the collection that contains the term. In this article, we introduce a new probabilistic model of ranking that addresses the above issues. We argue that, since the relevance of a document increases with the frequency of the query term, this assumption can be used to measure the likelihood that the normalized frequency of a term in a particular document will be maximum with respect to its distribution in the elite set. Thus, the weight of a term in a document is proportional to the probability that the normalized frequency of that term is maximum under the hypothesis that the frequencies are generated randomly. To that end, we introduce a ranking function based on maximum value distribution that uses two aspects of tf normalization. The merit of the proposed model is demonstrated on a number of recent large web collections. Results show that the proposed model outperforms the state-of-the-art models by a significantly large margin.
Non-Compositional Term Dependence for Information Retrieval BIBAFull-Text 595-604
  Christina Lioma; Jakob Grue Simonsen; Birger Larsen; Niels Dalum Hansen
Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly identified using lexical frequency statistics, assuming that (a) if terms co-occur often enough in some corpus, they are semantically dependent; (b) the more often they co-occur, the more semantically dependent they are. This assumption is not always correct: the frequency of co-occurring terms can be separate from the strength of their semantic dependence. E.g. "red tape" might be overall less frequent than "tape measure" in some corpus, but this does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is especially the case for non-compositional phrases, i.e. phrases whose meaning cannot be composed from the individual meanings of their terms (such as the phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction between the frequency and strength of term dependence in IR, we present a principled approach for handling term dependence in queries, using both lexical frequency and semantic evidence. We focus on non-compositional phrases, extending a recent unsupervised model for their detection (Kiela & Clark 2013) to IR. Our approach, integrated into ranking using Markov Random Fields (Metzler & Croft 2005), yields effectiveness gains over competitive TREC baselines, showing that there is still room for improvement in the very well-studied area of term dependence in IR.

Session 8A: Variability in test collections

On the Relation Between Assessor's Agreement and Accuracy in Gamified Relevance Assessment BIBAFull-Text 605-614
  Olga Megorskaya; Vladimir Kukushkin; Pavel Serdyukov
Expert judgments (labels) are widely used in Information Retrieval for the purposes of search quality evaluation and machine learning. Setting up the process of collecting such judgments is a challenge of its own, and maintaining judgment quality is an extremely important part of the process. One possible way of controlling the quality is to monitor the inter-assessor agreement level. But does the agreement level really reflect the quality of assessors' judgments? Indeed, if a group of assessors comes to a consensus, to what extent should we trust their collective opinion? In this paper, we investigate whether the agreement level can be used as a metric for estimating the quality of assessors' judgments, and provide recommendations for the design of the judgment collection workflow. Namely, we estimate the correlation between assessors' accuracy and agreement in the scope of several workflow designs and investigate which specific workflow features influence the accuracy of judgments the most.
Assessor Differences and User Preferences in Tweet Timeline Generation BIBAFull-Text 615-624
  Yulu Wang; Garrick Sherman; Jimmy Lin; Miles Efron
In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology is human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
User Variability and IR System Evaluation BIBAFull-Text 625-634
  Peter Bailey; Alistair Moffat; Falk Scholer; Paul Thomas
Test collection design eliminates sources of user variability to make statistical comparisons among information retrieval (IR) systems more affordable. Does this choice unnecessarily limit generalizability of the outcomes to real usage scenarios? We explore two aspects of user variability with regard to evaluating the relative performance of IR systems, assessing effectiveness in the context of a subset of topics from three TREC collections, with the embodied information needs categorized against three levels of increasing task complexity. First, we explore the impact of widely differing queries that searchers construct for the same information need description. By executing those queries, we demonstrate that query formulation is critical to query effectiveness. The results also show that the range of scores characterizing effectiveness for a single system arising from these queries is comparable to or greater than the range of scores arising from variation among systems using only a single query per topic. Second, our experiments reveal that searchers display substantial individual variation in the numbers of documents and queries they anticipate needing to issue, and there are underlying significant differences in these numbers in line with increasing task complexity levels. Our conclusion is that test collection design would be improved by the use of multiple query variations per topic, and could be further improved by the use of metrics which are sensitive to the expected numbers of useful documents.

Session 8B: Citations

An Entity Class-Dependent Discriminative Mixture Model for Cumulative Citation Recommendation BIBAFull-Text 635-644
  Jingang Wang; Dandan Song; Qifan Wang; Zhiwei Zhang; Luo Si; Lejian Liao; Chin-Yew Lin
This paper studies Cumulative Citation Recommendation (CCR) for Knowledge Base Acceleration (KBA). The CCR task aims to detect potential citations of a set of target entities with priorities from a large, temporally ordered stream corpus. Previous approaches for CCR that build an individual relevance model for each entity fail to handle unseen entities without annotation. A baseline solution is to build a global entity-unspecific model for all entities regardless of the relationship information among entities, which cannot guarantee a satisfactory result for each entity. In this paper, we propose a novel entity class-dependent discriminative mixture model by introducing a latent entity class layer to model the correlations between entities and latent entity classes. The model can better adjust to different types of entities and achieve better performance when dealing with a broad range of entities. An extensive set of experiments has been conducted on the TREC-KBA-2013 dataset, and the experimental results demonstrate that the proposed model can achieve state-of-the-art performance.
Scientific Information Understanding via Open Educational Resources (OER) BIBAFull-Text 645-654
  Xiaozhong Liu; Zhuoren Jiang; Liangcai Gao
Scientific publication retrieval/recommendation has been investigated in the past decade. However, to the best of our knowledge, few efforts have been made to help junior scholars and graduate students to understand and consume the essence of those scientific readings. This paper proposes a novel learning/reading environment, OER-based Collaborative PDF Reader (OCPR), that incorporates innovative scaffolding methods that can: 1. auto-characterize a student's emerging information need while reading a paper; and 2. enable students to readily access open educational resources (OER) based on their information need. By using metasearch methods, we pre-indexed 1,112,718 OERs, including presentation videos, slides, algorithm source code, or Wikipedia pages, for 41,378 STEM publications. Based on the computational information need, we use text mining and heterogeneous graph mining algorithms to recommend high quality OERs to help students better understand the scientific content in the paper. Evaluation results and exit surveys for an information retrieval course show that the OCPR system along with the recommended OERs can effectively help graduate students better understand complex STEM publications. For instance, 78.42% of participants believe the OCPR system and recommended OERs can provide precise and useful information they need, while 78.43% of them believe the recommended OERs are close to exactly what they need when reading the paper. From the OER ranking viewpoint, MRR, MAP and NDCG results prove that learning to rank and cold start solutions can efficiently integrate different text and graph ranking features.
In Situ Insights BIBAFull-Text 655-664
  Yuanhua Lv; Ariel Fuxman
When consuming content in applications such as e-readers, word processors, and Web browsers, users often see mentions of topics (or concepts) that attract their attention. In a scenario of significant practical interest, topics are explored in situ, without leaving the context of the application: The user selects a mention of a topic (in the form of continuous text), and the system subsequently recommends references (e.g., Wikipedia concepts) that are relevant in the context of the application. In order to realize this experience, it is necessary to tackle challenges that include: users may select any continuous text, even potentially noisy text for which there is no corresponding reference in the knowledge base; references must be relevant to both the user selection and the text around it; and the real estate available on the application may be constrained, thus limiting the number of results that can be shown.
   In this paper, we study this novel recommendation task, which we call in situ insights: recommending reference concepts in response to a text selection and its context, in situ within a document consumption application. We first propose a selection-centric context language model and a selection-centric context semantic model to capture user interest. Based on these models, we then measure the quality of a reference concept across three aspects: selection clarity, context coherence, and concept relevance. By leveraging all these aspects, we put forward a machine learning approach to simultaneously decide if a selection is noisy, and filter out low-quality candidate references. In order to quantitatively evaluate our proposed techniques, we construct a test collection based on the simulation of the in situ insights scenario using crowdsourcing in the context of a real-world e-reader application. Our experimental evaluation demonstrates the effectiveness of the proposed techniques.

Session 9A: Streams

Islands in the Stream: A Study of Item Recommendation within an Enterprise Social Stream BIBAFull-Text 665-674
  Ido Guy; Roy Levin; Tal Daniel; Ella Bolshinsky
Social streams allow users to receive updates from their network by syndicating social media activity. These streams have become a popular way to share and consume information both on the web and in the enterprise. With so much activity going on, filtering and personalizing the stream for individual users is a key challenge. In this work, we study the recommendation of enterprise social stream items through a user survey with 510 participants, conducted within a globally distributed organization. In the survey, participants rated their level of interest and surprise for different items from the stream and could also indicate whether they were already familiar with the item. Thus, our evaluation goes beyond the common accuracy measure and examines aspects of serendipity and novelty. We also inspect how various features of the recommended item, its author, and reader, influence its ratings. Our results shed light on the key factors that make a stream item valuable to its reader within the enterprise.
Evaluating Streams of Evolving News Events BIBAFull-Text 675-684
  Gaurav Baruah; Mark D. Smucker; Charles L. A. Clarke
People track news events according to their interests and available time. For a major event of great personal interest, they might check for updates several times an hour, taking time to keep abreast of all aspects of the evolving event. For minor events of more marginal interest, they might check back once or twice a day for a few minutes to learn about the most significant developments. Systems generating streams of updates about evolving events can improve user performance by appropriately filtering these updates, making it easy for users to track events in a timely manner without undue information overload. Unfortunately, predicting user performance on these systems poses a significant challenge. Standard evaluation methodology, designed for Web search and other adhoc retrieval tasks, adapts poorly to this context. In this paper, we develop a simple model that simulates users checking the system from time to time to read updates. For each simulated user, we generate a trace of their activities alternating between away times and reading times. These traces are then applied to measure system effectiveness. We test our model using data from the TREC 2013 Temporal Summarization Track (TST) comparing it to the effectiveness measures used in that track. The primary TST measure corresponds most closely with a modeled user that checks back once a day on average for an average of one minute. Users checking more frequently for longer times may view the relative performance of participating systems quite differently. In light of this sensitivity to user behavior, we recommend that future experiments be built around clearly stated assumptions regarding user interfaces and access patterns, with effectiveness measures reflecting these assumptions.
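The simulated-user idea in this abstract, a trace that alternates between away times and reading times, can be sketched as follows. This is a minimal illustration only: the exponential distributions, the function name simulate_trace, and its parameters are assumptions made here, not details taken from the paper.

```python
import random


def simulate_trace(mean_away, mean_read, horizon, seed=0):
    """Return (start, end) reading sessions, in seconds, within [0, horizon].

    Assumption: away and reading durations are drawn from exponential
    distributions with the given means. The abstract does not say which
    distributions the authors use.
    """
    rng = random.Random(seed)
    t = 0.0
    sessions = []
    while True:
        # The user is away for a random period before checking back.
        t += rng.expovariate(1.0 / mean_away)
        if t >= horizon:
            break
        # The user then reads updates for a random period.
        end = min(t + rng.expovariate(1.0 / mean_read), horizon)
        sessions.append((t, end))
        t = end
    return sessions


# A user who checks back about once a day for about one minute,
# simulated over ten days.
day = 24 * 3600.0
trace = simulate_trace(mean_away=day, mean_read=60.0, horizon=10 * day)
```

Varying mean_away and mean_read gives users who check more or less often, which is the kind of sensitivity to user behavior the abstract reports.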

Session 9B: Cards

Information Retrieval as Card Playing: A Formal Model for Optimizing Interactive Retrieval Interface BIBAFull-Text 685-694
  Yinan Zhang; Chengxiang Zhai
We propose a novel formal model for optimizing interactive information retrieval interfaces. To model interactive retrieval in a general way, we frame the task of an interactive retrieval system as choosing a sequence of interface cards to present to the user. At each interaction lap, the system's goal is to choose an interface card that can maximize the expected gain of relevant information for the user while minimizing the effort of the user with consideration of the user's action model and any desired constraints on the interface card. We show that such a formal interface card model can not only cover the Probability Ranking Principle for Interactive Information Retrieval as a special case by making multiple simplification assumptions, but also be used to derive a novel formal interface model for adaptively optimizing navigational interfaces in a retrieval system. Experimental results show that the proposed model is effective in automatically generating adaptive navigational interfaces, which outperform the baseline pre-designed static interfaces.
From Queries to Cards: Re-ranking Proactive Card Recommendations Based on Reactive Search History BIBAFull-Text 695-704
  Milad Shokouhi; Qi Guo
The growing accessibility of mobile devices has substantially reformed the way users access information. While the reactive search by query remains as common as before, recent years have witnessed the emergence of various proactive systems such as Google Now and Microsoft Cortana. In these systems, relevant content is presented to users based on their context without a query. Interestingly, despite the increasing popularity of such services, there is very little known about how users interact with them.
   In this paper, we present the first study on user interactions with information cards. We demonstrate that the usage patterns of these cards vary depending on time and location. We also show that while overall different topics are clicked by users on proactive and reactive platforms, the topics of the clicked documents by the same user tend to be consistent cross-platform. Furthermore, we propose a supervised framework for re-ranking proactive cards based on the user's context and past history. To train our models, we use the viewport duration and clicks to infer pseudo-relevance labels for the cards. Our results suggest that the quality of card ranking can be significantly improved particularly when the user's reactive search history is leveraged and matched against the proactive data about the cards.

Short Papers

Using Sensor Metadata Streams to Identify Topics of Local Events in the City BIBAFull-Text 711-714
  M-Dyaa Albakour; Craig Macdonald; Iadh Ounis
In this paper, we study the emerging Information Retrieval (IR) task of local event retrieval using sensor metadata streams. Sensor metadata streams include information such as the crowd density from video processing, audio classifications, and social media activity. We propose to use these metadata streams to identify the topics of local events within a city, where each event topic corresponds to a set of terms representing a type of event such as a concert or a protest. We develop a supervised approach that is capable of mapping sensor metadata observations to an event topic. In addition to using a variety of sensor metadata observations about the current status of the environment as learning features, our approach incorporates additional background features to model cyclic event patterns. Through experimentation with data collected from two locations in a major Spanish city, we show that our approach markedly outperforms an alternative baseline. We also show that modelling background information improves event topic identification.
StarSum: A Simple Star Graph for Multi-document Summarization BIBAFull-Text 715-718
  Mohammed Al-Dhelaan
Graph-based approaches for multi-document summarization have been widely used to extract top sentences for a summary. Traditionally, the documents' cluster is modeled as a graph of the cluster's sentences only, which might limit the ability to recognize topically discriminative sentences with regard to other clusters. In this paper, we propose StarSum, a star bipartite graph which models sentences and their topic signature phrases. The approach ensures sentence similarity and content importance from the graph structure. We extract sentences in an approach that guarantees diversity and coverage, which are crucial for multi-document summarization. Despite the simplicity of the approach in ranking, a DUC experiment shows the effectiveness of StarSum compared to different baselines.
When Relevance Judgement is Happening?: An EEG-based Study BIBAFull-Text 719-722
  Marco Allegretti; Yashar Moshfeghi; Maria Hadjigeorgieva; Frank E. Pollick; Joemon M. Jose; Gabriella Pasi
Relevance is a central notion in Information Retrieval, but it is considered to be a difficult concept to define. We analyse brain signals for the first 800 milliseconds (ms) of a relevance assessment process to answer the question "when relevance is happening in the brain?" with the belief that it will lead to better operational definitions of relevance. For this purpose, we devised a user study in which we captured the brain response of 20 participants. Using a 64-channel EEG device, we measured the electrophysiological activity of the brain while the subjects were in the phase of giving an explicit judgement about the relevance of presented images according to a given topic. Analyses were then performed over different time windows of the recorded EEG signals using repeated measures ANOVA. Data reveal significant variation between relevance and non-relevance within the EEG signals from the presentation of the image to 800 milliseconds afterwards. At an early stage these differences were located at frontal and posterior electrode sites. However, at later stages these differences were located in central, centro-parietal and centro-frontal areas. Our findings are an important step towards (i) a better understanding of the concept of relevance and (ii) more effective implicit feedback systems.
Search Engine Evaluation based on Search Engine Switching Prediction BIBAFull-Text 723-726
  Olga Arkhipova; Lidia Grauer; Igor Kuralenok; Pavel Serdyukov
In this paper we present a novel application of the search engine switching prediction model for online evaluation. We propose a new metric pSwitch for A/B-testing, which allows us to evaluate the quality of search engines in different aspects such as the quality of the user interface and the quality of the ranking function. pSwitch is a search session-level metric, which relies on the predicted probability that the session contains a switch to another search engine and reflects the degree of the failure of the session. We demonstrate the effectiveness and validity of pSwitch using A/B-testing experiments with real users of the Yandex search engine. We compare our metric with the recently proposed SpU (sessions per user) metric and other widely used query-level A/B metrics, such as Abandonment Rate and Time to First Click, which we used as our baseline metrics. We observed that the pSwitch metric is more sensitive than those baseline metrics and also that pSwitch and SpU are more consistent with ground truth than Abandonment Rate and Time to First Click.
Time-Aware Authorship Attribution for Short Text Streams BIBAFull-Text 727-730
  Hosein Azarbonyad; Mostafa Dehghani; Maarten Marx; Jaap Kamps
Identifying authors of short texts on Internet or social media based communication systems is an important tool against fraud and cybercrimes. Besides the challenges raised by the limited length of these short messages, evolving language and writing styles of authors of these texts make authorship attribution difficult. Most current short text authorship attribution approaches only address the challenge of limited text length. However, neglecting the second challenge may lead to poor performance of authorship attribution for authors who change their writing styles.
   In this paper, we analyse the temporal changes of word usage by authors of tweets and emails and based on this analysis we propose an approach to estimate the dynamicity of authors' word usage. The proposed approach is inspired by time-aware language models and can be employed in any time-unaware authorship attribution method. Our experiments on Tweets and the Enron email dataset show that the proposed time-aware authorship attribution approach significantly outperforms baselines that neglect the dynamicity of authors.
A Priori Relevance Based On Quality and Diversity Of Social Signals BIBAFull-Text 731-734
  Ismail Badache; Mohand Boughanem
Social signals (users' actions) associated with web resources (documents) can be considered as additional information that can help to estimate the a priori importance of a resource. In this paper, we are particularly interested in: first, showing the impact of the diversity of signals associated with a resource on information retrieval performance; second, studying the influence of their social network of origin on their quality. We propose to model these social features as a prior that we integrate into a language model. We evaluated the effectiveness of our approach on an IMDb dataset containing 167438 resources and their social signals collected from several social networks. Our experimental results are statistically significant and show the interest of integrating signal diversity in the retrieval process.
Document Comprehensiveness and User Preferences in Novelty Search Tasks BIBAFull-Text 735-738
  Ashraf Bah; Praveen Chandar; Ben Carterette
Different users may be attempting to satisfy different information needs while providing the same query to a search engine. Addressing that issue is the aim of Novelty and Diversity in information retrieval. The Novelty and Diversity search task models the task wherein users are interested in seeing more and more documents that are not only relevant, but also cover more aspects (or subtopics) related to the topic of interest. This is in contrast with the traditional IR task where topical relevance is the only factor in evaluating search results. In this paper, we conduct a user study where users are asked to state a preference between two documents B and C given a query and also given that they have already seen a document A. We then test a total of ten hypotheses pertaining to the relationship between the "comprehensiveness" of documents (i.e. the number of subtopics a document is relevant to) and real users' preference judgments. Our results show that users are inclined to prefer documents with higher comprehensiveness, even when the prior document A already covers more aspects than the two documents being compared, and even when the less preferred document has a higher relevance grade. In fact, users are inclined to prefer documents with higher overall aspect-coverage even in cases where B and C are relevant to the same number of novel subtopics.
Cost-Aware Result Caching for Meta-Search Engines BIBAFull-Text 739-742
  Emre Bakkal; Ismail Sengor Altingovde; Ismail Hakki Toroslu
Our goal in this paper is to design cost-aware result caching approaches for meta-search engines. We introduce different levels of eviction, namely, query-, resource- and entry-level, based on the granularity of the entries to be evicted from the cache when it is full. We also propose a novel entry-level caching approach that is tailored for the meta-search scenario and superior to alternative approaches.
From Unlabelled Tweets to Twitter-specific Opinion Words BIBAFull-Text 743-746
  Felipe Bravo-Marquez; Eibe Frank; Bernhard Pfahringer
In this article, we propose a word-level classification model for automatically generating a Twitter-specific opinion lexicon from a corpus of unlabelled tweets. The tweets from the corpus are represented by two vectors: a bag-of-words vector and a semantic vector based on word-clusters. We propose a distributional representation for words by treating them as the centroids of the tweet vectors in which they appear. The lexicon generation is conducted by training a word-level classifier using these centroids to form the instance space and a seed lexicon to label the training instances. Experimental results show that the two types of tweet vectors complement each other in a statistically significant manner and that our generated lexicon produces significant improvements for tweet-level polarity classification.
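The centroid representation described in this abstract, where a word is represented by the average of the vectors of the tweets in which it appears, can be sketched as below. The function name word_centroids, the toy data, and the choice to count each tweet once per word are assumptions made for illustration; the abstract does not fix these details.

```python
def word_centroids(tweets, tweet_vectors):
    """Map each word to the centroid of the vectors of tweets containing it.

    tweets: list of token lists; tweet_vectors: one equal-length vector
    (list of floats) per tweet. Assumption: a tweet contributes once per
    distinct word, even if the word is repeated in that tweet.
    """
    sums = {}
    counts = {}
    for tokens, vec in zip(tweets, tweet_vectors):
        for word in set(tokens):
            if word not in sums:
                sums[word] = [0.0] * len(vec)
                counts[word] = 0
            sums[word] = [a + b for a, b in zip(sums[word], vec)]
            counts[word] += 1
    return {w: [x / counts[w] for x in s] for w, s in sums.items()}


# Toy example with made-up two-dimensional tweet vectors.
tweets = [["good", "day"], ["good", "movie"], ["bad", "day"]]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centroids = word_centroids(tweets, vectors)
print(centroids["good"])  # [0.5, 0.5]
```

In the setting the abstract describes, such centroids would form the instance space for a word-level classifier, with a seed lexicon supplying the training labels.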
The Best Published Result is Random: Sequential Testing and its Effect on Reported Effectiveness BIBAFull-Text 747-750
  Ben Carterette
Reusable test collections allow researchers to rapidly test different algorithms to find the one that works "best". But because of randomness in the topic sample, or in relevance judgments, or in interactions among system components, extreme results can be seen entirely due to chance, particularly when a collection becomes very popular. We argue that the best known published effectiveness on any given collection could be measured as much as 20% higher than its "true" intrinsic effectiveness, and that there are many other systems with lower measured effectiveness that could have substantially higher intrinsic effectiveness.
Load-sensitive CPU Power Management for Web Search Engines BIBAFull-Text 751-754
  Matteo Catena; Craig Macdonald; Nicola Tonellotto
Web search engine companies require power-hungry data centers with thousands of servers to efficiently perform searches on a large scale. This permits the search engines to serve high arrival rates of user queries with low latency, but poses economic and environmental concerns due to the power consumption of the servers. Existing power saving techniques sacrifice the raw performance of a server for reduced power absorption, by scaling the frequency of the server's CPU according to its utilization. For instance, current Linux kernels include frequency governors, i.e., mechanisms designed to dynamically throttle the CPU operational frequency. However, such general-domain techniques work at the operating system level and have no knowledge about the querying operations of the server. In this work, we propose to delegate CPU power management to search engine-specific governors. These can leverage knowledge coming from the querying operations, such as the query server utilization and load. By exploiting such additional knowledge, we can appropriately throttle the CPU frequency thereby reducing the query server power consumption. Experiments are conducted upon the TREC ClueWeb09 corpus and the query stream from the MSN 2006 query log. Results show that we can reduce a server's power consumption by up to 24%, with only limited drawbacks in effectiveness w.r.t. a system running at maximum CPU frequency to promote query processing quality.
Retrieval from Noisy E-Discovery Corpus in the Absence of Training Data BIBAFull-Text 755-758
  Anirban Chakraborty; Kripabandhu Ghosh; Swapan Kumar Parui
OCR errors hurt retrieval performance to a great extent. Research has been done on modelling and correction of OCR errors. However, most of the existing systems use language dependent resources or training texts for studying the nature of errors. Not much research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose a novel algorithm for detecting OCR errors and improving retrieval performance on an E-Discovery corpus. Our contribution is two-fold: (1) identifying erroneous variants of query terms for improvement in retrieval performance, and (2) presenting a scope for a possible error-modelling in the erroneous corpus where clean ground truth text is not available for comparison. Our algorithm does not use any training data or any language specific resources such as a thesaurus. It also does not use any knowledge about the language except that the word delimiter is a blank space. The proposed approach obtained statistically significant improvements in recall over state-of-the-art baselines.
Opinion Spammer Detection in Web Forum BIBAFull-Text 759-762
  Yu-Ren Chen; Hsin-Hsi Chen
In this paper, a real case study on opinion spammer detection in a web forum is presented. We explore user profiles, maximum spamicity of first posts of users, burstiness of registration of user accounts, and frequent poster sets to build a model using an SVM with an RBF kernel and frequent itemset mining. The proposed model achieves 0.6753 precision, 0.6190 recall, and 0.6460 F1 score. The result is promising because the ratio of opinion spammers in the test set is only 0.98%.
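As a quick consistency check on the figures above, the standard F1 definition (the harmonic mean of precision P and recall R) can be applied to the reported values. The computation below uses only the rounded numbers from the abstract, so a small discrepancy in the last digit is expected.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.6753 \times 0.6190}{0.6753 + 0.6190}
    = \frac{0.8360}{1.2943}
    \approx 0.646
```

This agrees with the reported F1 score of 0.6460 to three decimal places.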
Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review BIBAFull-Text 763-766
  Gordon V. Cormack; Maura R. Grossman
Continuous active learning achieves high recall for technology-assisted review, not only for an overall information need, but also for various facets of that information need, whether explicit or implicit. Through simulations using Cormack and Grossman's TAR Evaluation Toolkit (SIGIR 2014), we show that continuous active learning, applied to a multi-faceted topic, efficiently achieves high recall for each facet of the topic. Our results assuage the concern that continuous active learning may achieve high overall recall at the expense of excluding identifiable categories of relevant information.
Time Pressure and System Delays in Information Search BIBAFull-Text 767-770
  Anita Crescenzi; Diane Kelly; Leif Azzopardi
We report preliminary results of the impact of time pressure and system delays on search behavior from a laboratory study with forty-three participants. To induce time pressure, we randomly assigned half of our study participants to a treatment condition where they were only allowed five minutes to search for each of four ad-hoc search topics. The other half of the participants were given no task time limits. For half of participants' search tasks (n=2), five second delays were introduced after queries were submitted and SERP results were clicked. Results showed that participants in the time pressure condition queried at a significantly higher rate, viewed significantly fewer documents per query, had significantly shallower hover and view depths, and spent significantly less time examining documents and SERPs. We found few significant differences in search behavior for system delay or interaction effects between time pressure and system delay. These initial results show time pressure has a significant impact on search behavior and suggest the design of search interfaces and features that support people who are searching under time pressure.
How Random Decisions Affect Selective Distributed Search BIBAFull-Text 771-774
  Zhuyun Dai; Yubin Kim; Jamie Callan
Selective distributed search is a retrieval architecture that reduces search costs by partitioning a corpus into topical shards such that only a few shards need to be searched for each query. Prior research created topical shards by using random seed documents to cluster a random sample of the full corpus. The resource selection algorithm might use a different random sample of the corpus. These random components make selective search non-deterministic. This paper studies how these random components affect experimental results. Experiments on two ClueWeb09 corpora and four query sets show that in spite of random components, selective search is stable for most queries.
Comparing Approaches for Query Autocompletion BIBAFull-Text 775-778
  Giovanni Di Santo; Richard McCreadie; Craig Macdonald; Iadh Ounis
Within a search engine, query auto-completion aims to predict the final query the user wants to enter as they type, with the aim of reducing query entry time and potentially preparing the search results in advance of query submission. There are a large number of approaches to automatically rank candidate queries for the purposes of auto-completion. However, no study exists that compares these approaches on a single dataset. Hence, in this paper, we present a comparison study between current approaches to rank candidate query completions for the user query as it is typed. Using a query-log and document corpus from a commercial medical search engine, we study the performance of 11 candidate query ranking approaches from the literature and analyze where they are effective. We show that the most effective approaches to query auto-completion are largely dependent on the number of characters that the user has typed so far, with the most effective approach differing for short and long prefixes. Moreover, we show that if personalized information is available about the searcher, this additional information can be used to more effectively rank query candidate completions, regardless of the prefix length.
Sign-Aware Periodicity Metrics of User Engagement for Online Search Quality Evaluation BIBAFull-Text 779-782
  Alexey Drutsa
Modern Internet companies improve evaluation criteria of their data-driven decision-making that is based on online controlled experiments (also known as A/B tests). The amplitude metrics of user engagement are known to be highly sensitive to service changes, but they cannot be used to determine whether the treatment effect is positive or negative. We propose to overcome this sign-agnostic issue by paying attention to the phase of the corresponding DFT sine wave. We refine the amplitude metrics of the first frequency by the phase ones and formalize our intuition in several novel overall evaluation criteria. These criteria are then verified over A/B experiments on real users of Yandex. We find that our approach holds the sensitivity level of the amplitudes and makes their changes sign-aware w.r.t. the treatment effect.
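The sign-agnostic issue can be illustrated with a minimal sketch that assumes only the textbook discrete Fourier transform. The 7-day series, the function name `first_frequency`, and the variable names are hypothetical, and this is not the paper's evaluation criteria. Two engagement series that mirror each other around the same mean have identical first-frequency amplitudes, but their phases differ by pi.

```python
import cmath
import math


def first_frequency(series):
    """Return (amplitude, phase) of the first non-zero DFT frequency of a series."""
    n = len(series)
    coeff = sum(x * cmath.exp(-2j * math.pi * k / n) for k, x in enumerate(series))
    return abs(coeff), cmath.phase(coeff)


# Hypothetical 7-day engagement pattern and its mirror image around the mean of 10.
n_days = 7
up = [10 + math.cos(2 * math.pi * k / n_days) for k in range(n_days)]
down = [10 - math.cos(2 * math.pi * k / n_days) for k in range(n_days)]

amp_up, phase_up = first_frequency(up)
amp_down, phase_down = first_frequency(down)

# The amplitudes coincide; only the phase distinguishes the two directions.
print(f"amplitudes: {amp_up:.3f} vs {amp_down:.3f}")
print(f"phases:     {phase_up:.3f} vs {phase_down:.3f}")
```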
Modelling Term Dependence with Copulas BIBAFull-Text 783-786
  Carsten Eickhoff; Arjen P. de Vries; Thomas Hofmann
Many generative language and relevance models assume conditional independence between the likelihood of observing individual terms. This assumption is obviously naive, but also hard to replace or relax. There are only very few term pairs that actually show significant conditional dependencies while the vast majority of co-located terms has no implications on the document's topical nature or relevance towards a given topic. It is exactly this situation that we capture in a formal framework: A limited number of meaningful dependencies in a system of largely independent observations. Making use of the formal copula framework, we describe the strength of causal dependency in terms of a number of established term co-occurrence metrics. Our experiments based on the well known ClueWeb'12 corpus and TREC 2013 topics indicate significant performance gains in terms of retrieval performance when we formally account for the dependency structure underlying pieces of natural language text.
Modeling Website Topic Cohesion at Scale to Improve Webpage Classification BIBAFull-Text 787-790
  Dhivya Eswaran; Paul N. Bennett; Joseph J. Pfeiffer, III
Considerable work in web page classification has focused on incorporating the topical structure of the web (e.g., the hyperlink graph) to improve prediction accuracy. However, the majority of work has primarily focused on relational or graph-based methods that are impractical to run at scale or in an online environment. This raises the question of whether it is possible to leverage the topical structure of the web while incurring nearly no additional prediction-time cost. To this end, we introduce an approach which adjusts a page content-only classification from that obtained with a global prior to the posterior obtained by incorporating a prior which reflects the topic cohesion of the site. Using ODP data, we empirically demonstrate that our approach yields significant performance increases over a range of topics.
Topic-centric Classification of Twitter User's Political Orientation BIBAFull-Text 791-794
  Anjie Fang; Iadh Ounis; Philip Habel; Craig Macdonald; Nut Limsopatham
In the recent Scottish Independence Referendum (hereafter, IndyRef), Twitter offered a broad platform for people to express their opinions, with millions of IndyRef tweets posted over the campaign period. In this paper, we aim to classify people's voting intentions by the content of their tweets -- their short messages communicated on Twitter. By observing tweets related to the IndyRef, we find that people not only discussed the vote, but raised topics related to an independent Scotland including oil reserves, currency, nuclear weapons, and national debt. We show that the views communicated on these topics can inform us of the individuals' voting intentions ("Yes" -- in favour of Independence vs. "No" -- Opposed). In particular, we argue that an accurate classifier can be designed by leveraging the differences in the features' usage across different topics related to voting intentions. We demonstrate improvements upon a Naive Bayesian classifier using the topics enrichment method. Our new classifier identifies the closest topic for each unseen tweet, based on those topics identified in the training data. Our experiments show that our Topics-Based Naive Bayesian classifier improves accuracy by 7.8% over the classical Naive Bayesian baseline.
Word Embedding based Generalized Language Model for Information Retrieval BIBAFull-Text 795-798
  Debasis Ganguly; Dwaipayan Roy; Mandar Mitra; Gareth J. F. Jones
Word2vec, a state-of-the-art word embedding technique, has gained a lot of interest in the NLP community. The embedding of the word vectors helps to retrieve a list of words that are used in similar contexts with respect to a given word. In this paper, we focus on using the word embeddings for enhancing retrieval effectiveness. In particular, we construct a generalized language model, where the mutual independence between a pair of words (say t and t') no longer holds. Instead, we make use of the vector embeddings of the words to derive the transformation probabilities between words. Specifically, the event of observing a term t in the query from a document d is modeled by two distinct events, that of generating a different term t', either from the document itself or from the collection, respectively, and then eventually transforming it to the observed query term t. The first event of generating an intermediate term from the document is intended to capture how well a term contextually fits within a document, whereas the second one of generating it from the collection aims to address the vocabulary mismatch problem by taking into account other related terms in the collection. Our experiments, conducted on the standard TREC collection, show that our proposed method yields significant improvements over LM and LDA-smoothed LM baselines.
A Head-Weighted Gap-Sensitive Correlation Coefficient BIBAFull-Text 799-802
  Ning Gao; Douglas Oard
Information retrieval systems rank documents, and shared-task evaluations yield results that can be used to rank information retrieval systems. Comparing rankings in ways that can yield useful insights is thus an important capability. When making such comparisons, it is often useful to give greater weight to comparisons near the head of a ranked list than to what happens further down. This is the focus of the widely used τAP measure. When scores are available, gap-sensitive measures give greater weight to larger differences than to smaller ones. This is the focus of the widely used Pearson correlation measure (ρ). This paper introduces a new measure, τGAP, which combines both features. System comparisons from the TREC 5 Ad Hoc track are used to illustrate the differences in emphasis achieved by τAP, ρ, and the proposed τGAP.
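To make the head-weighting idea concrete, here is a minimal sketch of the existing τAP measure as we understand its standard definition: for each item below the top of the candidate ranking, take the fraction of items placed above it that the reference ranking also places above it, average these fractions, and rescale to [-1, 1]. The function name `tau_ap` and the toy rankings are ours. The proposed τGAP is not implemented here, since its formula is not given in the abstract.

```python
def tau_ap(reference, candidate):
    """AP rank correlation of `candidate` against `reference`.

    Both arguments are best-first lists containing the same items without ties.
    """
    n = len(candidate)
    if n < 2:
        raise ValueError("need at least two items")
    ref_pos = {item: rank for rank, item in enumerate(reference)}
    total = 0.0
    for i in range(1, n):
        item = candidate[i]
        # Count items above position i that the reference also ranks above `item`.
        correct = sum(1 for above in candidate[:i] if ref_pos[above] < ref_pos[item])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


reference = ["a", "b", "c", "d"]
head_swap = ["b", "a", "c", "d"]  # one adjacent swap at the head of the list
tail_swap = ["a", "b", "d", "c"]  # one adjacent swap at the tail of the list

print(tau_ap(reference, head_swap), tau_ap(reference, tail_swap))
```

In this toy case both candidate rankings contain exactly one discordant pair, so Kendall's τ would score them equally, while the sketch above penalizes the swap at the head more heavily.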
On Term Selection Techniques for Patent Prior Art Search BIBAFull-Text 803-806
  Mona Golestan Far; Scott Sanner; Mohamed Reda Bouadjenek; Gabriela Ferraro; David Hawking
In this paper, we investigate the influence of term selection on retrieval performance on the CLEF-IP prior art test collection, using the Description section of the patent query with Language Model (LM) and BM25 scoring functions. We find that an oracular relevance feedback system that extracts terms from the judged relevant documents far outperforms the baseline and performs twice as well on MAP as the best competitor in CLEF-IP 2010. We find a very clear term selection value threshold for use when choosing terms. We also noticed that most of the useful feedback terms are actually present in the original query and hypothesized that the baseline system could be substantially improved by removing negative query terms. We tried four simple automated approaches to identify negative terms for query reduction but we were unable to notably improve on the baseline performance with any of them. However, we show that a simple, minimal interactive relevance feedback approach where terms are selected from only the first retrieved relevant document outperforms the best result from CLEF-IP 2010 suggesting the promise of interactive methods for term selection in patent prior art search.
Automatic Feature Generation on Heterogeneous Graph for Music Recommendation BIBAFull-Text 807-810
  Chun Guo; Xiaozhong Liu
Online music streaming services (MSS) have experienced exponential growth over the past decade. The giant MSS providers have not only built massive music collections with metadata, they have also accumulated large amounts of heterogeneous data generated by users, e.g., listening history, comments, bookmarks, and user-generated playlists. While various kinds of user data can potentially be used to enhance music recommendation performance, most existing studies have focused only on audio content features and collaborative filtering approaches based on simple user listening history or music ratings. In this paper, we propose a novel approach to solve the music recommendation problem by means of heterogeneous graph mining. Meta-path based features are automatically generated from a content-rich heterogeneous graph schema with 6 types of nodes and 16 types of relations. Meanwhile, we use a learning-to-rank approach to integrate different features for music recommendation. Experimental results show that the automatically generated graphical features significantly (p<0.0001) enhance a state-of-the-art collaborative filtering algorithm.
Differences in Eye-Tracking Measures Between Visits and Revisits to Relevant and Irrelevant Web Pages BIBAFull-Text 811-814
  Jacek Gwizdka; Yinglong Zhang
This short paper presents initial results from a project, in which we investigated differences in how users view relevant and irrelevant Web pages on their visits and revisits. The users' viewing of Web pages was characterized by eye-tracking measures, with a particular attention paid to changes in pupil size. The data was collected in a lab-based experiment, in which users (N=32) conducted assigned information search tasks on Wikipedia. We performed non-parametric tests of significance as well as classification. Our findings demonstrate differences in eye-tracking measures on visits and revisits to relevant and irrelevant pages and thus indicate a feasibility of predicting perceived Web document relevance from eye-tracking data. In particular, relative changes in pupil size differed significantly in almost all conditions. Our work extends results from previous studies to more realistic search scenarios and to Web page visits and revisits.
Reducing Hubness: A Cause of Vulnerability in Recommender Systems BIBAFull-Text 815-818
  Kazuo Hara; Ikumi Suzuki; Kei Kobayashi; Kenji Fukumizu
It is known that memory-based collaborative filtering systems are vulnerable to shilling attacks. In this paper, we demonstrate that hubness, which occurs in high dimensional data, is exploited by the attacks. Hence we explore methods for reducing hubness in user-response data to make these systems robust against attacks. Using the MovieLens dataset, we empirically show that the two methods for reducing hubness by transforming a similarity matrix, (i) centering and (ii) conversion to a commute time kernel, can thwart attacks without degrading the recommendation performance.
Modularity-Based Query Clustering for Identifying Users Sharing a Common Condition BIBAFull-Text 819-822
  Maayan Gal-On Harel; Elad Yom-Tov
We present an algorithm for identifying users who share a common condition from anonymized search engine logs. Input to the algorithm is a set of seed phrases that identify users with the condition of interest with high precision albeit at a very low recall. We expand the set of seed phrases by clustering queries according to the pages users clicked following these queries and the temporal ordering of queries within sessions, emphasizing the subgraph containing seed phrases. To this end, we extend modularity-based clustering such that it uses the information in the initial seed phrases as well as other queries of users in the population of interest. We evaluate the performance of the proposed method on two datasets, one of mood disorders and the other of anorexia, by classifying users according to the clusters in which they appeared and the phrases contained therein, and show that the area under the receiver operating characteristic curve (AUC) obtained by these methods exceeds 0.87. These results demonstrate the value of our algorithm both for identifying users for future research and for gaining a better understanding of the language associated with the condition.
Understanding Temporal Query Intent BIBAFull-Text 823-826
  Mohammed Hasanuzzaman; Sriparna Saha; Gaël Dias; Stéphane Ferrari
Understanding the temporal orientation of web search queries is an important issue for the success of information access systems. In this paper, we propose a multi-objective ensemble learning solution that (1) allows queries to be classified accurately by their temporal intent and (2) identifies a set of performing solutions, thus offering a wide range of possible applications. Experiments show that a correct representation of the problem can lead to great classification improvements when compared to recent state-of-the-art solutions and baseline ensemble techniques.
On the Reusability of Open Test Collections BIBAFull-Text 827-830
  Seyyed Hadi Hashemi; Charles L. A. Clarke; Adriel Dean-Hall; Jaap Kamps; Julia Kiseleva
Creating test collections for modern search tasks is increasingly more challenging due to the growing scale and dynamic nature of content, and need for richer contextualization of the statements of request. To address these issues, the TREC Contextual Suggestion Track explored an open test collection, where participants were allowed to submit any web page as a result for a personalized venue recommendation task. This prompts the question on the reusability of the resulting test collection: How does the open nature affect the pooling process? Can participants reliably evaluate variant runs with the resulting qrels? Can other teams evaluate new runs reliably? In short, does the set of pooled and judged documents effectively produce a post hoc test collection? Our main findings are the following: First, while there is a strongly significant rank correlation, the effect of pooling is notable and results in underestimation of performance, implying the evaluation of non-pooled systems should be done with great care. Second, we extensively analyze impacts of open corpus on the fraction of judged documents, explaining how low recall affects the reusability, and how the personalization and low pooling depth aggravate that problem. Third, we outline a potential solution by deriving a fixed corpus from open web submissions.
Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis BIBAFull-Text 831-834
  Stefan Heindorf; Martin Potthast; Benno Stein; Gregor Engels
We report on the construction of the Wikidata Vandalism Corpus WDVC-2015, the first corpus for vandalism in knowledge bases. Our corpus is based on the entire revision history of Wikidata, the knowledge base underlying Wikipedia. Among Wikidata's 24 million manual revisions, we have identified more than 100,000 cases of vandalism. An in-depth corpus analysis lays the groundwork for research and development on automatic vandalism detection in public knowledge bases. Our analysis shows that 58% of the vandalism revisions can be found in the textual portions of Wikidata, and the remainder in structural content, e.g., subject-predicate-object triples. Moreover, we find that some vandals also target Wikidata content whose manipulation may impact content displayed on Wikipedia, revealing potential vulnerabilities. Given today's importance of knowledge bases for information systems, this shows that public knowledge bases must be used with caution.
About the 'Compromised Information Need' and Optimal Interaction as Quality Measure for Search Interfaces BIBAFull-Text 835-838
  Eduard C. Hoenkamp
Taylor's concept of levels of information need has been cited in over a hundred IR publications since his work was first published. It concerns the phases a searcher goes through, starting with the feeling that information seems missing, to expressing a query to the system that hopefully will provide that information. As every year more IR publications reference Taylor's work, but none of these so much as attempt to formalize the concept they use, it is doubtful that the term is always used with the same connotation. Hence we propose a formal definition of levels of information need, as especially in IR with its formal underpinnings, there is no excuse to leave frequently used terms undefined.
   We cast Taylor's informally defined levels of information need -- and the transitions between them -- as an evolving dynamical system subsuming two subsystems: the searcher and the search engine. This moves the focus from optimizing the search engine to optimizing the search interface. We define the quality of an interface by how much users need to compromise in order to fill their information need. We show how a theoretical optimum can be calculated that assumes the least compromise from the user.
   This optimum can be used to establish a base-line for measuring how much a search interface deviates from the ideal, given actual search behavior, and by the same token offers a measure of comparison among competing interfaces.
I See You: Person-of-Interest Search in Social Networks BIBAFull-Text 839-842
  Hsun-Ping Hsieh; Cheng-Te Li; Rui Yan
Searching for a particular person by specifying her name is one of the essential functions in online social networking services such as Facebook. Often, however, a user would like to find a person but knows only a few social labels about the target, such as interests, skills, hometown, school, employment, etc. Assuming each user is associated with a set of social labels, we propose a novel search in online social networks, Person-of-Interest (POI) Search, which aims to find a list of desired targets based on a set of user-specified query labels that depict the targets. We develop a greedy heuristic graph search algorithm, which finds targets who not only cover the query labels, but also either have better social interactions with peers or have higher social proximity to the user. Experiments conducted on Facebook and Twitter datasets show satisfactory accuracy and encourage more advanced efforts on POI search.
Towards Quantifying the Impact of Non-Uniform Information Access in Collaborative Information Retrieval BIBAFull-Text 843-846
  Nyi Nyi Htun; Martin Halvey; Lynne Baillie
The majority of research into Collaborative Information Retrieval (CIR) has assumed a uniformity of information access and visibility between collaborators. However, in a number of real-world scenarios (e.g., security and health), information access is not uniform between all collaborators in a team. This can be referred to as Multi-Level Collaborative Information Retrieval (MLCIR). To the best of our knowledge, there has not yet been any systematic investigation of the effect of MLCIR on search outcomes. To address this shortcoming, in this paper, we present the results of a simulated evaluation conducted over 4 different non-uniform information access scenarios and 3 different collaborative search strategies. Results indicate that there is some tolerance to removing access to the collection and that there may not always be a negative impact on performance. We also highlight how different access scenarios and search strategies impact search outcomes.
Features of Disagreement Between Retrieval Effectiveness Measures BIBAFull-Text 847-850
  Timothy Jones; Paul Thomas; Falk Scholer; Mark Sanderson
Many IR effectiveness measures are motivated from intuition, theory, or user studies. In general, most effectiveness measures are well correlated with each other. But, what about where they don't correlate? Which rankings cause measures to disagree? Are these rankings predictable for particular pairs of measures? In this work, we examine how and where metrics disagree, and identify differences that should be considered when selecting metrics for use in evaluating retrieval systems.
Subsequence Search in Event-Interval Sequences BIBAFull-Text 851-854
  Orestis Kostakis; Aristides Gionis
We study the problem of subsequence search in databases of event-interval sequences, or e-sequences. In contrast to sequences of instantaneous events, e-sequences contain events that have a duration. In Information Retrieval applications, e-sequences are used for American Sign Language. We show that the subsequence-search problem is NP-hard and provide an exact (worst-case exponential) algorithm. We extend our algorithm to handle different cases of subsequence matching with errors. We then propose the Relation Index, a scheme for speeding up exact retrieval, which we benchmark against several indexing schemes.
Searcher in a Strange Land: Understanding Web Search from Familiar and Unfamiliar Locations BIBAFull-Text 855-858
  Elad Kravi; Eugene Agichtein; Ido Guy; Yaron Kanza; Avihai Mejer; Dan Pelleg
With mobile devices, web search is no longer limited to specific locations. People conduct search from practically anywhere, including at home, at work, when traveling and when on vacation. How should this influence search tools and web services? In this paper, we argue that information needs are affected by the familiarity of the environment. To formalize this idea, we propose a new contextualization model for activities on the web. The model distinguishes between a search from a familiar place (F-search) and a search from an unfamiliar place (U-search). We formalize the notion of familiarity, and propose a method to identify familiar places. An analysis of a query log of millions of users, demonstrates the differences between search activities in familiar and in unfamiliar locations. Our novel take on search contextualization has the potential to improve web applications, such as query autocompletion and search personalization.
Evaluating Retrieval Models through Histogram Analysis BIBAFull-Text 859-862
  Kriste Krstovski; David A. Smith; Michael J. Kurtz
We present a novel approach for efficiently evaluating the performance of retrieval models and introduce two evaluation metrics: Distributional Overlap (DO), which compares the clustering of scores of relevant and non-relevant documents, and Histogram Slope Analysis (HSA), which examines the log of the empirical distributions of relevant and non-relevant documents. Unlike rank evaluation metrics such as mean average precision (MAP) and normalized discounted cumulative gain (NDCG), DO and HSA only require calculating model scores of queries and a fixed sample of relevant and non-relevant documents rather than scoring the entire collection, even implicitly by means of an inverted index. In experimental meta-evaluations, we find that HSA achieves high correlation with MAP and NDCG on a monolingual and a cross-language document similarity task; on four ad-hoc web retrieval tasks; and on an analysis of ten TREC tasks from the past ten years. In addition, when evaluating latent Dirichlet allocation (LDA) models on document similarity tasks, HSA achieves better correlation with MAP and NDCG than perplexity, an intrinsic metric widely used with topic models.
Inter-Category Variation in Location Search BIBAFull-Text 863-866
  Chia-Jung Lee; Nick Craswell; Vanessa Murdock
When searching for place entities such as businesses or points of interest, the desired place may be close (finding the nearest ATM) or far away (finding a hotel in another city). Understanding the role of distance in predicting user interests can guide the design of location search and recommendation systems. We analyze a large dataset of location searches on GPS-enabled mobile devices with 15 location categories. We model user-location distance based on raw geographic distance (kilometers) and intervening opportunities (nth closest). Both models are helpful in predicting user interests, with the intervening opportunity model performing somewhat better. We find significant inter-category variation. For instance, the closest movie theater is selected in 17.7% of cases, while the closest restaurant in only 2.1% of cases. Overall, we recommend taking category information into account when modeling location preferences of users in search and recommendation systems.
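The two distance notions contrasted above can be sketched as features of a single selection: the raw great-circle distance to the chosen place, and its nearness rank among same-category candidates (the number of intervening opportunities plus one). The coordinates, place names, and function names below are hypothetical, and the haversine formula is our assumption for "raw geographic distance"; the abstract does not say how distance was computed.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def distance_features(user, candidates, chosen):
    """Return (distance in km, 1-based nearness rank) of `chosen` among `candidates`."""
    dists = {name: haversine_km(*user, *loc) for name, loc in candidates.items()}
    ordered = sorted(dists, key=dists.get)  # nearest first
    return dists[chosen], ordered.index(chosen) + 1


# Hypothetical user position and three candidate restaurants.
user = (47.6062, -122.3321)
restaurants = {
    "A": (47.6070, -122.3330),
    "B": (47.6200, -122.3500),
    "C": (47.6500, -122.4000),
}

km, nth = distance_features(user, restaurants, "B")
print(f"chosen place is {km:.2f} km away and is the #{nth} closest candidate")
```

A category-aware model could then learn, per category, how quickly selection probability falls off with either feature, which is one way to read the reported gap between movie theaters and restaurants.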
Reachability based Ranking in Interactive Image Retrieval BIBAFull-Text 867-870
  Jiyi Li
In some interactive image retrieval systems, users can select images from image search results and click to view similar or related images until they reach their targets. Existing image ranking options are based on relevance, update time, interestingness and so on. Because of inexact descriptions of user targets or unsatisfactory performance of image retrieval methods, users may not reach their targets in a single round of interaction. When we consider multi-round interactions, how to help users select the images from which targets can be reached in fewer rounds becomes a useful question. In this paper, we propose a new kind of ranking option that ranks images according to the difficulty of reaching potential targets from them. We model interactive image search behavior as navigation on an information network constructed from an image collection and an image retrieval method. We use the properties of this information network for reachability-based ranking. Experiments based on a social image collection show the efficiency of our approach.
Modeling Multi-query Retrieval Tasks Using Density Matrix Transformation BIBAFull-Text 871-874
  Qiuchi Li; Jingfei Li; Peng Zhang; Dawei Song
The quantum probabilistic framework has recently been applied to Information Retrieval (IR). A representative is the Quantum Language Model (QLM), which is developed for the ad-hoc retrieval with single queries and has achieved significant improvements over traditional language models. In QLM, a density matrix, defined on the quantum probabilistic space, is estimated as a representation of user's search intention with respect to a specific query. However, QLM is unable to capture the dynamics of user's information need in query history. This limitation restricts its further application on the dynamic search tasks, e.g., session search. In this paper, we propose a Session-based Quantum Language Model (SQLM) that deals with multi-query session search task. In SQLM, a transformation model of density matrices is proposed to model the evolution of user's information need in response to the user's interaction with search engine, by incorporating features extracted from both positive feedback (clicked documents) and negative feedback (skipped documents). Extensive experiments conducted on TREC 2013 and 2014 session track data demonstrate the effectiveness of SQLM in comparison with the classic QLM.
Predicting User Behavior in Display Advertising via Dynamic Collective Matrix Factorization BIBAFull-Text 875-878
  Sheng Li; Jaya Kawale; Yun Fu
Conversion prediction and click prediction are two important and intertwined problems in display advertising, but existing approaches usually look at them in isolation. In this paper, we aim to predict the conversion response of users by jointly examining the past purchase behavior and the click response behavior. Additionally, we model the temporal dynamics between the click response and purchase activity in a unified framework. In particular, a novel matrix factorization approach named dynamic collective matrix factorization (DCMF) is proposed to address this problem. Our model considers temporal dynamics of post-click conversions and also takes advantage of the side information of users, advertisements, and items. Experiments on a real-world marketing dataset show that our model achieves significant improvements over several baselines.
Zero-shot Image Tagging by Hierarchical Semantic Embedding BIBAFull-Text 879-882
  Xirong Li; Shuai Liao; Weiyu Lan; Xiaoyong Du; Gang Yang
Given the difficulty of acquiring labeled examples for many fine-grained visual classes, there is an increasing interest in zero-shot image tagging, aiming to tag images with novel labels that have no training examples present. Using a semantic space trained by a neural language model, the current state-of-the-art embeds both images and labels into the space, wherein cross-media similarity is computed. However, for labels of relatively low occurrence, its similarity to images and other labels can be unreliable. This paper proposes Hierarchical Semantic Embedding (HierSE), a simple model that exploits the WordNet hierarchy to improve label embedding and consequently image embedding. Moreover, we identify two good tricks, namely training the neural language model using Flickr tags instead of web documents, and using partial match instead of full match for vectorizing a WordNet node. All this lets us outperform the state-of-the-art. On a test set of over 1,500 visual object classes and 1.3 million images, the proposed model beats the current best results (18.3% versus 9.4% in hit@1).
Using Term Location Information to Enhance Probabilistic Information Retrieval BIBAFull-Text 883-886
  Baiyan Liu; Xiangdong An; Jimmy Xiangji Huang
Nouns are more important than other parts of speech in information retrieval and are more often found near the beginning or the end of sentences. In this paper, we investigate the effects of rewarding terms based on their location in sentences on information retrieval. In particular, we propose a novel Term Location (TEL) retrieval model based on BM25 to enhance probabilistic information retrieval, where a kernel-based method is used to capture term placement patterns. Experiments on five TREC datasets of varied size and content indicate that the proposed model significantly outperforms the optimized BM25 and DirichletLM in MAP over all datasets with all kernel functions, and outperforms the optimized BM25 and DirichletLM over most of the datasets in P@5 and P@20 with different kernel functions.
Learning Context-aware Latent Representations for Context-aware Collaborative Filtering BIBAFull-Text 887-890
  Xin Liu; Wei Wu
In this paper, we propose a generic framework to learn context-aware latent representations for context-aware collaborative filtering. Contextual contents are combined via a function to produce the context influence factor, which is then combined with each latent factor to derive latent representations. We instantiate the generic framework using biased Matrix Factorization as the base model. A Stochastic Gradient Descent (SGD) based optimization procedure is developed to fit the model by jointly learning the weight of each context and latent factors. Experiments conducted over three real-world datasets demonstrate that our model significantly outperforms not only the base model but also the representative context-aware recommendation models.
Exploiting User and Business Attributes for Personalized Business Recommendation BIBAFull-Text 891-894
  Kai Lu; Yi Zhang; Lanbo Zhang; Shuxin Wang
Data sparsity and cold-start are two major problems in personalized recommendation. They are especially severe in business recommendation, because business transactions are usually completed offline and customers generally do not provide ratings after a transaction. Due to these two problems, matrix factorization (MF) models, which are shown to be effective in many recommendation tasks, are likely to fail on business recommendation tasks, especially for new users and new items. In this paper, we propose an Integrated Bias and Factorization Model (IBFM), which exploits user and business attributes. The user attributes include demographic information, vote information, point-of-interests; the business attributes include check-in information, locations, business names, categories, etc. To handle the cold-start problem, we employ a sampling strategy to generate the latent factor vectors for new users and new businesses based on similar users/businesses. Our methods are evaluated on the data set used in the RecSys 2013 Yelp business rating prediction challenge. Experimental results show that our proposed methods significantly outperform several existing state-of-the-art methods. In particular, the single model IBFM performs the best in this challenge on both public and private leaderboards.
Speeding up Document Ranking with Rank-based Features BIBAFull-Text 895-898
  Claudio Lucchese; Franco Maria Nardini; Salvatore Orlando; Raffaele Perego; Nicola Tonellotto
Learning to Rank (LtR) is an effective machine learning methodology for inducing high-quality document ranking functions. Given a query and a candidate set of documents, where query-document pairs are represented by feature vectors, a machine-learned function is used to reorder this set. In this paper we propose a new family of rank-based features, which extend the original feature vector associated with each query-document pair. Indeed, since they are derived as a function of the query-document pair and the full set of candidate documents to score, rank-based features provide additional information to better rank documents and return the most relevant ones. We report a comprehensive evaluation showing that rank-based features allow us to achieve the desired effectiveness with ranking models being up to 3.5 times smaller than models not using them, with a scoring time reduction up to 70%.
Mining Measured Information from Text BIBAFull-Text 899-902
  Arun S. Maiya; Dale Visser; Andrew Wan
We present an approach to extract measured information from text (e.g., a 1370°C melting point, a BMI greater than 29.9 kg/m²). Such extractions are critically important across a wide range of domains -- especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352μm"). Finally, we present MQSearch: the realization of a search engine with full support for measured information.
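The idea of a rule-based extractor for measured quantities can be illustrated with a minimal sketch. This is not the authors' extractor or MQSearch: the unit list, the pattern, and the name `extract_quantities` are assumptions made for illustration, and a real system would need a far larger unit inventory and handling of conversion noise.

```python
import re

# Hypothetical, tiny unit inventory (illustration only).
# Escapes: \u00b0 = degree sign, \u00b2 = superscript two, \u03bc = Greek mu.
UNITS = ["\u00b0C", "kg/m\u00b2", "\u03bcm", "mm", "kg"]

# Try longer units first so that "kg/m²" is not cut short to "kg".
_UNIT_ALTERNATIVES = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

# A numeric value, optional whitespace, then one of the known units.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(" + _UNIT_ALTERNATIVES + r")")


def extract_quantities(text):
    """Return (value, unit) pairs for every measured quantity found in text."""
    return [(float(value), unit) for value, unit in QUANTITY.findall(text)]
```

Applied to the phrases quoted in the abstract, the sketch pairs each number with its unit; it does not attempt the second step of identifying the property being measured.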
An Initial Investigation into Fixed and Adaptive Stopping Strategies BIBAFull-Text 903-906
  David Maxwell; Leif Azzopardi; Kalervo Järvelin; Heikki Keskustalo
Most models, measures and simulations assume that a searcher will stop at a predetermined place in a ranked list of results. However, during the course of a search session, real-world searchers will vary and adapt their interactions with a ranked list. These interactions depend upon a variety of factors, including the content and quality of the results returned, and the searcher's information need. In this paper, we perform a preliminary simulated analysis into the influence of stopping strategies when query quality varies. Placed in the context of ad-hoc topic retrieval during a multi-query search session, we examine the influence of fixed and adaptive stopping strategies on overall performance. Surprisingly, we find that a fixed strategy can perform as well as the examined adaptive strategies, but the fixed depth needs to be adjusted depending on the querying strategy used. Further work is required to explore how well the stopping strategies reflect actual search behaviour, and to determine whether one stopping strategy is dominant.
Regularised Cross-Modal Hashing BIBAFull-Text 907-910
  Sean Moran; Victor Lavrenko
In this paper we propose Regularised Cross-Modal Hashing (RCMH), a new cross-modal hashing model that projects annotation and visual feature descriptors into a common Hamming space. RCMH optimises the hashcode similarity of related data-points in the annotation modality using an iterative three-step hashing algorithm: in the first step each training image is assigned a K-bit hashcode based on hyperplanes learnt at the previous iteration; in the second step the binary bits are smoothed by a formulation of graph regularisation so that similar data-points have similar bits; in the third step a set of binary classifiers is trained to predict the regularised bits with maximum margin. Visual descriptors are projected into the annotation Hamming space by a set of binary classifiers learnt using the bits of the corresponding annotations as labels. RCMH is shown to consistently improve retrieval effectiveness over state-of-the-art baselines.
Adapted B-CUBED Metrics to Unbalanced Datasets BIBAFull-Text 911-914
  Jose G. Moreno; Gaël Dias
B-CUBED metrics have recently been adopted in the evaluation of clustering results as well as in many other related tasks. However, this family of metrics is not well adapted when datasets are unbalanced. This issue is extremely frequent in Web results, where classes are distributed following a strong unbalanced pattern. In this paper, we present a modified version of B-CUBED metrics to overcome this situation. Results in toy and real datasets indicate that the proposed adaptation correctly considers the particularities of unbalanced cases.
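For orientation, the standard (unmodified) B-CUBED precision and recall can be sketched as follows. This shows only the baseline metric family, not the adaptation for unbalanced datasets proposed in the paper; the function name `bcubed` and the dictionary-based input format are assumptions made for illustration.

```python
from collections import Counter


def bcubed(clusters, classes):
    """Standard B-CUBED precision, recall and F1.

    clusters: dict mapping each item to its system cluster label.
    classes:  dict mapping each item to its gold class label.
    """
    items = list(clusters)
    cluster_size = Counter(clusters.values())
    class_size = Counter(classes.values())
    # Number of items sharing both a cluster and a class.
    overlap = Counter((clusters[i], classes[i]) for i in items)

    # Per-item precision: share of the item's cluster that has its class.
    precision = sum(
        overlap[(clusters[i], classes[i])] / cluster_size[clusters[i]]
        for i in items
    ) / len(items)
    # Per-item recall: share of the item's class that is in its cluster.
    recall = sum(
        overlap[(clusters[i], classes[i])] / class_size[classes[i]]
        for i in items
    ) / len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because every item contributes equally to the average, items from a dominant class drive the score, which is the behaviour on unbalanced data that the abstract identifies as problematic.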
A Time-aware Random Walk Model for Finding Important Documents in Web Archives BIBAFull-Text 915-918
  Tu Ngoc Nguyen; Nattiya Kanhabua; Claudia Niederée; Xiaofei Zhu
Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, web archives are emerging as gold-mines for content analytics of many sorts. However, supporting search, which goes beyond navigational search via URLs, is a very challenging task in these unique structures with huge, redundant and noisy temporal content. In this paper, we address the search needs of expert users such as journalists, economists or historians for discovering a topic in time: Given a query, the top-k returned results should give the best representative documents that cover the most interesting time-periods for the topic. For this purpose, we propose a novel random walk-based model that integrates relevance, temporal authority, diversity and time in a unified framework. Our preliminary experimental results on a large-scale real-world web archival collection show that our method significantly improves on state-of-the-art algorithms (i.e., PageRank) in ranking temporal web pages.
A Test Collection for Spoken Gujarati Queries BIBAFull-Text 919-922
  Douglas W. Oard; Rashmi Sankepally; Jerome White; Aren Jansen; Craig Harman
The development of a new test collection is described in which the task is to search naturally occurring spoken content using naturally occurring spoken queries. To support research on speech retrieval for low-resource settings, the collection includes terms learned by zero-resource term discovery techniques. Use of a new tool designed for exploration of spoken collections provides some additional insight into characteristics of the collection.
Discovering Experts across Multiple Domains BIBAFull-Text 923-926
  Aditya Pal
Researchers have focused on finding experts in individual domains, such as emails, forums, question answering, blogs, and microblogs. In this paper, we propose an algorithm for finding experts across these different domains. To do this, we propose an expertise framework that aims at extracting key expertise features and building a unified scoring model based on an SVM ranking algorithm. We evaluate our model on a real-world dataset and show that it is significantly better than the prior state-of-the-art.
Using Key Concepts in a Translation Model for Retrieval BIBAFull-Text 927-930
  Jae Hyun Park; W. Bruce Croft
Many queries, especially those in the form of longer questions, contain a subset of terms representing key concepts that describe the most important part of the user's information need. Detecting the key concepts in a query can be used as the basis for more effective weighting of query terms, but in this paper, we focus on a method of using the key concepts in a translation model for query expansion and retrieval. Translation models have been used previously in community-based question answering (CQA) systems in order to bridge the semantic gap between questions and the corresponding answer documents. Our method uses the key concepts of a question as the translation context and selectively applies the translation model to the secondary (non-key) parts of the question. We evaluate the proposed method using a CQA collection and show that selectively translating key and secondary concepts can significantly improve the retrieval performance compared to a baseline that applies the translation model without considering key concepts.
On the Cost of Phrase-Based Ranking BIBAFull-Text 931-934
  Matthias Petri; Alistair Moffat
Effective postings list compression techniques, and the efficiency of postings list processing schemes such as WAND, have significantly improved the practical performance of ranked document retrieval using inverted indexes. Recently, suffix array-based index structures have been proposed as a complementary tool, to support phrase searching. The relative merits of these alternative approaches to ranked querying using phrase components are, however, unclear. Here we provide: (1) an overview of existing phrase indexing techniques; (2) a description of how to incorporate recent advances in list compression and processing; and (3) an empirical evaluation of state-of-the-art suffix-array and inverted file-based phrase retrieval indexes using a standard IR test collection.
Location-Aware Model for News Events in Social Media BIBAFull-Text 935-938
  Mauricio Quezada; Vanessa Peña-Araya; Barbara Poblete
Nowadays, social media services are being used extensively as news sources and for spreading information on real-world events. Several studies have focused on detecting those events and locating them geographically. However, in order to study real-world events, for example, finding relationships between locations or detecting high impact events based on their coverage, we need more suitable models to represent events. In this work we propose a simple model to represent real-world news events using two sources of information: the locations that are mentioned in the event (where the event occurs), and the locations of users that discuss or comment on it. We then characterize a country based on the number of events in which that country is mentioned and in which it also participates. We show some applications of the model: we find clusters of news events based on the level of participation of countries, identifying global and impactful events in certain areas. Also, we show groups of similar countries, finding promising insights about their relationships. This model can be useful for finding unsuspected relations among countries based on the news coverage and country participation, identifying different levels of news coverage in the world, and finding bias in international news sources.
Exploring Opportunities to Facilitate Serendipity in Search BIBAFull-Text 939-942
  Ataur Rahman; Max L. Wilson
Serendipitously discovering new information can bring many benefits. Although we can design systems to highlight serendipitous information, serendipity cannot be easily orchestrated and is thus hard to study. In this paper, we deployed a working search engine that matched search results with Facebook 'Like' data, as a technology probe to examine naturally occurring serendipitous discoveries. Search logs and diary entries revealed the nature of these occasions in both leisure and work contexts. The findings support the use of the micro-serendipity model in search system design.
Combining Orthogonal Information in Large-Scale Cross-Language Information Retrieval BIBAFull-Text 943-946
  Shigehiko Schamoni; Stefan Riezler
System combination is an effective strategy to boost retrieval performance, especially in complex applications such as cross-language information retrieval (CLIR) where the aspects of translation and retrieval have to be optimized jointly. We focus on machine learning-based approaches to CLIR that need large sets of relevance-ranked data to train high-dimensional models. We compare these models under various measures of orthogonality, and present an experimental evaluation on two different domains (patents, Wikipedia) and two different language pairs (Japanese-English, German-English). We show that gains of over 10 points in MAP/NDCG can be achieved over the best single model by a linear combination of the models that contribute the most orthogonal information, rather than by combining the models with the best standalone retrieval performance.
Tailoring Music Recommendations to Users by Considering Diversity, Mainstreaminess, and Novelty BIBAFull-Text 947-950
  Markus Schedl; David Hauger
A shortcoming of current approaches for music recommendation is that they consider user-specific characteristics only on a very simple level, typically as some kind of interaction between users and items when employing collaborative filtering. To alleviate this issue, we propose several user features that model aspects of the user's music listening behavior: diversity, mainstreaminess, and novelty of the user's music taste. To validate the proposed features, we conduct a comprehensive evaluation of a variety of music recommendation approaches (stand-alone and hybrids) on a collection of almost 200 million listening events gathered from Last.fm. We report first results and highlight cases where our diversity, mainstreaminess, and novelty features can be beneficially integrated into music recommender systems.
Challenges of Mathematical Information Retrieval in the NTCIR-11 Math Wikipedia Task BIBAFull-Text 951-954
  Moritz Schubotz; Abdou Youssef; Volker Markl; Howard S. Cohl
Mathematical Information Retrieval concerns retrieving information related to a particular mathematical concept. The NTCIR-11 Math Task develops an evaluation test collection for retrieval of document sections from scientific articles based on human-generated topics. Those topics involve a combination of formula patterns and keywords. In addition, the optional Wikipedia Task provides a test collection for retrieval of individual mathematical formulae from Wikipedia based on search topics that contain exactly one formula pattern. We developed a framework for automatic query generation and immediate evaluation. This paper discusses our dataset preparation, topic generation and evaluation methods, and summarizes the results of the participants, with a special focus on the Wikipedia Task.
Probabilistic Multileave for Online Retrieval Evaluation BIBAFull-Text 955-958
  Anne Schuth; Robert-Jan Bruintjes; Fritjof Büttner; Joost van Doorn; Carla Groenland; Harrie Oosterhuis; Cong-Nguyen Tran; Bas Veeling; Jos van der Velde; Roger Wechsler; David Woudenberg; Maarten de Rijke
Online evaluation methods for information retrieval use implicit signals such as clicks from users to infer preferences between rankers. A highly sensitive way of inferring these preferences is through interleaved comparisons. Recently, interleaved comparisons methods that allow for simultaneous evaluation of more than two rankers have been introduced. These so-called multileaving methods are even more sensitive than their interleaving counterparts. Probabilistic interleaving -- whose main selling point is the potential for reuse of historical data -- has no multileaving counterpart yet. We propose probabilistic multileave and empirically show that it is highly sensitive and unbiased. An important implication of this result is that historical interactions with multileaved comparisons can be reused, allowing for ranker comparisons that need much less user interaction data. Furthermore, we show that our method, as opposed to earlier sensitive multileaving methods, scales well when the number of rankers increases.
Twitter Sentiment Analysis with Deep Convolutional Neural Networks BIBAFull-Text 959-962
  Aliaksei Severyn; Alessandro Moschitti
This paper describes our deep learning system for sentiment analysis of tweets. The main contribution of this work is a new model for initializing the parameter weights of the convolutional neural network, which is crucial to train an accurate model while avoiding the need to inject any additional features. Briefly, we use an unsupervised neural language model to train initial word embeddings that are further tuned by our deep learning model on a distant supervised corpus. At a final stage, the pre-trained parameters of the network are used to initialize the model. We train the latter on the supervised training data recently made available by the official system evaluation campaign on Twitter Sentiment Analysis organized by SemEval-2015. A comparison between the results of our approach and the systems participating in the challenge on the official test sets suggests that our model could be ranked in the first two positions in both the phrase-level subtask A (among 11 teams) and the message-level subtask B (among 40 teams). This is important evidence of the practical value of our solution.
Anchoring and Adjustment in Relevance Estimation BIBAFull-Text 963-966
  Milad Shokouhi; Ryen White; Emine Yilmaz
People's tendency to overly rely on prior information has been well studied in psychology in the context of anchoring and adjustment. Anchoring biases pervade many aspects of human behavior. In this paper, we present a study of anchoring bias in information retrieval (IR) settings. We provide strong evidence of anchoring during the estimation of document relevance via both human relevance judging and in natural user behavior collected via search log analysis. In particular, we show that sequential relevance judgment of documents collected for the same query could be subject to anchoring bias. That is, the human annotators are likely to assign different relevance labels to a document, depending on the quality of the last document they had judged for the same query.
   In addition to manually assigned labels, we further show that the implicit relevance labels inferred from click logs can also be affected by anchoring bias. Our experiments over the query logs of a commercial search engine suggested that searchers' interaction with a document can be highly affected by the documents visited immediately beforehand. Our findings have implications for the design of search systems and judgment methodologies that consider and adapt to anchoring effects.
Cognitive Activity during Web Search BIBAFull-Text 967-970
  Md. Hedayetul Islam Shovon; D. (Nanda) Nandagopal; Jia Tina Du; Ramasamy Vijayalakshmi; Bernadine Cocks
Searching on the Web or Net-surfing is a part of everyday life for many people, but little is known about the brain activity during Web searching. Such knowledge is essential for better understanding of the cognitive demands imposed by the search system and search tasks. The current study contributes to this understanding by constructing brain networks from EEG data using normalized transfer entropy (NTE) during three Web search task stages: query formulation, viewing of a search result list and reading each individual content page. This study further contributes to the connectivity analysis of the constructed brain networks, since it is an advanced quantitative technique which enables the exploration of brain function by distinct and varied brain areas. By using this approach, we identified that the cognitive activities during the three stages of Web searching are different, with various brain areas becoming more active during the three Web search task stages. Of note, query formulation generated higher interaction between cortical regions than viewing a result list or reading a content page. These findings will have implications for the improvement of Web search engines and search interfaces.
Personalized Semantic Ranking for Collaborative Recommendation BIBAFull-Text 971-974
  Song Xu; Shu Wu; Liang Wang
Recently a ranking view of collaborative recommendation has received much attention in recommendation systems. Most existing ranking approaches are based on the pairwise assumption, i.e., everything that has not been selected is of less interest to a user. However, this assumption does not hold in many cases. To alleviate the limitation of this assumption, in this work, we present a unified framework, named Personalized Semantic Ranking (PSR). PSR models the personalized ranking and the user-generated content (UGC) simultaneously, and the semantic information extracted from UGC can remedy the pairwise assumption. Moreover, utilizing the semantic information, PSR can capture more subtle information about the user-item interaction and alleviate the overfitting problem caused by insufficient ratings. The learned topics in PSR can also serve as proper explanations for recommendation. Experimental results show that the proposed PSR yields significant improvements over the competitive compared methods on two typical datasets.
Active Learning for Entity Filtering in Microblog Streams BIBAFull-Text 975-978
  Damiano Spina; Maria-Hendrike Peetz; Maarten de Rijke
Monitoring the reputation of entities such as companies or brands in microblog streams (e.g., Twitter) starts by selecting mentions that are related to the entity of interest. Entities are often ambiguous (e.g., "Jaguar" or "Ford") and effective methods for selectively removing non-relevant mentions often use background knowledge obtained from domain experts. Manual annotations by experts, however, are costly. We therefore approach the problem of entity filtering with active learning, thereby reducing the annotation load for experts. To this end, we use a strong passive baseline and analyze different sampling methods for selecting samples for annotation. We find that margin sampling -- an informative type of sampling that considers the distance to the hyperplane used for class separation -- can effectively be used for entity filtering and can significantly reduce the cost of annotating initial training data.
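Margin sampling, as characterized in the abstract, picks the unlabeled samples closest to the separating hyperplane. The following is a minimal sketch assuming a linear classifier with known weights; the function name `margin_sampling` and its parameters are hypothetical and do not reflect the authors' implementation.

```python
def margin_sampling(weights, bias, pool, k):
    """Return indices of the k pool samples nearest the decision hyperplane.

    weights, bias: parameters of an (assumed) linear classifier w.x + b.
    pool: list of unlabeled feature vectors.
    """
    def margin(x):
        # Unnormalized distance to the hyperplane; dividing by ||w|| would
        # rescale every value equally and leave the ranking unchanged.
        return abs(sum(w * xi for w, xi in zip(weights, x)) + bias)

    ranked = sorted(range(len(pool)), key=lambda i: margin(pool[i]))
    return ranked[:k]
```

The selected samples are those the classifier is least certain about, so labeling them first is expected to be more informative than labeling a random subset.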
Relevance-aware Filtering of Tuples Sorted by an Attribute Value via Direct Optimization of Search Quality Metrics BIBAFull-Text 979-982
  Nikita V. Spirin; Mikhail Kuznetsov; Julia Kiseleva; Yaroslav V. Spirin; Pavel A. Izhutov
Sorting tuples by an attribute value is a common search scenario and many search engines support such capabilities, e.g. price-based sorting in e-commerce, time-based sorting on a job or social media website. However, sorting purely by the attribute value might lead to poor user experience because relevance is not taken into account. Hence, at the top of the list the users might see irrelevant results. In this paper we choose a different approach. Rather than just returning the entire list of results sorted by the attribute value, we additionally suggest relevance-aware (post-)filtering of the search results. Following this approach, we develop a new algorithm based on dynamic programming that directly optimizes a given search quality metric. It can be seamlessly integrated as the final step of a query processing pipeline and provides a theoretical guarantee on optimality. We conduct a comprehensive evaluation of our algorithm on synthetic data and real learning to rank data sets. Based on the experimental results, we conclude that the proposed algorithm is superior to typically used heuristics and has a clear practical value for search and related applications.
Multi-source Information Fusion for Personalized Restaurant Recommendation BIBAFull-Text 983-986
  Jing Sun; Yun Xiong; Yangyong Zhu; Junming Liu; Chu Guan; Hui Xiong
In this paper, we study the problem of personalized restaurant recommendations. Specifically, we develop a probabilistic factor analysis framework, named RMSQ-MF, which can exploit multi-source information, such as the users' task, their friends' preferences, and human mobility patterns, for personalized restaurant recommendations. This work is motivated by two observations. First, people's preferences can be affected by their friends. Second, human mobility patterns can reflect the popularity of restaurants to a certain degree. Finally, empirical studies on real-world data demonstrate that the proposed method outperforms benchmark methods by a significant margin.
Joint Matrix Factorization and Manifold-Ranking for Topic-Focused Multi-Document Summarization BIBAFull-Text 987-990
  Jiwei Tan; Xiaojun Wan; Jianguo Xiao
Manifold-ranking has proved to be an effective method for topic-focused multi-document summarization. As the basic manifold-ranking based summarization method constructs the relationships between sentences simply by the bag-of-words cosine similarity, we believe a better similarity metric will further improve the effectiveness of manifold-ranking. In this paper, we propose a joint optimization framework, which integrates the manifold-ranking process with a similarity metric learning process. The joint framework aims at learning better sentence similarity scores and better sentence ranking scores simultaneously. Experiments on DUC datasets show the proposed joint method achieves better performance than the manifold-ranking baselines and several popular methods.
Towards Understanding the Impact of Length in Web Search Result Summaries over a Speech-only Communication Channel BIBAFull-Text 991-994
  Johanne R. Trippas; Damiano Spina; Mark Sanderson; Lawrence Cavedon
Presenting search results over a speech-only communication channel involves a number of challenges for users due to cognitive limitations and the serial nature of speech. We investigated the impact of search result summary length in speech-based web search, and compared our results to a text baseline. Based on crowdsourced workers, we found that users preferred longer, more informative summaries for text presentation. For audio, user preferences depended on the style of query. For single-facet queries, shortened audio summaries were preferred; additionally, users were found to judge relevance with similar accuracy to that achieved with text-based summaries. For multi-facet queries, user preferences were not as clear, suggesting that more sophisticated techniques are required to handle such queries.
Early Detection of Topical Expertise in Community Question Answering BIBAFull-Text 995-998
  David van Dijk; Manos Tsagkias; Maarten de Rijke
We focus on detecting potential topical experts in community question answering platforms early on in their lifecycle. We use a semi-supervised machine learning approach. We extract three types of feature: (i) textual, (ii) behavioral, and (iii) time-aware, which we use to predict whether a user will become an expert in the long term. We compare our method to a machine learning method based on a state-of-the-art method in expertise retrieval. Results on data from Stack Overflow demonstrate the utility of adding behavioral and time-aware features to the baseline method with a net improvement in accuracy of 26% for very early detection of expertise.
LBMCH: Learning Bridging Mapping for Cross-modal Hashing BIBAFull-Text 999-1002
  Yang Wang; Xuemin Lin; Lin Wu; Wenjie Zhang; Qing Zhang
Hashing has gained considerable attention for large-scale similarity search, due to its efficiency and low storage cost. In this paper, we study the problem of learning hash functions in the context of multi-modal data for cross-modal similarity search. Notwithstanding the progress achieved by existing methods, they essentially learn only one common Hamming space, where data objects from all modalities are mapped to conduct similarity search. However, such methods are unable to characterize well the flexible and discriminative local (neighborhood) structure in all modalities simultaneously, which prevents them from achieving better performance. To address this limitation, we propose to learn heterogeneous Hamming spaces, each preserving the local structure of data objects from an individual modality. Then, a novel method for learning a bridging mapping for cross-modal hashing, named LBMCH, is proposed to characterize the cross-modal semantic correspondence by seamlessly connecting these distinct Hamming spaces. Meanwhile, the local structure of each data object in a modality is preserved by constructing an anchor-based representation, enabling LBMCH to achieve linear complexity w.r.t. the size of the training set. The efficacy of LBMCH is experimentally validated on real-world cross-modal datasets.
Gibberish, Assistant, or Master?: Using Tweets Linking to News for Extractive Single-Document Summarization BIBAFull-Text 1003-1006
  Zhongyu Wei; Wei Gao
Single-document summarization is a challenging task. In this paper, we explore effective ways of using the tweets linking to news to generate an extractive summary of each document. We reveal the basic value of tweets, which can be utilized by regarding every tweet as a vote for candidate sentences. Based on this finding, we resort to unsupervised summarization models that leverage the linking tweets to master the ranking of candidate extracts via random walk on a heterogeneous graph. The advantage is that we can use the linking tweets to opportunistically "supervise" the summarization with no need of reference summaries. Furthermore, we analyze the influence of the volume and latency of tweets on the quality of output summaries since tweets come after news release. Compared to a truly supervised summarizer unaware of tweets, our method achieves significantly better results with a reasonably small tradeoff on latency; compared to the same summarizer using tweets as auxiliary features, our method is comparable while needing fewer tweets and much less time to achieve a significant improvement.
Context-aware Point-of-Interest Recommendation Using Tensor Factorization with Social Regularization BIBAFull-Text 1007-1010
  Lina Yao; Quan Z. Sheng; Yongrui Qin; Xianzhi Wang; Ali Shemshadi; Qi He
Point-of-Interest (POI) recommendation is a new type of recommendation task that has come along with the prevalence of location-based social networks in recent years. Compared with traditional tasks, it focuses more on personalized, context-aware recommendation results to provide a better user experience. To address this new challenge, we propose a Collaborative Filtering method based on Non-negative Tensor Factorization, a generalization of the Matrix Factorization approach that exploits a high-order tensor instead of the traditional User-Location matrix to model multi-dimensional contextual information. The factorization of this tensor leads to a compact model of the data which is especially suitable for context-aware POI recommendations. In addition, we fuse users' social relations as regularization terms of the factorization to improve the recommendation accuracy. Experimental results on real-world datasets demonstrate the effectiveness of our approach.
Adaptive User Engagement Evaluation via Multi-task Learning BIBAFull-Text 1011-1014
  Hamed Zamani; Pooya Moradi; Azadeh Shakery
The user engagement evaluation task in social networks has recently attracted considerable attention due to its applications in recommender systems. In this task, posts containing users' opinions about items, e.g., tweets containing users' ratings of movies on the IMDb website, are studied. In this paper, we try to make use of tweets from different web applications to improve user engagement evaluation performance. To this end, we propose an adaptive method based on multi-task learning. Since we study the problem of detecting tweets with positive engagement, which is a highly imbalanced classification problem, we modify the loss function of multi-task learning algorithms to cope with the imbalanced data. Our evaluations over a dataset including tweets from four diverse and popular data sources, i.e., IMDb, YouTube, Goodreads, and Pandora, demonstrate the effectiveness of the proposed method. Our findings suggest that transferring knowledge between data sources can improve user engagement evaluation performance.
Compact Snippet Caching for Flash-based Search Engines BIBAFull-Text 1015-1018
  Rui Zhang; Pengyu Sun; Jiancong Tong; Rebecca Jean Stones; Gang Wang; Xiaoguang Liu
In response to a user query, search engines return the top-k relevant results, each of which contains a small piece of text, called a snippet, extracted from the corresponding document. Obtaining a snippet is time consuming as it requires both document retrieval (disk access) and string matching (CPU computation), so caching of snippets is used to reduce latency. With the trend of using flash-based solid state drives (SSDs) instead of hard disk drives for search engine storage, the bottleneck of snippet generation shifts from I/O to computation. We propose a simple, but effective method for exploiting this trend, which we call fragment caching: instead of caching the whole snippet, we only cache snippet metadata which describe how to retrieve the snippet from the document. While this approach increases I/O time, the cost is insignificant on SSDs. The major benefit of fragment caching is the ability to cache the same snippets (without loss of quality) while only using a fraction of the memory the traditional method requires. In our experiments, we find around 10 times less memory is required to achieve comparable snippet generation times for dynamic memory, and we consistently achieve a vastly greater hit ratio for static caching.
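The fragment caching idea in the abstract above can be illustrated with a minimal sketch. It assumes that the cached metadata is a list of (start, length) character offsets into the document; the names SnippetMeta, build_meta and render, and the simple first-match term search, are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SnippetMeta:
    """Hypothetical cache entry: records where the snippet lives, not its text."""
    doc_id: int
    spans: Tuple[Tuple[int, int], ...]  # (start, length) character offsets


def build_meta(doc_id: int, text: str, terms: List[str], window: int = 20) -> SnippetMeta:
    """Find the first occurrence of each query term and record a window around it."""
    lowered = text.lower()
    spans = []
    for term in terms:
        pos = lowered.find(term.lower())
        if pos >= 0:
            start = max(0, pos - window)
            end = min(len(text), pos + len(term) + window)
            spans.append((start, end - start))
    return SnippetMeta(doc_id, tuple(spans))


def render(meta: SnippetMeta, docs: Dict[int, str]) -> str:
    """Rebuild the snippet from stored offsets; this costs one document read."""
    text = docs[meta.doc_id]
    return " ... ".join(text[s:s + n] for s, n in meta.spans)


# Usage: the cache holds a few integers per snippet instead of the snippet string.
docs = {1: "Flash-based solid state drives shift the bottleneck of snippet "
           "generation from I/O to computation."}
cache = {("bottleneck", 1): build_meta(1, docs[1], ["bottleneck"], window=5)}
snippet = render(cache[("bottleneck", 1)], docs)
```

On a cache hit the string matching is skipped and only the document read remains, which is the cost the abstract describes as insignificant on SSDs.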
When Personalization Meets Conformity: Collective Similarity based Multi-Domain Recommendation BIBAFull-Text 1019-1022
  Xi Zhang; Jian Cheng; Shuang Qiu; Zhenfeng Zhu; Hanqing Lu
Existing recommender systems place emphasis on personalization to achieve promising accuracy. However, in the context of multiple domains, users are likely to follow the same behaviors as domain authorities. This conformity effect provides a wealth of prior knowledge when it comes to multi-domain recommendation, but has not been fully exploited. In particular, users whose behaviors are significantly similar to the public tastes can be viewed as domain authorities. To detect these users while embedding conformity into recommendation, a domain-specific similarity matrix is intuitively employed. A collective similarity is thereby obtained to balance conformity with personalization. In this paper, we establish a Collective Structure Sparse Representation (CSSR) method for multi-domain recommendation. Based on an adaptive k-Nearest-Neighbor framework, we impose the lasso and group lasso penalties as well as a least squares loss to jointly optimize the collective similarity. Experimental results on real-world data confirm the effectiveness of the proposed method.
Sub-document Timestamping of Web Documents BIBAFull-Text 1023-1026
  Yue Zhao; Claudia Hauff
Knowledge about a (Web) document's creation time has been shown to be an important factor in various temporal information retrieval settings. Commonly, it is assumed that such documents were created at a single point in time. While this assumption may hold for news articles and similar document types, it is a clear oversimplification for general Web documents. In this paper, we investigate to what extent (i) this simplifying assumption is violated for a corpus of Web documents, and, (ii) it is possible to accurately estimate the creation time of individual Web documents' components (so-called sub-documents).

Demonstrations

DINFRA: A One Stop Shop for Computing Multilingual Semantic Relatedness BIBAFull-Text 1027-1028
  Siamak Barzegar; Juliano Efson Sales; Andre Freitas; Siegfried Handschuh; Brian Davis
This demonstration presents an infrastructure for computing multilingual semantic relatedness and correlation for twelve natural languages by using three distributional semantic models (DSMs). Our demonstrator -- DInfra (Distributional Infrastructure) -- provides researchers and developers with a highly useful platform for processing large-scale corpora and conducting experiments with distributional semantics. We integrate several multilingual DSMs in our web service so end users can obtain a result without worrying about the complexities involved in building DSMs. Our web service gives users easy access to a wide range of comparisons of DSMs with different parameters. In addition, users can configure and access DSM parameters using an easy-to-use API.
VenueMusic: A Venue-Aware Music Recommender System BIBAFull-Text 1029-1030
  Zhiyong Cheng; Jialie Shen
Users' music preferences can be greatly influenced by their location and nearby environment. In this demonstration, we present an intelligent music recommender system, called VenueMusic, that automatically identifies suitable music for various popular venues in our daily lives. VenueMusic has a set of useful features: i) a music concept sequence generation scheme and a Location-aware Topic Model (LTM), proposed to map the characteristics of venues and music into a latent semantic space where the suitability of music for a venue can be directly measured, ii) a smart interface enabling the user to interact smoothly with VenueMusic, and iii) high-quality music playlists. The demonstration will show several interesting use cases of VenueMusic and illustrate its superiority in recommending music based on where the user is.
Shiny on Your Crazy Diagonal BIBAFull-Text 1031-1032
  Giorgio Maria Di Nunzio
In this demo, we present a web application which allows users to interact with two retrieval models, namely the Binary Independence Model (BIM) and the BM25 model, on a standard TREC collection. The goal of this demo is to give students deeper insight into the consequences of modeling assumptions (BIM vs. BM25) and the consequences of tuning parameter values by means of a two-dimensional representation of probabilities. The application was developed in R, and it is accessible at the following link: http://gmdn.shinyapps.io/shinyRF04.
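For readers unfamiliar with the two models named in the abstract above, the following is a minimal sketch of their common textbook scoring forms. It is not the demo's R implementation: the function names are hypothetical, the BIM weight shown is the usual simplification with no relevance information, and the defaults k1 = 1.2 and b = 0.75 are conventional assumptions.

```python
import math
from typing import Dict, List


def rsj_weight(N: int, n: int) -> float:
    """Robertson/Sparck Jones term weight without relevance information."""
    return math.log((N - n + 0.5) / (n + 0.5))


def bim_score(query: List[str], doc: List[str], df: Dict[str, int], N: int) -> float:
    """BIM: only the presence or absence of a query term in the document matters."""
    present = set(doc)
    return sum(rsj_weight(N, df[t]) for t in set(query) if t in present and t in df)


def bm25_score(query: List[str], doc: List[str], df: Dict[str, int], N: int,
               avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25: the same term weight, scaled by saturated, length-normalized tf."""
    dl = len(doc)
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        if tf == 0 or t not in df:
            continue
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        score += rsj_weight(N, df[t]) * tf_part
    return score


# Two toy documents of equal length; only the second repeats the query term.
df = {"search": 10}
N = 1000
doc_a = ["search", "engine"]
doc_b = ["search", "search"]
bim_a, bim_b = bim_score(["search"], doc_a, df, N), bim_score(["search"], doc_b, df, N)
bm25_a = bm25_score(["search"], doc_a, df, N, avgdl=2.0)
bm25_b = bm25_score(["search"], doc_b, df, N, avgdl=2.0)
```

Under these assumptions BIM cannot distinguish the two documents, while BM25 rewards the repeated term, which is one consequence of the differing modeling assumptions.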
CricketLinking: Linking Event Mentions from Cricket Match Reports to Ball Entities in Commentaries BIBAFull-Text 1033-1034
  Manish Gupta
The 2011 Cricket World Cup final match was watched by around 135 million people. Such a huge viewership demands a great experience for users of online cricket portals. Many portals like espncricinfo.com host a variety of content related to recent matches including match reports and ball-by-ball commentaries. When reading a match report, reader experience can be significantly improved by augmenting (on demand) the event mentions in the report with detailed commentaries. We build an event linking system CricketLinking which first identifies event mentions from the reports and then links them to a set of balls. Finding linkable mentions is challenging because unlike entity linking problem settings, we do not have a concrete set of event entities to link to. Further, depending on the event type, event mentions could be linked to a single ball, or to a set of balls. Hence, identifying mention type as well as linking becomes challenging. We use a large number of domain specific features to learn classifiers for mention and mention type detection. Further, we leverage structured match, context similarity and sequential proximity to perform accurate linking. Finally, context based summarization is performed to provide a concise briefing of linked balls to each mention.
An Aspect-driven Social Media Explorer BIBAFull-Text 1035-1036
  Nedim Lipka; W. Bruce Croft
We demonstrate an exploration tool that organizes social media content under diverse aspects enabling comprehensive explorations. Unlike existing approaches that group content by trending topics, we present a holistic view of diverse and relevant content with respect to a given query.
ERICA: Expert Guidance in Validating Crowd Answers BIBAFull-Text 1037-1038
  Nguyen Quoc Viet Hung; Duong Chi Thang; Matthias Weidlich; Karl Aberer
Crowdsourcing has become an essential tool for a broad range of Web applications. Yet, the wide-ranging levels of expertise of crowd workers, as well as the presence of faulty workers, call for quality control of the crowdsourcing result. To this end, many crowdsourcing platforms feature a post-processing phase in which crowd answers are validated by experts. This approach incurs high costs, though, since expert input is a scarce resource. To support the expert in the validation process, we present a tool for ExpeRt guidance In validating Crowd Answers (ERICA). It allows us to guide the expert's work by collecting input on the most problematic cases, thereby achieving a set of high-quality answers even if the expert does not validate the complete answer set. The tool also supports the task requester in selecting the most cost-efficient allocation of the budget between the expert and the crowd.
Large-scale Image Retrieval using Neural Net Descriptors BIBFull-Text 1039-1040
  David Novak; Michal Batko; Pavel Zezula
Galean: Visualization of Geolocated News Events from Social Media BIBAFull-Text 1041-1042
  Vanessa Peña-Araya; Mauricio Quezada; Barbara Poblete
Online Social Networks (OSN) have changed the way information is produced and consumed. Organizing and retrieving unstructured data extracted from these platforms is not an easy task. Galean is a visual and interactive tool that aims to help journalists and historians, among others, analyze news events discussed on Twitter. In this tool, news events are visually represented by the countries where the news originated, the date when they happened and their impact in the OSN. Galean considers countries as entities, as opposed to mere geographical locations as in most state-of-the-art tools. As a consequence, it allows users to explore and retrieve news not only by their geographical and temporal context, but also by the relationships among countries. With this tool, users can search for behavioral patterns of news events and observe how countries are associated in specific events. We expect our work to become a public tool that helps conduct historical analyses of social media news coverage over time.
SciNet: Interactive Intent Modeling for Information Discovery BIBAFull-Text 1043-1044
  Tuukka Ruotsalo; Jaakko Peltonen; Manuel J. A. Eugster; Dorota Glowacka; Aki Reijonen; Giulio Jacucci; Petri Myllymäki; Samuel Kaski
Current search engines offer limited assistance for exploration and information discovery in complex search tasks. Instead, users are distracted by the need to focus their cognitive efforts on finding navigation cues, rather than selecting relevant information. Interactive intent modeling enhances the human information exploration capacity through computational modeling, visualized for interaction. Interactive intent modeling has been shown to increase task-level information seeking performance by up to 100%. In this demonstration, we showcase SciNet, a system implementing interactive intent modeling on top of a scientific article database of over 60 million documents.
Linse: A Distributional Semantics Entity Search Engine BIBAFull-Text 1045-1046
  Juliano Efson Sales; André Freitas; Siegfried Handschuh; Brian Davis
Entering 'Football Players from United States' when searching for 'American Footballers' is an example of vocabulary mismatch, which occurs when different words are used to express the same concepts. In order to address this phenomenon for entity search targeting descriptors for complex categories, we propose a compositional-distributional semantics entity search engine, which extracts semantic and commonsense knowledge from large-scale corpora to address the vocabulary gap between query and data.
Online News Tracking for Ad-Hoc Queries BIBAFull-Text 1047-1048
  Jeroen B. P. Vuurens; Arjen P. de Vries; Roi Blanco; Peter Mika
Following news about a specific event can be a difficult task as new information is often scattered across web pages. An up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We demonstrate an approach that is feasible for online tracking of news that is relevant to a user's ad-hoc query.
DUMPLING: A Novel Dynamic Search Engine BIBAFull-Text 1049-1050
  Andrew Jie Zhou; Jiyun Luo; Hui Yang
In this demo paper, we introduce a new search engine that supports Information Retrieval (IR) in a dynamic setting. A dynamic search engine distinguishes itself by handling rich interactions and temporal dependency among the queries in a session or for a task. The proposed search engine is called Dumpling, named after the development team's favorite food. It implements state-of-the-art dynamic search algorithms and provides: (i) a dynamic search toolkit integrating the Query Change Retrieval Model (QCM) and the Win-win search algorithm; (ii) a user-friendly interface supporting side-by-side comparison of search results given by a state-of-the-art static search algorithm and the proposed dynamic search algorithms; and (iii) APIs for developers to apply the dynamic search algorithms to index and search over custom datasets. Dumpling is developed under the umbrella of a bigger project in the DARPA Memex program to crawl and search the dark web to support law enforcement and national security.

Doctoral Consortium

Promoting User Engagement and Learning in Amorphous Search Tasks BIBAFull-Text 1051
  Piyush Arora
Much research in information retrieval (IR) focuses on optimization of the rank of relevant retrieval results for single shot ad hoc IR tasks. Relatively little research has been carried out on user engagement to support more complex search tasks. We seek to improve user engagement for IR tasks by providing richer representation of retrieved information. It is our expectation that this strategy will promote implicit learning within search activities. Specifically, we plan to explore methods of finding semantic concepts within retrieved documents, with the objective of creating improved document surrogates. Further, we would like to study search effectiveness in terms of different facets such as the user's search experience, satisfaction, engagement and learning. We intend to investigate this in an experimental study, where our richer document representations are compared with the traditional document surrogates for the same user queries.
Cross-Platform Question Routing for Better Question Answering BIBAFull-Text 1053
  Mossaab Bagdouri
The last two decades have seen an increasing interest in the task of question answering (QA). Earlier approaches focused on automated retrieval and extraction models. Recent developments have more focus on community driven QA. This work addresses this task through cross-platform question routing. We study question types as well as the answers that can be gathered from different platforms. After developing new evaluation measures, we optimize for various constraints of the user needs. We consider models that work for the general public, before adapting them to some special demographics (Arab journalists).
Time Pressure in Information Search BIBAFull-Text 1055
  Anita Crescenzi
The primary purpose of this research is to explore the impact of perceived time pressure on search behaviors, searcher perceptions of the search system and the search experience. Are there observable behavioral changes when a searcher is time-pressured? To what extent are search behavior differences attributable to objective experimental manipulation versus to the subjective experience of time pressure? An important secondary purpose of this work is to identify appropriate outcome measures that allow for the comparison of session-level search behaviors when time is manipulated.
Controversy Detection and Stance Analysis BIBAFull-Text 1057
  Shiri Dori-Hacohen
Alerting users about controversial search results can encourage critical literacy, promote healthy civic discourse and counteract the "filter bubble" effect. Additionally, presenting information to the user about the different stances or sides of the debate can help her navigate the landscape of search results. Our existing work made strides in the emerging niche of controversy detection and analysis; we propose further work on automatic stance detection.
Using Contextual Information to Understand Searching and Browsing Behavior BIBAFull-Text 1059
  Julia Kiseleva
There is great imbalance in the richness of information on the web and the succinctness and poverty of search requests of web users, making their queries only a partial description of the underlying complex information needs. Finding ways to better leverage contextual information and make search context-aware holds the promise to dramatically improve the search experience of users. We conducted a series of studies to discover, model and utilize contextual information in order to understand and improve users' searching and browsing behavior on the web. Our results capture important aspects of context under the realistic conditions of different online search services, aiming to ensure that our scientific insights and solutions transfer to the operational settings of real world applications.
Transfer Learning for Information Retrieval BIBFull-Text 1061
  Pengfei Li
Enhancing Mathematics Information Retrieval BIBFull-Text 1063
  Martin Líška
Improving Search using Proximity-Based Statistics BIBFull-Text 1065
  Xiaolu Lu
Spoken Conversational Search: Information Retrieval over a Speech-only Communication Channel BIBFull-Text 1067
  Johanne R. Trippas
Finding Answers in Web Search BIBAFull-Text 1069
  Evi Yulianti
There are many informational queries that could be answered with a text passage, thereby not requiring the searcher to access the full web document. When building manual annotations of answer passages for TREC queries, Keikha et al. [6] confirmed that many such queries can be answered with just passages. By presenting the answers directly in the search result page, user information needs are addressed more rapidly, which reduces user interaction (clicks) with the search result page [3] and has a significant positive effect on user satisfaction [2, 7]. In the context of general web search, the problem of finding answer passages has not been explored extensively. Retrieving relevant passages has been studied in the TREC HARD track [1] and in INEX [5], but relevant passages are not required to contain answers. One of the tasks in the TREC genomics track [4] was to find answer passages in biomedical literature. Previous work has shown that current passage retrieval methods that focus on topical relevance are not effective at finding answers [6]. Therefore, more knowledge is required to identify answers in a document. Bernstein et al. [2] studied an approach to extracting inline direct answers for search results using a paid crowdsourcing service. Such an approach, however, is expensive and not practical to apply to all possible information needs. A fully automatic process for finding answers remains a research challenge.
   The aim of this thesis is to find passages in documents that contain answers to a user's query. In this research, we propose to use a summarization technique that takes advantage of Community Question Answering (CQA) content. In our previous work, we showed the benefit of using social media to generate more accurate summaries of web documents [8], but this was not designed to present answers in the summary. With the high volume of questions and answers posted in CQA, we believe that many questions previously asked in CQA are the same as or related to actual web queries, and that their best answers can guide us in extracting answers from the document. As initial work, we proposed using term distributions extracted from the best answers to top matching questions in one of the leading CQA sites, Yahoo! Answers (Y!A), to generate answer summaries. An experiment comparing our summaries with reference answers built in previous work [6] found some level of success. A manuscript has been prepared for this result.
   Next, as an extension of our work above, we were interested to see whether the documents that have better quality answer summaries should be ranked higher in the result list. A set of features are derived from answer summaries to re-rank documents in the result list. Our experiment shows that answer summaries can be used to improve state-of-the-art document ranking. The method is also shown to outperform a current re-ranking approach using comprehensive document quality features. A manuscript was submitted for this result.
   For future work, we plan to conduct a deeper analysis of top matching questions and their corresponding best answers from Y!A to better understand their benefit to the generated summaries and re-ranking results. For example, how do the results differ across relevance levels of the top best answers from Y!A that were used to generate summaries? There are also opportunities to improve the use of Y!A in generating answer summaries, such as by predicting the quality of best answers from Y!A corresponding to the query. We also aim to combine the related Y!A pages into our initial result list when there is a question from Y!A that is well matched with the query. Next, it is important to think about an approach to generating answer summaries for queries that do not have related results from CQA.
It is our great pleasure to welcome you to the SIGIR Symposium on Information Retrieval in Practice (SIRIP 2015). The goal of SIRIP is to bring together information retrieval researchers, practitioners, analysts, and consumers, and to achieve knowledge transfer across these boundaries. It is our hope that everyone who attends SIRIP walks away with new understanding and at least one new idea to think about or explore. SIRIP 2015 is held on the third day of the main SIGIR conference, on Wednesday, August 12, 2015, in Santiago, Chile. SIRIP is attended by people registered for the main conference as well as other interested practitioners who choose to register for it alone. The SIRIP program consists of three hour-and-a-half-long sessions that include a mix of invited presentations, refereed papers, and a panel presentation.

Industry Track Invited Talks

From Web Search Relevance to Vertical Search Relevance BIBAFull-Text 1073
  Yi Chang
Web search relevance is a billion dollar challenge, while latecomers face a disadvantage in web search competition. Vertical search results can be incorporated to enrich web search content; vertical search relevance is therefore critical to providing differentiated search results. Machine learning based ranking algorithms have shown their effectiveness for both web search and vertical search tasks. In this talk, the speaker will not only introduce state-of-the-art ranking algorithms for web search, but also cover the challenges of improving the relevance of various vertical search engines: local search, shopping search, news search, etc.
Finding Money in the Haystack: Information Retrieval at Bloomberg BIBAFull-Text 1075
  Jonathan J. Dorando; Konstantine Arkoudas; Parth Vasa; Gary Kazantsev; Gideon Mann
The financial markets are a rich domain for search, and it is not simple to serve the entire scope of financial professionals, who make their living on accurate, timely, and deep information. The data sources are many and disparate. They include domains with rich structured data such as company and security attributes, textual data like research reports, and time-sensitive news stories. Not only is the domain complicated, but some of the techniques that work for web search have to be adapted and reconsidered in an enterprise context with fewer eyeballs but just as complicated questions. At Bloomberg, we have been addressing these problems over the past four years in the search and discoverability group, heavily leveraging insights from the academic and open-source communities and applying them to our problems. We'll discuss our efforts in Natural Language Question & Answer (NLQA), learning to rank, federated search, and crowdsourcing, and how this all comes together to make search effective for our users.
If SIGIR had an Academic Track, What Would Be In It? BIBAFull-Text 1077
  David Hawking
It used to be the case that very little industry research was presented at SIGIR. Now the balance has radically changed -- many accepted papers have industry authors and many rely on industry data sets -- to the extent that a leading academic member of the SIGIR community has light-heartedly proposed the creation of an Academic Track.
   Behind the levity lies the important question of how a researcher can make a meaningful contribution to the field, in the absence of petabyte-scale sets of documents and massive user-interaction logs. Theoretical contributions can revolutionize thinking, but have greatest impact when applicable in practice, and when empirically validated.
   In my years at Funnelback and more recently at Microsoft I have been very aware of high-impact but not-well-solved IR problems involving relatively tiny datasets. Many of them are characterized by sparsity of user interaction data and are hence not well-suited to simple machine learning approaches or to large scale A/B testing. My talk will illustrate and attempt to characterize these problems and to suggest fruitful areas for academic research.
   If time permits, I will mention some areas in which academic research has contributed to current large-scale industry practice.
WeChat Search & Headline: Sogou Joins Force with Tencent on Mobile Search BIBAFull-Text 1079
  Chao Liu
Tencent Inc. is the biggest social network company in China. Its WeChat and QQ boast 700 million and 800 million monthly active users (MAU), respectively. Sogou Inc., on the other hand, is a search leader in China, ranking No. 2 and No. 3 in the mobile and PC search markets, respectively. This talk introduces how Tencent and Sogou join forces in the battle for mobile search. Specifically, we discuss two products, namely WeChat Search and WeChat Headline, and illustrate how they leverage the strengths of both companies to take the market. We further dive under the hood and examine technical problems and solutions. In particular, we focus on unique aspects due to the particularities of WeChat, e.g., millions of articles generated from about 1 million official accounts, which are forwarded and broadcast by hundreds of millions of users across the world. Some problems, e.g., de-duplication and ranking, might be similar to traditional IR, but some other aspects are not. In the end, we put forward some challenges for open discussion, and solicit comments from academia and fellow practitioners.
Structure, Personalization, Scale: A Deep Dive into LinkedIn Search BIBAFull-Text 1081
  Asif Makhani
All of us are familiar with search as users. And as software engineers, many of us have worked on search problems in the context of web search, site search, or enterprise search. But search at LinkedIn is different. Our corpus is a richly structured professional graph comprised of 364M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Our members perform billions of searches (over 5.7B in 2012), and each of those searches is highly personalized based on the searcher's identity and relationships with other professional entities in LinkedIn's economic graph. And all this data is in constant flux as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of our members are outside the United States). As a result, we've built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique challenges we've faced as we deliver highly personalized search over semi-structured data at massive scale.
Location in Search BIBAFull-Text 1083
  Vanessa Murdock
As users turn increasingly to handheld devices to find information, the research community has focused on real-time location signals (GPS signals) to improve search engine effectiveness. Location signals have been investigated for predicting businesses the user will frequent [3], assigning geographic coordinates to media files [1], and improving mobile search ranking [2]. While the increased focus on real-time user location has produced excellent research, there remains a gap between the capabilities being developed in the research community and the capabilities being developed by commercial search engines. The core of this discrepancy between the advances in research and advances in industry is understanding the user's location. The vast majority of research on user location assumes that the user's location is known, because the user has provided a GPS signal. For many systems, there is no GPS signal available. The user may choose not to enable it, or the system may choose not to prompt the user for the location because doing so degrades the user experience. For these interactions, the system relies on the user's IP address for location information. Further, much of the current research uses public geocoded data such as Foursquare (http://www.foursquare.com, visited June 2015) and Twitter (http://www.twitter.com, visited June 2015). These data are an incomplete picture of places a user may visit, and are potentially biased in their representation of actual users. The information contained in these data is not the same type of information typically available to a commercial search engine.
   In this talk we discuss gaps between current research on location, and industry advances in using location signals to improve search results. We focus on user location as one example of a gap between research and development.
Challenges and Opportunities in Online Evaluation of Search Engines BIBAFull-Text 1085
  Pavel Serdyukov
Yandex is one of the largest Internet companies in Europe, operating Russia's most popular search engine, generating 58.6% of all search traffic in Russia (as of April 2015). Like all modern search engines, Yandex increasingly relies on online evaluation methods such as A/B tests and interleaving. These online evaluation methods test various changes in the search engine by analyzing the changes in the character of its interactions with its users. There are several grand challenges in online evaluation, including the choice of an appropriate online metric and the need to deal with the limited number of user interactions available to a search engine for experimentation. In my talk, I will give an overview of our latest research on improving the sensitivity of well-known online metrics, on the discovery of more sensitive and robust online metrics, and on the scheduling and early stopping of online experiments.
Lower Search Cost BIBAFull-Text 1087
  Dou Shen
Web search is actually a pretty heavy task for most users since people need to launch a search engine's portal, phrase the right query and then go through search results to find the right information or service. To lower the search cost, commercial search engines have been improved in many ways, including query suggestion, relevant search, knowledge graph, ranking algorithm, user interface, and so on. I will briefly explain the progress along these features, especially for the largest Chinese search engine -- Baidu. In addition to these approaches, another important way to lower search cost is to make Web search ready whenever a user intends to start a search, which becomes more important with the popularity of mobile devices. I will talk about the progress along this direction and the technologies behind it as well.

Industry Track Refereed Papers

Practical Lessons for Gathering Quality Labels at Scale BIBAFull-Text 1089-1092
  Omar Alonso
Information retrieval researchers and engineers use human computation as a mechanism to produce labeled data sets for product development, research and experimentation. To gather useful results, a successful labeling task relies on many different elements: clear instructions, user interface guidelines, representative high-quality datasets, appropriate inter-rater agreement metrics, work quality checks, and channels for worker feedback. Furthermore, designing and implementing tasks that produce and use several thousands or millions of labels is different than conducting small scale research investigations. In this paper we present a perspective for collecting high quality labels with an emphasis on practical problems and scalability. We focus on three main topics: programming crowds, debugging tasks with low agreement, and algorithms for quality control. We show examples from an industrial setting.
Incremental Sampling of Query Logs BIBAFull-Text 1093-1096
  Ricardo Baeza-Yates
We introduce a simple technique to generate incremental query log samples that mimic the original query distribution well. In this way, editorial judgments for new queries can be consistently added to previous judgments. We also review the problem of how to choose the sample size depending on the types of queries that need to be detected as well as the conditions needed to get a good sample.
Where to Go on Your Next Trip?: Optimizing Travel Destinations Based on User Preferences BIBAFull-Text 1097-1100
  Julia Kiseleva; Melanie J. I. Mueller; Lucas Bernardi; Chad Davis; Ivan Kovacek; Mats Stafseng Einarsen; Jaap Kamps; Alexander Tuzhilin; Djoerd Hiemstra
Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in large scale online experiments in a real-world application. Specifically, we focus on recommending travel destinations at Booking.com, a major online travel site, to users searching for their preferred vacation activities. To build ranking models we use multi-criteria rating data provided by previous users after their stay at a destination. We implement three methods and compare them to the current baseline in Booking.com: random, most popular, and Naive Bayes. Our general conclusion is that, in an online A/B test with live users, our Naive-Bayes based ranker increased user engagement significantly over the current online system.
Bringing Order to the Job Market: Efficient Job Offer Categorization in E-Recruitment BIBAFull-Text 1101-1104
  Emmanuel Malherbe; Mario Cataldi; Andrea Ballatore
E-recruitment uses a range of web-based technologies to find, evaluate, and hire new personnel for organizations. A crucial challenge in this arena lies in the categorization of job offers: candidates and operators often explore and analyze large numbers of offers and profiles through a set of job categories. To date, recruitment organizations define job categories top-down, relying on standardized vocabularies that often fail to capture new skills and requirements that emerge from dynamic labor markets. In order to support e-recruitment, this paper presents a dynamic, bottom-up method to automatically enrich and revise job categories. The method detects novel, highly characterizing terms in a corpus of job offers, leading to a more effective categorization, and is evaluated on real-world data by Multiposting (http://www.multiposting.fr/en), a large French e-recruitment firm.

Tutorials

Building and Using Models of Information Seeking, Search and Retrieval: Full Day Tutorial BIBAFull-Text 1107-1110
  Leif Azzopardi; Guido Zuccon
Understanding how people interact with information systems when searching is central to the study of Interactive Information Retrieval (IIR). While much of the prior work in this area has either been conceptual, observational or empirical, recently there has been renewed interest in developing mathematical models of information seeking and search. This is because such models can provide a concise and compact representation of search behaviours and naturally generate testable hypotheses about search behaviour. This full day tutorial focuses on explaining and building formal models of Information Seeking and Retrieval. The tutorial is structured into four sessions. In the first session we will discuss the rationale of modelling and examine a number of early formal models of search (including early cost models and the Probability Ranking Principle). Then we will examine more contemporary formal models (including Information Foraging Theory, the Interactive Probability Ranking Principle, and Search Economic Theory). The focus will be on the insights and intuitions that we can glean from the math behind these models. The latter sessions will be dedicated to building models that optimise particular objectives which drive how users make decisions, along with a how-to guide on model building, where we will describe different techniques (including analytical, graphical and computational) that can be used to generate hypotheses from such models. In the final session, participants will be challenged to develop a simple model of interaction applying the techniques learnt during the day, before concluding with an overview of challenges and future directions.
   This tutorial is aimed at participants wanting to know more about the various formal models of information seeking, search and retrieval that have been proposed in the literature. The tutorial will be presented at an introductory level, and is designed to support participants who want to be able to understand and apply such models, as well as to build their own models.
Advanced Click Models and their Applications to IR: SIGIR 2015 Tutorial BIBAFull-Text 1111-1112
  Aleksandr Chuklin; Ilya Markov; Maarten de Rijke
This tutorial is concerned with more advanced and more recent topics in the area of click models. Here, we discuss recent developments in the area with a particular focus on applications of click models. The tutorial features a guest talk and a live demo where participants have a chance to build their own advanced click model.
   While this is the second part of the two half-day tutorials, participants are not required to attend the first one. At the beginning of this part, a short introduction to basic click models will be given so that all participants share a common vocabulary. Then, recent advances in click models will be discussed.
An Introduction to Click Models for Web Search: SIGIR 2015 Tutorial BIBAFull-Text 1113-1115
  Aleksandr Chuklin; Ilya Markov; Maarten de Rijke
In this introductory tutorial we give an overview of click models for web search. We show how the framework of probabilistic graphical models helps to explain user behavior, build new evaluation metrics and perform simulations. The tutorial is augmented with a live demo where participants have a chance to implement a click model and to test it on a publicly available dataset.
IR Evaluation: Modeling User Behavior for Measuring Effectiveness BIBAFull-Text 1117-1120
  Charles L. A. Clarke; Mark D. Smucker; Emine Yilmaz
This half-day tutorial on IR evaluation combines an introduction to classical IR evaluation methods with material on more recent user-oriented approaches. We primarily focus on off-line evaluation, but some material on on-line evaluation is also covered. The broad goal of the tutorial is to equip researchers with an understanding of modern approaches to IR evaluation, facilitating new research on this topic and improving evaluation methodology for emerging areas.
Information Retrieval with Verbose Queries BIBAFull-Text 1121-1124
  Manish Gupta; Michael Bendersky
Recently, the focus of many novel search applications has shifted from short keyword queries to verbose natural language queries. Examples include question answering systems and dialogue systems, voice search on mobile devices and entity search engines like Facebook's Graph Search or Google's Knowledge Graph. However, the performance of textbook information retrieval techniques for such verbose queries is not as good as that for their shorter counterparts. Thus, effective handling of verbose queries has become a critical factor for adoption of information retrieval techniques in this new breed of search applications. Over the past decade, the information retrieval community has deeply explored the problem of transforming natural language verbose queries using operations like reduction, weighting, expansion, reformulation and segmentation into more effective structural representations. However, thus far, there has not been a coherent and organized tutorial on this topic. In this tutorial, we aim to put together various research pieces of the puzzle, provide a comprehensive and structured overview of various proposed methods, and also list various application scenarios where effective verbose query processing can make a significant difference.
Revisiting the Foundations of IR: Timeless, Yet Timely BIBAFull-Text 1125-1127
  Paul B. Kantor
As we face an explosion of potential new applications for the fundamental concepts and technologies of information retrieval, ranging from ad ranking to social media, from collaborative recommending to question answering systems, many researchers are spending unnecessary time reinventing ideas and relationships that are buried in the prehistory of information retrieval (which, for many researchers, means anything published before they entered graduate school). Much of today's received wisdom may be nothing more than the fossilized residue of lively debates concerning such things as estimation of value and evaluation of systems. Returning to those discussions may open the door to genuinely new insights. On the other hand, many of the ideas that surface as "new" in today's super-heated research environment have very firm roots in earlier developments in fields as diverse as citation analysis, statistics, and pattern recognition. The purpose of this tutorial is to survey those roots, and their relation to the contemporary fruits on the tree of information retrieval, and to separate, as much as is possible in an era of increasing commercial secrecy about methods, the problems to be solved, the algorithms for solving them, and the heuristics that are the bread and butter of a working operation.
   Among the important new topics whose foundations will be explored are the use of social media in search and advertising, and the growing management of personal image collections for search and for commercial purposes.
   While some might think that an examination of the roots is of merely historical interest, it has practical value as well. When you know which earlier research has provided the origins for the things that you are interested in, you can use that fact to trace its other descendants, and often find rich and rewarding ideas in a literature that you would not normally reach, because it was not considered important by your instructors when you were learning about the problems. In addition to pattern recognition and citation analysis, the tutorial will also expose and review some of the relations to the fields of statistics and operations research.
   Participants will become familiar with roots in Pattern Analysis, Statistics, Information Science and other sources of key ideas that reappear in the current development of Information Retrieval as it applies to Search Engines, Social Media, and Collaborative Systems. They will be able to separate problems from algorithms, and algorithms from heuristics, in the application of these ideas to their own research and/or development activities. Course materials will be made available on a Web site two weeks prior to the tutorial. They will include links to relevant software; links to publications that will be discussed; and mechanisms for chat among the tutorial participants, before, during and after the tutorial.
IR Evaluation: Designing an End-to-End Offline Evaluation Pipeline BIBAFull-Text 1129-1132
  Jin Young Kim; Emine Yilmaz
This tutorial aims to provide attendees with a detailed understanding of an end-to-end evaluation pipeline based on human judgments (offline measurement). The tutorial will give an overview of the state-of-the-art methods, techniques, and metrics necessary for each stage of the evaluation process. We will mostly focus on evaluating an information retrieval (search) system, but other tasks such as recommendation and classification will also be discussed. Practical examples will be drawn both from the literature and from real-world usage scenarios in industry.
Music Retrieval and Recommendation: A Tutorial Overview BIBAFull-Text 1133-1136
  Peter Knees; Markus Schedl
In this tutorial, we give an introduction to the field of and state of the art in music information retrieval (MIR). The tutorial particularly spotlights the question of music similarity, which is an essential aspect in music retrieval and recommendation. Three factors play a central role in MIR research: (1) the music content, i.e., the audio signal itself, (2) the music context, i.e., metadata in the widest sense, and (3) the listeners and their contexts, manifested in user-music interaction traces. We review approaches that extract features from all three data sources and combinations thereof and show how these features can be used for (large-scale) music indexing, music description, music similarity measurement, and recommendation. These methods are further showcased in a number of popular music applications, such as automatic playlist generation and personalized radio stationing, location-aware music recommendation, music search engines, and intelligent browsing interfaces. Additionally, related topics such as music identification, automatic music accompaniment and score following, and search and retrieval in the music production domain are discussed.
Exploiting Wikipedia for Information Retrieval Tasks BIBAFull-Text 1137-1140
  Bracha Shapira; Nir Ofek; Victor Makarenkov
Wikipedia -- the online encyclopedia -- has long been used as a source of information for researchers, as well as being a subject of research itself. Wikipedia has been shown to be effective in recommender systems, sentiment analysis, validation and multiple domains in information retrieval. One of the reasons for Wikipedia's popularity among researchers and practitioners is the multiple types of information it contains, which enables practitioners to select the right "tool" for their respective tasks. In addition to its great potential, this multitude of information sources also poses a challenge: which sources of information are best suited for a specific problem and how can different types of data be combined? This tutorial aims to provide a holistic view of Wikipedia's different features -- text, links, categories, page views, editing history etc. -- and explore the different ways they can be utilized in a machine learning framework. By presenting and contrasting the latest works that utilize Wikipedia in multiple domains, this tutorial aims to increase the awareness among researchers and practitioners in these fields of the benefits of utilizing Wikipedia in their respective domains, in particular of the use of multiple sources of information simultaneously.

Workshops

We are pleased to introduce the Workshop Program for the 38th Annual SIGIR Conference. We received 14 workshop proposals, each of which was peer-reviewed by three members of the Workshops PC. After discussion of all submissions in the Workshops PC, as well as with the PC Chairs of the technical program, 7 workshops were accepted (50% acceptance rate). We sought to include topics that covered the breadth of expertise in the SIGIR community, would appeal to a diverse range of SIGIR attendees, and would push the state-of-the-art in IR research. We greatly appreciate all authors who submitted a proposal for consideration and all reviewers for their help in selecting which proposals to include in the program. Finally, we are grateful to Microsoft Research for providing workshop fee waivers for thirty-five students.

Web Question Answering: Beyond Factoids: SIGIR 2015 Workshop BIBFull-Text 1143
  Eugene Agichtein; David Carmel; Charles L. A. Clarke; Praveen Paritosh; Dan Pelleg; Idan Szpektor
Graph Search and Beyond: SIGIR 2015 Workshop Summary BIBAFull-Text 1145-1146
  Omar Alonso; Marti A. Hearst; Jaap Kamps
Modern Web data is highly structured in terms of entities and relations from large knowledge resources, geo-temporal references and social network structure, resulting in a massive multidimensional graph. This graph essentially unifies both the searcher and the information resources that played a fundamentally different role in traditional IR, and "Graph Search" offers major new ways to access relevant information. Graph search affects both query formulation (complex queries about entities and relations building on the searcher's context) as well as result exploration and discovery (slicing and dicing the information using the graph structure) in a completely personalized way. This new graph based approach introduces great opportunities, but also great challenges, in terms of data quality and data integration, user interface design, and privacy. We view the notion of "graph search" as searching information from your personal point of view (you are the query) over a highly structured and curated information space. This goes beyond the traditional two-term queries and ten blue links results that users are familiar with, requiring a highly interactive session covering both query formulation and result exploration. The workshop attracted a range of researchers working on this and related topics, and made concrete progress working together on one of the greatest challenges in the years to come.
SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) BIBFull-Text 1147-1148
  Jaime Arguello; Fernando Diaz; Jimmy Lin; Andrew Trotman
SIGIR 2015 Workshop on Temporal, Social and Spatially-aware Information Access (#TAIA2015) BIBAFull-Text 1149-1150
  Klaus Berberich; James Caverlee; Miles Efron; Claudia Hauff; Vanessa Murdock; Milad Shokouhi; Bart Thomee
In this workshop we aim to bring together practitioners and researchers to discuss their recent breakthroughs and the challenges with addressing spatial and temporal information access, both from the algorithmic and the architectural perspectives.
NeuroIR 2015: Neuro-Physiological Methods in IR Research BIBAFull-Text 1151-1153
  Jacek Gwizdka; Joemon Jose; Javed Mostafa; Max Wilson
This Tutorial+Workshop will discuss opportunities and challenges involved in using neuro-physiological tools/techniques (such as fMRI, fNIRS, EEG, eye-tracking, GSR, HR, and facial expressions) and theories in information retrieval. The hybrid format will engage researchers and students at different levels of expertise, from those who are active in this area to those who are interested and want to learn more. The workshop will combine presentations, discussions and tutorial elements and consist of four segments (tutorial, completed research, work-in-progress, closing panel).
SPS'15: 2015 International Workshop on Social Personalization & Search BIBFull-Text 1155
  Christoph Trattner; Denis Parra; Peter Brusilovsky; Leandro Marinho
Privacy-Preserving IR 2015: When Information Retrieval Meets Privacy and Security BIBAFull-Text 1157-1158
  Hui Yang; Ian Soboroff
Information retrieval (IR) and information privacy/security are two fast-growing computer science disciplines. There are many synergies and connections between these two disciplines. However, there have been very limited efforts to connect them. Moreover, due to the lack of mature techniques in privacy-preserving IR, concerns about information privacy and security have become serious obstacles that prevent valuable user data from being used in IR research such as studies on query logs, social media, tweets, and medical record retrieval. We propose this privacy-preserving IR workshop to connect the two disciplines of information retrieval and information privacy and security. We look forward to spurring research that aims to bring together the research fields of IR and privacy/security. Last year, the first privacy-preserving IR workshop focused on mitigating privacy threats in information retrieval by novel algorithms and tools that enable web users to better understand associated privacy risks.