
Proceedings of the 2012 ACM Conference on Information and Knowledge Management

Fullname:Proceedings of the 21st ACM International Conference on Information and Knowledge Management
Editors:Xuewen Chen; Guy Lebanon; Haixun Wang; Mohammed J. Zaki
Location:Maui, Hawaii
Dates:2012-Oct-29 to 2012-Nov-02
Standard No:ISBN: 978-1-4503-1156-4; hcibib: CIKM12
Summary:On behalf of the organizing committee, it is my genuine honor and great pleasure to welcome you to the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) in Maui, Hawaii! I hope this conference proves to be both interesting and beneficial.
    Since its inception, the CIKM conference has provided a unique international forum for the presentation, discussion and dissemination of research findings in data management, information retrieval and knowledge management. The purpose of the conference is to identify challenging problems facing the development of future knowledge and information systems and to shape future research directions through the publication of high quality, applied and theoretical research findings. The conference has been a leading forum in which experts from academia, industry and the public sector gather to exchange ideas, research achievements and technical developments in multidisciplinary research areas.
    As one of the world's most recognized conferences in the field, this year's CIKM conference has received a record high number of submissions in the history of CIKM, as can be seen from the following statistics:
* 1492 abstracts submitted
* 1088 full papers, 229 posters, and 70 demo papers submitted
* 146 papers accepted for presentation as full papers (13.4% acceptance rate) and 157 papers accepted as short papers (27.8% cumulative acceptance rate)
    The increased number of submissions alone is a great demonstration of the lively research areas that contribute to the CIKM area. In addition, CIKM 2012 will host 15 workshops on cutting-edge areas of research and a dedicated Industry Event featuring leading industrial practitioners. We are grateful to all authors who chose to submit their work to CIKM 2012 and are very excited by the final program.
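The two acceptance rates quoted in the statistics above can be reproduced with a quick calculation. The sketch assumes that both rates are taken over the 1088 full-paper submissions; the summary does not state the denominator, but the quoted percentages match under that assumption.

```python
# Figures quoted in the summary.
full_submitted = 1088
full_accepted = 146
short_accepted = 157

# Assumption: both rates use the full-paper submission count as denominator.
full_rate = full_accepted / full_submitted
cumulative_rate = (full_accepted + short_accepted) / full_submitted

print(f"{full_rate:.1%}")        # 13.4%
print(f"{cumulative_rate:.1%}")  # 27.8%
```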
    CIKM values interdisciplinary research and we are proud to present three keynote speakers for the main conference (Dr. Ricardo Baeza-Yates, Prof. William Cohen, and Prof. Jeffrey S. Vitter) and four keynote speakers for the Industry Event (Drs. Eric Brill, Raghu Ramakrishnan, Tom Malloy, and Xuedong Huang), all of whom will give presentations that cross discipline boundaries. I deeply appreciate their time and commitment to deliver their speeches and share their cutting-edge research experiences and insightful comments on their research topics.
  1. Keynote address
  2. KM track: recommender systems
  3. KM track: pattern mining
  4. IR track: evaluation methodologies
  5. IR track: social media search
  6. KM track: link and graph mining
  7. IR track: language technologies
  8. DB track: graph and knowledge base
  9. DB track: temporal, spatial and multimedia databases
  10. KM track: matrix methods and anomaly detection
  11. KM track: social networks
  12. IR track: advertising
  13. IR track: system architecture, distributed IR, scalability
  14. KM track: advertisement and products
  15. KM track: clustering
  16. IR track: recommendation systems
  17. IR track: digital libraries and citation analysis
  18. KM track: text mining
  19. IR track: formal retrieval models and learning to rank
  20. DB track: probabilistic and uncertain data
  21. DB track: top-k and nearest neighbor queries
  22. KM track: spatial and temporal methods
  23. IR track: web search
  24. DB track: web data management
  25. KM track: information extraction
  26. IR track: topic modeling and content and sentiment analysis
  27. DB track: query processing, optimization and performance
  28. KM track: classification and semantic methods
  29. IR track: multimedia and user feedback
  30. DB track: emerging and advanced topics
  31. KM track: novel applications
  32. IR track: social networks
  33. Knowledge management short paper session
  34. Information retrieval short paper session
  35. Databases short paper session
  36. Knowledge management poster session
  37. Information retrieval poster session
  38. Databases poster session
  39. Knowledge management demonstration session
  40. Information retrieval demonstration session
  41. Databases demonstration session
  42. Workshop summaries

Keynote address

User engagement: the network effect matters! 1-2
  Ricardo Baeza-Yates; Mounia Lalmas
In the online world, user engagement refers to the quality of the user experience that emphasizes the positive aspects of the interaction with a web application and, in particular, the phenomena associated with wanting to use that application longer and frequently. This definition is motivated by the observation that successful web applications are not just used, but they are engaged with. Users invest time, attention, and emotion into them.
   Online providers aim not only to engage users with each service, but across all services in their network. They spend increasing effort to direct users to various services (e.g. using hyperlinks to help users navigate to and explore other services), to increase user traffic between their services. Nothing is known about how users engage across such a network of Web sites, a phenomenon we call networked user engagement. We address this problem by combining techniques from web analytics and mining, information retrieval evaluation, and existing works on user engagement coming from the domains of information science, multimodal human computer interaction and cognitive psychology. In this way, we can combine insights from big data with deep analysis of human behavior in the lab or through crowd-sourcing experiments.
Learning similarity measures based on random walks 3
  William W. Cohen
We describe a novel learnable proximity measure based on personalized PageRank (also known as "random walk with reset"). Instead of introducing one weight per edge label, as in most prior work, we introduce one weight for each edge label sequence. We show that this approach is advantageous for a number of real-world tasks, including querying graph databases, recommendation tasks, and inference in large, noisy knowledge bases.
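As background for the abstract above, the base measure it builds on (personalized PageRank, or random walk with reset) can be sketched with plain power iteration. This is only the unweighted starting point: every out-edge gets equal weight, and the learned per-edge-label-sequence weights described in the abstract are not modelled. The function name, the `reset` value and the toy graph are illustrative assumptions.

```python
def personalized_pagerank(graph, seed, reset=0.15, iterations=50):
    """Random walk with reset to `seed`.

    `graph` maps each node to a list of successor nodes; every node must
    appear as a key. Uniform edge weights are an assumption of this sketch.
    """
    scores = {node: 0.0 for node in graph}
    scores[seed] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbors = graph[node]
            if not neighbors:
                # Dangling node: send the walking mass back to the seed.
                nxt[seed] += (1 - reset) * mass
                continue
            share = (1 - reset) * mass / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        # With probability `reset` the walker jumps back to the seed.
        nxt[seed] += reset * sum(scores.values())
        scores = nxt
    return scores


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
proximity = personalized_pagerank(toy_graph, seed="a")
```

The resulting scores sum to 1 and can be read as the proximity of each node to the seed node.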
Compressed data structures with relevance 4-5
  Jeffrey Scott Vitter
We describe recent breakthroughs in the field of compressed data structures, in which the data structure is stored in a compressed representation that still allows fast answers to queries. We focus in particular on compressed data structures to support the important application of pattern matching on massive document collections. Given an arbitrary query pattern in textual form, the job of the data structure is to report all the locations where the pattern appears. Another variant is to report all the documents that contain at least one instance of the pattern. We are particularly interested in reporting only the most relevant documents, using a variety of notions of relevance. We discuss recently developed techniques that support fast search in these contexts as well as under additional positional and temporal constraints.

KM track: recommender systems

LogUCB: an explore-exploit algorithm for comments recommendation 6-15
  Dhruv Kumar Mahajan; Rajeev Rastogi; Charu Tiwari; Adway Mitra
The highly dynamic nature of online commenting environments makes accurate ratings prediction for new comments challenging. In such a setting, in addition to exploiting comments with high predicted ratings, it is also critical to explore comments with high uncertainty in the predictions. In this paper, we propose a novel upper confidence bound (UCB) algorithm called LOGUCB that balances exploration with exploitation when the average rating of a comment is modeled using logistic regression on its features. At the core of our LOGUCB algorithm lies a novel variance approximation technique for the Bayesian logistic regression model that is used to compute the UCB value for each comment. In experiments with a real-life comments dataset from Yahoo! News, we show that LOGUCB with bag-of-words and topic features outperforms state-of-the-art explore-exploit algorithms.
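The explore-exploit idea in the abstract above can be illustrated with a generic upper confidence bound rule: rank each comment by its predicted mean rating plus a multiple of the prediction's standard deviation. This is a minimal sketch of the general UCB principle, not the LOGUCB variance approximation itself; the `predictions` mapping, the `alpha` parameter and the function names are hypothetical.

```python
import math


def ucb_score(mean, variance, alpha=1.0):
    """Optimistic estimate: predicted mean plus alpha standard deviations."""
    return mean + alpha * math.sqrt(variance)


def pick_comment(predictions, alpha=1.0):
    """Return the comment id with the highest UCB score.

    `predictions` maps a comment id to a (mean, variance) pair, assumed to
    come from some rating model that also reports its uncertainty.
    """
    return max(predictions, key=lambda cid: ucb_score(*predictions[cid], alpha))


predictions = {
    "well_known": (0.60, 0.0001),  # confident prediction
    "brand_new": (0.50, 0.04),     # lower mean, much higher uncertainty
}
explore_choice = pick_comment(predictions, alpha=1.0)
exploit_choice = pick_comment(predictions, alpha=0.0)
```

With `alpha=1.0` the uncertain comment wins (0.50 + 0.20 against 0.60 + 0.01), while `alpha=0.0` reduces to pure exploitation of the predicted mean.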
DQR: a probabilistic approach to diversified query recommendation 16-25
  Ruirui Li; Ben Kao; Bin Bi; Reynold Cheng; Eric Lo
Web search queries issued by casual users are often short and with limited expressiveness. Query recommendation is a popular technique employed by search engines to help users refine their queries. Traditional similarity-based methods, however, often result in redundant and monotonic recommendations. We identify five basic requirements of a query recommendation system. In particular, we focus on the requirements of redundancy-free and diversified recommendations. We propose the DQR framework, which mines a search log to achieve two goals: (1) It clusters search log queries to extract query concepts, based on which recommended queries are selected. (2) It employs a probabilistic model and a greedy heuristic algorithm to achieve recommendation diversification. Through a comprehensive user study we compare DQR against five other recommendation methods. Our experiment shows that DQR outperforms the other methods in terms of relevancy, diversity, and ranking performance of the recommendations.
Dynamic covering for recommendation systems 26-34
  Ioannis Antonellis; Anish Das Sarma; Shaddin Dughmi
In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems such as in eBay, Netflix, and Amazon. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem [14]: Given an integer $k$, $X=\{1,2,\ldots,n\}$ and $I=\{S_1,\ldots,S_m\}$, $S_i \subseteq X$, find $J \subseteq I$ such that $|J| < k$ and $\bigcup_{S \in J} S$ is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At query-time, we are given a query $Q \subseteq X$, and our goal is to find $J$ such that $Q \cap \bigcup_{S \in J} S$ is as large as possible; (2) Space-constrained: We don't have enough space to store (and process) the entire input; specifically, we have $o(mn)$, and maybe as little as $O((m+n)\,\mathrm{polylog}(mn))$ space. A solution to SDC maintains a small data structure, and uses this data structure to answer most dynamic queries with high accuracy. We call such a scheme a Coverage Oracle.
   We present algorithms and complexity results for coverage oracles. We present deterministic and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a function of the amount of space available to the oracle. Our lower bound results show that to obtain constant-factor approximations we need $\Omega(mn)$ space. Fortunately, our upper bounds present an explicit tradeoff between space and approximation ratio, allowing us to determine the amount of space needed to guarantee certain accuracy.
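For orientation, the underlying Max-Coverage problem and its query-restricted (dynamic) variant can be sketched with the classic greedy heuristic, which stores the whole input. It is therefore a baseline that ignores the space constraint, not the coverage oracle studied in the paper. The function name and the `budget` parameter are assumptions of this sketch; `budget` is the maximum number of sets to pick, so the abstract's bound on the size of J corresponds to `budget = k - 1`.

```python
def greedy_cover(sets, budget, query=None):
    """Greedily pick up to `budget` sets that cover as much as possible.

    `sets` maps a set name to a Python set of elements. If `query` is given,
    only elements of the query count towards coverage (the dynamic variant).
    """
    target = set().union(*sets.values()) if query is None else set(query)
    covered, chosen = set(), []
    for _ in range(budget):
        remaining = [name for name in sets if name not in chosen]
        if not remaining:
            break
        # Pick the set that adds the most not-yet-covered target elements.
        best = max(remaining, key=lambda name: len((sets[name] & target) - covered))
        gain = (sets[best] & target) - covered
        if not gain:
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered


collection = {"S1": {1, 2, 3}, "S2": {3, 4}, "S3": {4, 5, 6, 7}}
static_choice, static_covered = greedy_cover(collection, budget=2)
query_choice, query_covered = greedy_cover(collection, budget=2, query={3, 4})
```

Without a query the greedy pass picks S3 and then S1; for the query {3, 4} it picks S2 alone, which shows why a precomputed static answer is not enough for the dynamic setting.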
MEET: a generalized framework for reciprocal recommender systems 35-44
  Lei Li; Tao Li
Reciprocal recommender systems refer to systems from which users can obtain recommendations of other individuals by satisfying preferences of both parties being involved. Different from the traditional user-item recommendation, reciprocal recommenders focus on the preferences of both parties simultaneously, as well as some special properties in terms of "reciprocal". In this paper, we propose MEET -- a generalized framework for reciprocal recommendation, in which we model the correlations of users as a bipartite graph that maintains both local and global "reciprocal" utilities. The local utility captures users' mutual preferences, whereas the global utility manages the overall quality of the entire reciprocal network. Extensive empirical evaluation on two real-world data sets (online dating and online recruiting) demonstrates the effectiveness of our proposed framework compared with existing recommendation algorithms. Our analysis also provides deep insights into the special aspects of reciprocal recommenders that differentiate them from user-item recommender systems.
Social contextual recommendation 45-54
  Meng Jiang; Peng Cui; Rui Liu; Qiang Yang; Fei Wang; Wenwu Zhu; Shiqiang Yang
Exponential growth of information generated by online social networks demands effective recommender systems to give useful results. Traditional techniques fall short because they ignore social relation data; existing social recommendation approaches consider social network structure, but social context has not been fully considered. It is significant and challenging to fuse social contextual factors, which are derived from users' motivations for social behaviors, into social recommendation. In this paper, we investigate social recommendation on the basis of psychology and sociology studies, which exhibit two important factors: individual preference and interpersonal influence. We first present the particular importance of these two factors in online item adoption and recommendation. Then we propose a novel probabilistic matrix factorization method to fuse them in latent spaces. We conduct experiments on both Facebook-style bidirectional and Twitter-style unidirectional social network datasets in China. The empirical results and analysis on these two large datasets demonstrate that our method significantly outperforms the existing approaches.

KM track: pattern mining

Mining high utility itemsets without candidate generation 55-64
  Mengchi Liu; Junfeng Qu
High utility itemsets refer to the sets of items with high utility, such as profit, in a database. Efficient mining of high utility itemsets plays a crucial role in many real-life applications and is an important research issue in the data mining area. To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms incur the problem that a very large number of candidates are generated, but most of the candidates turn out not to have high utility after their exact utilities are computed. In this paper, we propose an algorithm, called HUI-Miner (High Utility Itemset Miner), for high utility itemset mining. HUI-Miner uses a novel structure, called utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database. We compared HUI-Miner with the state-of-the-art algorithms on various databases, and experimental results show that HUI-Miner outperforms these algorithms in terms of both running time and memory consumption.
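To make the notion of itemset utility concrete, the sketch below computes exact utilities by brute force. It assumes the common formulation in which an item's utility in a transaction is its purchased quantity times a per-unit profit; the abstract does not spell this out, and the code does not implement HUI-Miner or its utility-list structure. All names and figures are illustrative.

```python
def itemset_utility(itemset, transactions, profit):
    """Total utility of `itemset` over all transactions that contain it.

    Each transaction maps item -> purchased quantity; `profit` maps
    item -> per-unit profit. Utility = quantity * unit profit (assumed).
    """
    itemset = set(itemset)
    total = 0
    for quantities in transactions:
        if itemset.issubset(quantities):
            total += sum(quantities[item] * profit[item] for item in itemset)
    return total


transactions = [
    {"bread": 2, "milk": 1},
    {"bread": 1},
    {"bread": 1, "milk": 3, "jam": 1},
]
profit = {"bread": 1, "milk": 2, "jam": 5}

bread_utility = itemset_utility({"bread"}, transactions, profit)
bread_milk_utility = itemset_utility({"bread", "milk"}, transactions, profit)
```

Here the superset {bread, milk} has a higher utility (11) than its subset {bread} (4), so utility is not anti-monotone the way support is. That is the reason candidate-based algorithms rely on overestimates for pruning.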
A general framework to encode heterogeneous information sources for contextual pattern mining 65-74
  Weishan Dong; Wei Fan; Lei Shi; Changjin Zhou; Xifeng Yan
Traditional pattern mining methods usually work on single data sources. However, in practice, there are often multiple and heterogeneous information sources. They collectively provide contextual information not available in any single source alone describing the same set of objects, and are useful for discovering hidden contextual patterns. One important challenge is to provide a general methodology to mine contextual patterns easily and efficiently. In this paper, we propose a general framework to encode contextual information from multiple sources into a coherent representation -- Contextual Information Graph (CIG). The complexity of the encoding scheme is linear in both time and space. More importantly, CIG can be handled by any single-source pattern mining algorithms that accept taxonomies without any modification. We demonstrate by three applications of the contextual association rule, sequence and graph mining, that contextual patterns providing rich and insightful knowledge can be easily discovered by the proposed framework. It enables Contextual Pattern Mining (CPM) by reusing single-source methods, and is easy to deploy and use in real-world systems.
Incorporating occupancy into frequent pattern mining for high quality pattern recommendation 75-84
  Linpeng Tang; Lei Zhang; Ping Luo; Min Wang
Mining interesting patterns from transaction databases has attracted a lot of research interest for more than a decade. Most of those studies use frequency, the number of times a pattern appears in a transaction database, as the key measure for pattern interestingness. In this paper, we introduce a new measure of pattern interestingness, occupancy. The measure of occupancy is motivated by some real-world pattern recommendation applications which require that any interesting pattern X should occupy a large portion of the transactions it appears in. Namely, for any supporting transaction t of pattern X, the number of items in X should be close to the total number of items in t. In these pattern recommendation applications, patterns with higher occupancy may lead to higher recall while patterns with higher frequency lead to higher precision. With the definition of occupancy we call a pattern dominant if its occupancy is above a user-specified threshold. Then, our task is to identify the qualified patterns which are both frequent and dominant. Additionally, we also formulate the problem of mining top-k qualified patterns: finding the qualified patterns with the top-k values of any function (e.g. weighted sum of both occupancy and support).
   The challenge to these tasks is that the monotone or anti-monotone property does not hold on occupancy. In other words, the value of occupancy does not increase or decrease monotonically when we add more items to a given itemset. Thus, we propose an algorithm called DOFIA (DOminant and Frequent Itemset mining Algorithm), which explores the upper bound properties on occupancy to reduce the search process. The tradeoff between bound tightness and computational complexity is also systematically addressed. Finally, we show the effectiveness of DOFIA in a real-world application on print-area recommendation for Web pages, and also demonstrate the efficiency of DOFIA on several large synthetic data sets.
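The occupancy idea described above can be illustrated with a small calculation. The sketch assumes occupancy is the arithmetic mean of |X| / |t| over the transactions t that contain pattern X; the abstract only says the pattern should occupy a large portion of its supporting transactions, so the exact aggregate is an assumption, and the code is not the DOFIA algorithm.

```python
def occupancy(pattern, transactions):
    """Mean share of each supporting transaction taken up by `pattern`.

    Assumption: arithmetic mean of len(pattern) / len(t) over all
    transactions t that contain every item of the pattern.
    """
    pattern = frozenset(pattern)
    ratios = [len(pattern) / len(t) for t in transactions if pattern <= set(t)]
    return sum(ratios) / len(ratios) if ratios else 0.0


transactions = [
    {"a"},
    {"a", "b"},
    {"a", "b", "c", "d", "e", "f"},
]

occ_a = occupancy({"a"}, transactions)              # (1 + 1/2 + 1/6) / 3
occ_ab = occupancy({"a", "b"}, transactions)        # (1 + 2/6) / 2
occ_abc = occupancy({"a", "b", "c"}, transactions)  # 3/6
```

On this toy data occupancy rises when b is added to {a} and falls again when c is added, which illustrates why neither a monotone nor an anti-monotone pruning rule applies directly.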
PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce 85-94
  Matteo Riondato; Justin A. DeBrabant; Rodrigo Fonseca; Eli Upfal
Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
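The sample-then-aggregate structure described in the abstract above can be sketched on a single machine. This is a loose illustration only: the fixed `sample_size`, the majority-vote filter, the use of the median frequency and the restriction to itemsets of at most two items are all assumptions of the sketch, and it carries none of the paper's probabilistic guarantees or its MapReduce implementation.

```python
import random
import statistics
from collections import Counter
from itertools import combinations


def mine_sample(transactions, sample_size, min_freq, rng, max_len=2):
    """Frequent itemsets (up to `max_len` items) in one random sample."""
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    counts = Counter()
    for transaction in sample:
        items = sorted(transaction)
        for size in range(1, max_len + 1):
            counts.update(combinations(items, size))
    return {
        itemset: count / sample_size
        for itemset, count in counts.items()
        if count / sample_size >= min_freq
    }


def aggregate(results, min_votes):
    """Keep itemsets reported by at least `min_votes` samples (assumed rule)."""
    votes = Counter(itemset for result in results for itemset in result)
    return {
        itemset: statistics.median(r[itemset] for r in results if itemset in r)
        for itemset, n in votes.items()
        if n >= min_votes
    }


def sampled_fim(transactions, machines=4, sample_size=50, min_freq=0.5, seed=0):
    """Simulate `machines` workers, each mining its own fixed-size sample."""
    rng = random.Random(seed)
    results = [
        mine_sample(transactions, sample_size, min_freq, rng)
        for _ in range(machines)
    ]
    return aggregate(results, min_votes=machines // 2 + 1)


approx = sampled_fim([{"a", "b"}] * 100)
```

The sample size is a constant that does not grow with `len(transactions)`, which mirrors the point the abstract makes about cost no longer depending on the dataset size.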
Interactive pattern mining on hidden data: a sampling-based solution 95-104
  Mansurul Bhuiyan; Snehasis Mukhopadhyay; Mohammad Al Hasan
Mining frequent patterns from a hidden dataset is an important task with various real-life applications. In this research, we propose a solution to this problem that is based on Markov Chain Monte Carlo (MCMC) sampling of frequent patterns. Instead of returning all the frequent patterns, the proposed paradigm returns a small set of randomly selected patterns so that the clandestinity of the dataset can be maintained. Our solution also allows interactive sampling, so that the sampled patterns can fulfill the user's requirement effectively. We show experimental results from several real-life datasets to validate the capability and usefulness of our solution; in particular, we show examples in which, by using our proposed solution, an eCommerce marketplace can allow pattern mining on user session data without disclosing the data to the public; such a mining paradigm helps the sellers of the marketplace, which eventually boosts the marketplace's own revenue.

IR track: evaluation methodologies

An analysis of systematic judging errors in information retrieval 105-114
  Gabriella Kazai; Nick Craswell; Emine Yilmaz; S. M. M. Tahaghoghi
Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, there is reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers and two groups of trained judges of a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging depending on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, are more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
On caption bias in interleaving experiments 115-124
  Katja Hofmann; Fritz Behr; Filip Radlinski
Information retrieval evaluation most often involves manually assessing the relevance of particular query-document pairs. In cases where this is difficult (such as personalized search), interleaved comparison methods are becoming increasingly common. These methods compare pairs of ranking functions based on user clicks on search results, thus better reflecting true user preferences. However, by depending on clicks, there is a potential for bias. For example, users have been previously shown to be more likely to click on results with attractive titles and snippets. An interleaving evaluation where one ranker tends to generate results that attract more clicks (without being more relevant) may thus be biased.
   We present an approach for detecting and compensating for this type of bias in interleaving evaluations. Introducing a new model of caption bias, we propose features that model bias based on (1) per-document effects, and (2) the (pairwise) relationships between a document and surrounding documents. We show that our model can effectively capture click behavior, with best results achieved by a model that combines both per-document and pairwise features. Applying this model to re-weight observed user clicks, we find a small overall effect on real interleaving comparisons, but also identify a case where initially detected preferences vanish after caption bias re-weighting is applied. Our results indicate that our model of caption bias is effective and can successfully identify interleaving experiments affected by caption bias.
Alternative assessor disagreement and retrieval depth 125-134
  William Webber; Praveen Chandar; Ben Carterette
Assessors are well known to disagree frequently on the relevance of documents to a topic, but the factors leading to assessor disagreement are still poorly understood. In this paper, we examine the relationship between the rank at which a document is returned by a set of retrieval systems and the likelihood that a second assessor will disagree with the relevance assessment of the initial assessor, and find that there is a strong and consistent correlation between the two. We adopt a metarank method of summarizing a document's rank across multiple runs, and propose a logistic regression predictive model of second assessor disagreement given metarank and initially-assessed relevance. The consistency of the model parameters across different topics, assessor pairs, and collections is considered. The model gives comparatively accurate predictions of absolute system scores, but less consistent predictions of relative scores than a simpler rank-insensitive model. We demonstrate that the logistic regression model is robust to using sampled, rather than exhaustive, dual assessment. We demonstrate the use of the sampled predictive model to incorporate assessor disagreement into tests of statistical significance.
Incorporating variability in user behavior into systems based evaluation 135-144
  Ben Carterette; Evangelos Kanoulas; Emine Yilmaz
Click logs present a wealth of evidence about how users interact with a search system. This evidence has been used for many things: learning rankings, personalizing, evaluating effectiveness, and more. But it is almost always distilled into point estimates of feature or parameter values, ignoring what may be the most salient feature of users -- their variability. No two users interact with a system in exactly the same way, and even a single user may interact with results for the same query differently depending on information need, mood, time of day, and a host of other factors. We present a Bayesian approach to using logs to compute posterior distributions for probabilistic models of user interactions. Since they are distributions rather than point estimates, they naturally capture variability in the population. We show how to cluster posterior distributions to discover patterns of user interactions in logs, and discuss how to use the clusters to evaluate search engines according to a user model. Because the approach is Bayesian, our methods can be applied to very large logs (such as those possessed by Web search engines) as well as very small (such as those found in almost any other setting).
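The contrast between a point estimate and a posterior distribution, as drawn in the abstract above, can be shown with the simplest possible case: a click probability with a Beta prior updated from click counts. The Beta-Bernoulli model, the uniform prior and the function name are assumptions chosen for illustration; the abstract does not say which probabilistic user models the authors use.

```python
def click_posterior(clicks, views, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of a click probability.

    Assumes a Beta(prior_a, prior_b) prior and independent Bernoulli
    clicks, so the posterior is Beta(prior_a + clicks, prior_b + skips).
    """
    a = prior_a + clicks
    b = prior_b + (views - clicks)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


small_log = click_posterior(clicks=3, views=10)
large_log = click_posterior(clicks=300, views=1000)
```

Both logs have the same observed click rate, but the posterior from the small log is much wider. That spread is the information a point estimate throws away, and it is also why the same procedure remains usable on very small logs.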
Constructing test collections by inferring document relevance via extracted relevant information 145-154
  Shahzad Rajput; Matthew Ekstrand-Abueg; Virgil Pavlu; Javed A. Aslam
The goal of a typical information retrieval system is to satisfy a user's information need -- e.g., by providing an answer or information "nugget" -- while the actual search space of a typical information retrieval system consists of documents -- i.e., collections of nuggets. In this paper, we characterize this relationship between nuggets and documents and discuss applications to system evaluation.
   In particular, for the problem of test collection construction for IR system evaluation, we demonstrate a highly efficient algorithm for simultaneously obtaining both relevant documents and relevant information. Our technique exploits the mutually reinforcing relationship between relevant documents and relevant information, yielding document-based test collections whose efficiency and efficacy exceed those of typical Cranfield-style test collections, while also generating sets of highly relevant information.

IR track: social media search

Twevent: segment-based event detection from tweets 155-164
  Chenliang Li; Aixin Sun; Anwitaman Datta
Event detection from tweets is an important task to understand the current events/topics attracting a large number of common users. However, the unique characteristics of tweets (e.g. short and noisy content, diverse and fast changing topics, and large data volume) make event detection a challenging task. Most existing techniques proposed for well written documents (e.g. news articles) cannot be directly adopted. In this paper, we propose a segment-based event detection system for tweets, called Twevent. Twevent first detects bursty tweet segments as event segments and then clusters the event segments into events considering both their frequency distribution and content similarity. More specifically, each tweet is split into non-overlapping segments (i.e. phrases that possibly refer to named entities or semantically meaningful information units). The bursty segments are identified within a fixed time window based on their frequency patterns, and each bursty segment is described by the set of tweets containing the segment published within that time window. The similarity between a pair of bursty segments is computed using their associated tweets. After clustering bursty segments into candidate events, Wikipedia is exploited to identify the realistic events and to derive the most newsworthy segments to describe the identified events. We evaluate Twevent and compare it with the state-of-the-art method using 4.3 million tweets published by Singapore-based users in June 2010. In our experiments, Twevent outperforms the state-of-the-art method by a large margin in terms of both precision and recall. More importantly, the events detected by Twevent can be easily interpreted with little background knowledge because of the newsworthy segments. We also show that Twevent is efficient and scalable, leading to a desirable solution for event detection from tweets.
Making your interests follow you on Twitter 165-174
  Marco Pennacchiotti; Fabrizio Silvestri; Hossein Vahabi; Rossano Venturini
In this paper we introduce the task of "tweet recommendation", the problem of suggesting tweets that match a user's interests and likes. We propose an Information-Retrieval-like model that leverages the content of the user's tweets and those of her friends, and that effectively retrieves a set of tweets that is personalized and varied in nature. Our approach could be easily leveraged to build, for example, a Twitter or Facebook timeline that collects messages that are of interest for the user, but that are not posted by her friends. We compare to typical approaches used in similar tasks, reporting significant gains in terms of overall precision, up to about +20%, on both a corpus-based evaluation and a real-world user study.
Generating event storylines from microblogs 175-184
  Chen Lin; Chun Lin; Jingxuan Li; Dingding Wang; Yang Chen; Tao Li
Microblogging services have emerged as a dominant web medium for billions of individuals sharing and spreading instant news and information; monitoring event evolution in the microblog sphere is therefore crucial for providing both a better user experience and a deeper understanding of real-time events. In this paper we explore the problem of generating storylines from microblogs for user input queries. This problem is challenging due to the sparse, dynamic and social nature of microblogs. Given a query of an ongoing event, we propose to sketch the real-time storyline of the event by a two-level solution. We first propose a language model with dynamic pseudo relevance feedback to obtain relevant tweets, and then generate storylines via graph optimization. Comprehensive experiments on Twitter data sets demonstrate the effectiveness of the proposed methods in each level and the overall framework.
Social book search: comparing topical relevance judgements and book suggestions for evaluation BIBAFull-Text 185-194
  Marijn Koolen; Jaap Kamps; Gabriella Kazai
The Web and social media give us access to a wealth of information, not only different in quantity but also in character -- traditional descriptions from professionals are now supplemented with user generated content. This challenges modern search systems based on the classical model of topical relevance and ad hoc search: How does their effectiveness transfer to the changing nature of information and to the changing types of information needs and search tasks? We use the INEX 2011 Books and Social Search Track's collection of book descriptions from Amazon and social cataloguing site LibraryThing. We compare classical IR with social book search in the context of the LibraryThing discussion forums where members ask for book suggestions. Specifically, we compare book suggestions on the forum with Mechanical Turk judgements on topical relevance and recommendation, both the judgements directly and their resulting evaluation of retrieval systems. First, the book suggestions on the forum are a complete enough set of relevance judgements for system evaluation. Second, topical relevance judgements result in a different system ranking from evaluation based on the forum suggestions. Although it is an important aspect for social book search, topical relevance is not sufficient for evaluation. Third, professional metadata alone is often not enough to determine the topical relevance of a book. User reviews provide a better signal for topical relevance. Fourth, user-generated content is more effective for social book search than professional metadata. Based on our findings, we propose an experimental evaluation that better reflects the complexities of social book search.
Content-based crowd retrieval on the real-time web BIBAFull-Text 195-204
  Krishna Y. Kamath; James Caverlee
In this paper, we propose and evaluate a novel content-driven crowd discovery algorithm that can efficiently identify newly-formed communities of users from the real-time web. Short-lived crowds reflect the real-time interests of their constituents and provide a foundation for user-focused web monitoring. Three of the salient features of the algorithm are its: (i) prefix-tree based locality-sensitive hashing approach for discovering crowds from high-volume rapidly-evolving social media; (ii) efficient user profile updating for incorporating new user activities and fading older ones; and (iii) key dimension identification, so that crowd detection can be focused on the most active portions of the real-time web. Through extensive experimental study, we find significantly more efficient crowd discovery as compared to both a k-means clustering-based approach and a MapReduce-based implementation, while maintaining high-quality crowds as compared to an offline approach. Additionally, we find that expert crowds tend to be "stickier" and last longer in comparison to crowds of typical users.

KM track: link and graph mining

Graph classification: a diversified discriminative feature selection approach BIBAFull-Text 205-214
  Yuanyuan Zhu; Jeffrey Xu Yu; Hong Cheng; Lu Qin
A graph models complex structural relationships among objects, and has been prevalently used in a wide range of applications. Building an automated graph classification model becomes very important for predicting unknown graphs or understanding complex structures between different classes. The widely used graph classification framework consists of two steps, namely, feature selection and classification. The key issue is how to select important subgraph features from a graph database with a large number of graphs including positive graphs and negative graphs. Given the features selected, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach, which is based on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features that have a higher diversity. We analyze the properties of the newly proposed diversified discriminative score, and conduct extensive performance studies to demonstrate that such a diversified discriminative score makes positive/negative graphs separable and leads to a higher classification accuracy.
Multi-scale link prediction BIBAFull-Text 215-224
  Donghyuk Shin; Si Si; Inderjit S. Dhillon
The automated analysis of social networks has become an important problem due to the proliferation of social networks, such as LiveJournal, Flickr and Facebook. The scale of these social networks is massive and continues to grow rapidly. An important problem in social network analysis is proximity estimation that infers the closeness of different users. Link prediction, in turn, is an important application of proximity estimation. However, many methods for computing proximity measures have high computational complexity and are thus prohibitive for large-scale link prediction problems. One way to address this problem is to estimate proximity measures via low-rank approximation. However, a single low-rank approximation may not be sufficient to represent the behavior of the entire network. In this paper, we propose Multi-Scale Link Prediction (MSLP), a framework for link prediction, which can handle massive networks. The basic idea of MSLP is to construct low-rank approximations of the network at multiple scales in an efficient manner. To achieve this, we propose a fast tree-structured approximation algorithm.
   Based on this approach, MSLP combines predictions at multiple scales to make robust and accurate predictions. Experimental results on real-life datasets with more than a million nodes show the superior performance and scalability of our method.
An analysis of how ensembles of collective classifiers improve predictions in graphs BIBAFull-Text 225-234
  Hoda Eldardiry; Jennifer Neville
We present a theoretical analysis framework that shows how ensembles of collective classifiers can improve predictions for graph data. We show how collective ensemble classification reduces errors due to variance in learning and, more interestingly, in inference. We also present an empirical framework that includes various ensemble techniques for classifying relational data using collective inference. The methods span single- and multiple-graph network approaches, and are tested on both synthetic and real-world classification tasks. Our experimental results, supported by our theoretical justifications, confirm that ensemble algorithms that explicitly focus on both learning and inference processes, and aim at reducing errors associated with both, are the best performers.
Density index and proximity search in large graphs BIBAFull-Text 235-244
  Nan Li; Xifeng Yan; Zhen Wen; Arijit Khan
Given a large real-world graph where vertices are associated with labels, how do we quickly find interesting vertex sets according to a given query? In this paper, we study label-based proximity search in large graphs, which finds the top-k query-covering vertex sets with the smallest diameters. Each set has to cover all the labels in a query. Existing greedy algorithms only return approximate answers, and do not scale well to large graphs. We propose a novel framework, called gDensity, which uses density index and likelihood ranking to find vertex sets in an efficient and accurate manner. Promising vertices are ordered and examined according to their likelihood to produce answers, and the likelihood calculation is greatly facilitated by density indexing. Techniques such as progressive search and partial indexing are further proposed. Experiments on real-world graphs show the efficiency and scalability of gDensity.
Gelling, and melting, large graphs by edge manipulation BIBAFull-Text 245-254
  Hanghang Tong; B. Aditya Prakash; Tina Eliassi-Rad; Michalis Faloutsos; Christos Faloutsos
Controlling the dissemination of an entity (e.g., meme, virus, etc.) on a large graph is an interesting problem in many disciplines. Examples include epidemiology, computer security, marketing, etc. So far, previous studies have mostly focused on removing or inoculating nodes to achieve the desired outcome.
   We shift the problem to the level of edges and ask: which edges should we add or delete in order to speed up or contain a dissemination? First, we propose effective and scalable algorithms to solve these dissemination problems. Second, we conduct a theoretical study of the two problems and our methods, including the hardness of the problem, the accuracy and complexity of our methods, and the equivalence between the different strategies and problems. Third and lastly, we conduct experiments on real topologies of varying sizes to demonstrate the effectiveness and scalability of our approaches.

IR track: language technologies

One seed to find them all: mining opinion features via association BIBAFull-Text 255-264
  Zhen Hai; Kuiyu Chang; Gao Cong
Feature-based opinion analysis has attracted extensive attention recently. Identifying features associated with opinions expressed in reviews is essential for fine-grained opinion mining. One approach is to exploit the dependency relations that occur naturally between features and opinion words, and among features (or opinion words) themselves. In this paper, we propose a generalized approach to opinion feature extraction by incorporating robust statistical association analysis in a bootstrapping framework. The new approach starts with a small set of feature seeds, which it iteratively enlarges by mining feature-opinion, feature-feature, and opinion-opinion dependency relations. Two association model types, namely likelihood ratio tests (LRT) and latent semantic analysis (LSA), are proposed for computing the pair-wise associations between terms (features or opinions). We accordingly propose two robust bootstrapping approaches, LRTBOOT and LSABOOT, both of which need just a handful of initial feature seeds to bootstrap opinion feature extraction. We benchmarked LRTBOOT and LSABOOT against existing approaches on a large number of real-life reviews crawled from the cellphone and hotel domains. Experimental results using varying numbers of feature seeds show that the proposed association-based bootstrapping approach significantly outperforms the competitors. In fact, one seed feature is all that is needed for LRTBOOT to significantly outperform the other methods. This seed feature can simply be the domain feature, e.g., "cellphone" or "hotel". The consequence of our discovery is far reaching: starting with just one feature seed, typically just the domain concept word, LRTBOOT can automatically extract a large set of high-quality opinion features from the corpus without any supervision or labeled features. This means that the automatic creation of a set of domain features is no longer a pipe dream!
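   To make the likelihood ratio test mentioned in this abstract concrete, here is a minimal sketch of the standard log-likelihood ratio (G-squared) statistic for a 2x2 co-occurrence table. How the paper builds its counts and thresholds is not stated in the abstract, so the function name and the count layout below are assumptions.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 association statistic for a 2x2 contingency table of two terms.

    k11: contexts containing both terms
    k12: contexts containing the first term only
    k21: contexts containing the second term only
    k22: contexts containing neither term
    The score is 0 for independent terms and grows with the strength of the
    association (it does not distinguish attraction from repulsion).
    """
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = row * col / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

   In a bootstrapping loop of the kind the abstract describes, candidate terms whose score against a current seed exceeds some chosen threshold would be added to the seed set; the threshold is left open here.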
Topic-driven reader comments summarization BIBAFull-Text 265-274
  Zongyang Ma; Aixin Sun; Quan Yuan; Gao Cong
Readers of a news article often read its comments contributed by other readers. By reading comments, readers obtain not only complementary information about this news article but also the opinions from other readers. However, the existing ranking mechanisms for comments (e.g., by recency or by user rating) fail to offer an overall picture of topics discussed in comments. In this paper, we propose to study the Topic-driven Reader Comments Summarization (Torcs) problem. We observe that many news articles from a news stream are related to each other; so are their comments. Hence, news articles and their associated comments provide context information for user commenting. To implicitly capture the context information, we propose two topic models to address the Torcs problem, namely, Master-Slave Topic Model (MSTM) and Extended Master-Slave Topic Model (EXTM). Both models treat a news article as a master document and each of its comments as a slave document. The MSTM model constrains the topics discussed in comments to be derived from the news article being commented on. On the other hand, the EXTM model allows generating words of comments using both the topics derived from the news article being commented on, and the topics derived from all comments themselves. Both models are used to group comments into topic clusters. We then use two ranking mechanisms, Maximal Marginal Relevance (MMR) and Rating & Length (RL), to select a few most representative comments from each comment cluster. To evaluate the two models, we conducted experiments on 1005 Yahoo! News articles with more than one million comments. Our experimental results show that EXTM significantly outperforms MSTM in terms of perplexity. Through a user study, we also confirm that the comment summary generated by EXTM achieves better intra-cluster topic cohesion and inter-cluster topic diversity.
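   The Maximal Marginal Relevance (MMR) selection named in this abstract can be sketched as follows. This is a generic MMR loop over bag-of-words cosine similarity, written under the assumption that each comment cluster has a short topic description to compare against; the function names, the `topic` argument and the default `lam` weight are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(comments, topic, k, lam=0.7):
    """Pick up to k comments that are relevant to `topic` but not redundant.

    Each step chooses the comment maximizing
        lam * sim(comment, topic) - (1 - lam) * max sim(comment, already chosen).
    """
    vectors = [Counter(c.lower().split()) for c in comments]
    topic_vec = Counter(topic.lower().split())
    relevance = [cosine(v, topic_vec) for v in vectors]

    selected = []
    candidates = list(range(len(comments)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [comments[i] for i in selected]
```

   A lower `lam` penalizes near-duplicate comments more strongly, while `lam=1.0` reduces the loop to plain relevance ranking.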
Visualizing timelines: evolutionary summarization via iterative reinforcement between text and image streams BIBAFull-Text 275-284
  Rui Yan; Xiaojun Wan; Mirella Lapata; Wayne Xin Zhao; Pu-Jen Cheng; Xiaoming Li
We present a novel graph-based framework for timeline summarization, the task of creating different summaries for different timestamps but for the same topic. Our work extends timeline summarization to a multimodal setting and creates timelines that are both textual and visual. Our approach exploits the fact that news documents are often accompanied by pictures and the two share some common content. Our model optimizes local summary creation and global timeline generation jointly following an iterative approach based on mutual reinforcement and co-ranking. In our algorithm, individual summaries are generated by taking into account the mutual dependencies between sentences and images, and are iteratively refined by considering how they contribute to the global timeline and its coherence. Experiments on real-world datasets show that the timelines produced by our model outperform several competitive baselines both in terms of ROUGE and when assessed by human evaluators.
Fast multi-task learning for query spelling correction BIBAFull-Text 285-294
  Xu Sun; Anshumali Shrivastava; Ping Li
In this paper, we explore the use of a novel online multi-task learning framework for the task of search query spelling correction. In our procedure, correction candidates are initially generated by a ranker-based system and then re-ranked by our multi-task learning algorithm. With the proposed multi-task learning method, we are able to effectively transfer information from different and highly biased training datasets, improving spelling correction on all datasets. Our experiments are conducted on three query spelling correction datasets including the well-known TREC benchmark dataset. The experimental results demonstrate that our proposed method considerably outperforms the existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. Compared to the commonly used online learning methods, which typically require more than, e.g., 60 training passes, our proposed method is able to closely reach the empirical optimum in about 5 passes.
Cross-argument inference for implicit discourse relation recognition BIBAFull-Text 295-304
  Yu Hong; Xiaopei Zhou; Tingting Che; Jianmin Yao; Qiaoming Zhu; Guodong Zhou
Motivated by the critical importance of connectives in recognizing discourse relations, we present an unsupervised cross-argument inference mechanism for implicit discourse relation recognition. The basic idea is to infer the implicit discourse relation of an argument pair from a large number of comparable argument pairs, which are automatically retrieved from the web in an unsupervised way. In this way, the inference proceeds from explicit relations to implicit ones, with the connective as a bridge. This kind of pair-to-pair inference is based on the assumption that two argument pairs with high content similarity (i.e. comparable argument pairs) should have similar discourse relations. Evaluation on PDTB demonstrates the effectiveness of our inference mechanism in implicit relation recognition for the four level-1 relations. It also shows that our mechanism significantly outperforms other alternatives.

DB track: graph and knowledge base

Interpreting keyword queries over web knowledge bases BIBAFull-Text 305-314
  Jeffrey Pound; Alexander K. Hudek; Ihab F. Ilyas; Grant Weddell
Many keyword queries issued to Web search engines target information about real world entities, and interpreting these queries over Web knowledge bases can often enable the search system to provide exact answers to queries. Equally important is the problem of detecting when the reference knowledge base is not capable of answering the keyword query, due to lack of domain coverage.
   In this work we present an approach to computing structured representations of keyword queries over a reference knowledge base. We mine frequent query structures from a Web query log and map these structures into a reference knowledge base. Our approach exploits coarse linguistic structure in keyword queries, and combines it with rich structured query representations of information needs.
RDF pattern matching using sortable views BIBAFull-Text 315-324
  Zhihong Chong; He Chen; Zhenjie Zhang; Hu Shu; Guilin Qi; Aoying Zhou
In the last few years, RDF has become the dominant data model used in the semantic web for knowledge representation and inference. In this paper, we revisit the problem of pattern matching queries in the RDF model, which are usually expensive due to the huge cost of join operations. To alleviate this inefficiency, view materialization techniques are usually deployed to accelerate query processing. However, given an arbitrary view, it remains difficult to identify how to reuse the view for a particular query, because of the NP-hardness of matching patterns and views. To fully exploit the benefit of materialized views, we propose a new paradigm to enhance their effectiveness. Instead of choosing materialized views in arbitrary form, our paradigm aims to select the views only if they are sortable. The property of sortability yields huge gains in pattern-view matching, bringing down the cost to linear complexity in terms of the pattern size. On the other hand, the costs of identifying sortable views and searching over the views using an inverted index are affordable. Moreover, sortable views generally improve the overall performance of pattern matching, by means of a cost model used to optimize the query rewriting on the most appropriate views. Finally, we present extensive experimental results to verify the superiority of our proposal in both efficiency and effectiveness.
Efficient algorithms for generalized subgraph query processing BIBAFull-Text 325-334
  Wenqing Lin; Xiaokui Xiao; James Cheng; Sourav S. Bhowmick
We study a new type of graph query, which injectively maps its edges to paths of the graphs in a given database, where the length of each path is constrained by a given threshold specified by the weight of the corresponding matching edge. We give important applications of the new graph query and identify new challenges of processing such a query. Then, we devise the cost model of the branch-and-bound algorithm framework for processing the graph query, and propose an efficient algorithm to minimize the cost overhead. We also develop three indexing techniques to efficiently answer the queries online. Finally, we verify the efficiency of our proposed indexes with extensive experiments on large real and synthetic datasets.
G-SPARQL: a hybrid engine for querying large attributed graphs BIBAFull-Text 335-344
  Sherif Sakr; Sameh Elnikety; Yuxiong He
We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language expresses types of queries which are of great interest for applications that model their data as large graphs, such as pattern matching, reachability and shortest path queries. Each query can combine both structural predicates and value-based predicates (on the attributes of the graph nodes and edges). We describe an algebraic compilation mechanism for our proposed query language which is extended from the relational algebra and based on the basic construct of building SPARQL queries, the Triple Pattern. We describe a hybrid Memory/Disk representation of large attributed graphs where only the topology of the graph is maintained in memory while the data of the graph is stored in a relational database. The execution engine of our proposed query language splits the query plan so that some parts are pushed inside the relational database while other parts are processed using memory-based algorithms, as necessary. Experimental results on real datasets demonstrate the efficiency and the scalability of our approach and show that our approach outperforms native graph databases by several factors.
A graph-based approach for ontology population with named entities BIBAFull-Text 345-354
  Wei Shen; Jianyong Wang; Ping Luo; Min Wang
Automatically populating an ontology with named entities extracted from unstructured text has become a key issue for Semantic Web and knowledge management techniques. This issue naturally consists of two subtasks: (1) for the entity mention whose mapping entity does not exist in the ontology, attach it to the right category in the ontology (i.e., fine-grained named entity classification), and (2) for the entity mention whose mapping entity is contained in the ontology, link it with its mapping real world entity in the ontology (i.e., entity linking). Previous studies focus on only one of the two subtasks and cannot solve the task of populating an ontology with named entities as a whole. This paper proposes APOLLO, a grAph-based aPproach for pOpuLating ontoLOgy with named entities. APOLLO leverages the rich semantic knowledge embedded in Wikipedia to resolve this task via random walks on graphs. Meanwhile, APOLLO can be directly applied to either of the two subtasks with minimal revision. We have conducted a thorough experimental study to evaluate the performance of APOLLO. The experimental results show that APOLLO achieves significant accuracy improvement for the task of ontology population with named entities, and outperforms the baseline methods for both subtasks.

DB track: temporal, spatial and multimedia databases

Decomposition-by-normalization (DBN): leveraging approximate functional dependencies for efficient tensor decomposition BIBAFull-Text 355-364
  Mijung Kim; K. Selçuk Candan
For many multi-dimensional data applications, tensor operations as well as relational operations need to be supported throughout the data lifecycle. Although tensor decomposition is shown to be effective for multi-dimensional data analysis, the cost of tensor decomposition is often very high. We propose a novel decomposition-by-normalization scheme that first normalizes the given relation into smaller tensors based on the functional dependencies of the relation and then performs the decomposition using these smaller tensors. The decomposition and recombination steps of the decomposition-by-normalization scheme fit naturally in settings with multiple cores. This leads to a highly efficient, effective, and parallelized decomposition-by-normalization algorithm for both dense and sparse tensors. Experiments confirm the efficiency and effectiveness of the proposed decomposition-by-normalization scheme compared to the conventional nonnegative CP decomposition approach.
A filter-based protocol for continuous queries over imprecise location data BIBAFull-Text 365-374
  Yifan Jin; Reynold Cheng; Ben Kao; Kam-Yiu Lam; Yinuo Zhang
In typical location-based services (LBS), moving objects (e.g., GPS-enabled mobile phones) report their locations through a wireless network. An LBS server can use the location information to answer various types of continuous queries. Due to hardware limitations, location data reported by the moving objects are often uncertain. In this paper, we study efficient methods for the execution of Continuous Possible Nearest Neighbor Query (CPoNNQ) that accesses imprecise location data. A CPoNNQ is a standing query (which is active during a period of time) such that, at any time point, all moving objects that have non-zero probabilities of being the nearest neighbor of a given query point are reported. To handle the continuous nature of a CPoNNQ, a simple solution is to require moving objects to continuously report their locations to the LBS server, which evaluates the query at every time step. To save communication bandwidth and mobile devices' batteries, we develop two filter-based protocols for CPoNNQ evaluation. Our protocols install "filter bounds" on moving objects, which suppress unnecessary location reporting and communication between the server and the moving objects. Through extensive experiments, we show that our protocols can effectively reduce communication costs while maintaining a high query quality.
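   The query semantics described in this abstract (report every object with a non-zero probability of being the nearest neighbor) can be illustrated with a small sketch. It assumes each imprecise location is a circular uncertainty region, which the abstract does not specify, and it covers only a single evaluation of the query, not the filter-based protocols; all names are hypothetical.

```python
import math


def possible_nearest_neighbors(query, objects):
    """Return the objects that could be the nearest neighbor of `query`.

    query:   (x, y) coordinates of the query point.
    objects: dict mapping object name -> (x, y, radius), where the true
             location is assumed to lie somewhere inside that circle.
    An object qualifies if its smallest possible distance to the query does
    not exceed the smallest of all objects' largest possible distances.
    Ties are kept, so the result is a conservative superset on boundaries.
    """
    bounds = {}
    for name, (x, y, radius) in objects.items():
        center_dist = math.hypot(x - query[0], y - query[1])
        nearest = max(0.0, center_dist - radius)  # query may lie inside the circle
        farthest = center_dist + radius
        bounds[name] = (nearest, farthest)

    if not bounds:
        return set()
    cutoff = min(farthest for _, farthest in bounds.values())
    return {name for name, (nearest, _) in bounds.items() if nearest <= cutoff}
```

   Re-running such a check at every time step corresponds to the simple solution the abstract mentions; the filter bounds it proposes aim to avoid the location reports that would not change this answer set.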
Leveraging read rates of passive RFID tags for real-time indoor location tracking BIBAFull-Text 375-384
  Da Yan; Zhou Zhao; Wilfred Ng
RFID (radio frequency identification) technology has been widely used for object tracking in many real-life applications, such as inventory monitoring and product flow tracking. These applications usually rely on passive RFID technologies rather than active ones, since passive RFID tags are more attractive than active ones in many aspects, such as lower tag cost and simpler maintenance.
   RFID technology is also important for indoor location tracking systems that require a high degree of accuracy. However, most existing systems estimate object locations by using active RFID tags, which usually incur a localization error of more than one meter. Although recent studies have begun to investigate the application of passive tags for indoor location tracking, these methods are far from deployable and research on this application is still in its infancy.
   In this paper, we propose a new indoor location tracking system, named PassTrack, which relies on the read rates of passive RFID tags for location estimation. PassTrack is designed to tolerate noise arising from external environmental factors, by probabilistically modeling the relationship between tag read rate and tag-reader distance, and updating the model parameters based on the current readings of reference tags.
   Besides tolerance of noise, PassTrack is also outstanding in terms of localization accuracy and efficiency. Several new approaches for location inference are supported by PassTrack, and the best one incurs an average error of around 30 cm, and is able to carry out over 7500 location estimations per second on an ordinary machine. Furthermore, as a result of using passive RFID tags, PassTrack also enjoys the many other benefits of passive RFID tags mentioned before. We have conducted extensive experiments on both real and synthetic datasets, which demonstrate that our PassTrack system outperforms the previous localization approaches in localization accuracy, tracking efficiency and space applicability.
Location-aware instant search BIBAFull-Text 385-394
  Ruicheng Zhong; Ju Fan; Guoliang Li; Kian-Lee Tan; Lizhu Zhou
Location-Based Services (LBS) have been widely accepted by mobile users recently. Existing LBS-based systems require users to type in complete keywords. However, for mobile users, it is rather difficult to type in complete keywords on mobile devices. To alleviate this problem, in this paper we study the location-aware instant search problem, which returns location-aware answers to users as they type in queries letter by letter. The main challenge is to achieve high interactive speed. To address this challenge, we propose a novel index structure, prefix-region tree (called PR-Tree), to efficiently support location-aware instant search. PR-Tree is a tree-based index structure which seamlessly integrates the textual description and spatial information to index the spatial data. Using the PR-Tree, we develop efficient algorithms to support single prefix queries and multi-keyword queries. Experiments show that our method achieves high performance and significantly outperforms state-of-the-art methods.
Indexing uncertain spatio-temporal data BIBAFull-Text 395-404
  Tobias Emrich; Hans-Peter Kriegel; Nikos Mamoulis; Matthias Renz; Andreas Züfle
The advances in sensing and telecommunication technologies allow the collection and management of vast amounts of spatio-temporal data combining location and time information. Due to physical and resource limitations of data collection devices (e.g., RFID readers, GPS receivers and other sensors) data are typically collected only at discrete points of time. In between these discrete time instances, the positions of tracked moving objects are uncertain. In this work, we propose novel approximation techniques in order to probabilistically bound the uncertain movement of objects; these techniques allow for efficient and effective filtering during query evaluation using a hierarchical index structure. To the best of our knowledge, this is the first approach that supports query evaluation on very large uncertain spatio-temporal databases, adhering to possible worlds semantics. We experimentally show that it accelerates the existing, scan-based approach by orders of magnitude.

KM track: matrix methods and anomaly detection

Local anomaly descriptor: a robust unsupervised algorithm for anomaly detection based on diffusion space BIBAFull-Text 405-414
  Hao Huang; Hong Qin; Shinjae Yoo; Dantong Yu
Current popular anomaly detection algorithms are capable of detecting global anomalies but oftentimes fail to distinguish local anomalies from normal instances. This paper aims to improve unsupervised anomaly detection via the exploration of physics-based diffusion space. Building upon the embedding manifold derived from diffusion maps, we devise Local Anomaly Descriptor (LAD) whose originality results from faithfully preserving intrinsic and informative density-relevant neighborhood information. This robust and effective algorithm is designed with a weighted umbrella Laplacian operator to bridge global and local properties. To further enhance the efficacy of our proposed algorithm, we explore the utility of anisotropic Gaussian kernel (AGK) which can offer better manifold-aware affinity information. Comprehensive experiments on both synthetic and UCI real datasets verify that our LAD outperforms existing anomaly detection algorithms.
Fast and reliable anomaly detection in categorical data BIBAFull-Text 415-424
  Leman Akoglu; Hanghang Tong; Jilles Vreeken; Christos Faloutsos
Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm -- with high compression cost -- as anomalies.
   Our approach exhibits four key features: 1) it is parameter-free; it builds dictionaries directly from data, and requires no user-specified parameters such as distance functions or density and similarity thresholds, 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features, 3) it is scalable; its running time grows linearly with respect to both database size as well as number of dimensions, and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression, as well as precision in anomaly detection, outperforming its state-of-the-art competitors.
TALMUD: transfer learning for multiple domains BIBAFull-Text 425-434
  Orly Moreno; Bracha Shapira; Lior Rokach; Guy Shani
Most collaborative Recommender Systems (RS) operate in a single domain (such as movies, books, etc.) and are capable of providing recommendations based on historical usage data which is collected in the specific domain only. Cross-domain recommenders address the sparsity problem by using Machine Learning (ML) techniques to transfer knowledge from a dense domain into a sparse target domain. In this paper we propose a transfer learning technique that extracts knowledge from multiple domains containing rich data (e.g., movies and music) and generates recommendations for a sparse target domain (e.g., games). Our method learns the relatedness between the different source domains and the target domain, without requiring overlapping users between domains. The model integrates the appropriate amount of knowledge from each domain in order to enrich the target domain data. Experiments with several datasets reveal that using multiple sources and the relatedness between domains improves the accuracy of results.
Utilizing common substructures to speedup tensor factorization for mining dynamic graphs BIBAFull-Text 435-444
  Wei Liu; Jeffrey Chan; James Bailey; Christopher Leckie; Ramamohanarao Kotagiri
In large and complex graphs of social, chemical/biological, or other relations, frequent substructures are commonly shared by different graphs or by graphs evolving through different time periods. Tensors are natural representations of these complex time-evolving graph data. A factorization of a tensor provides a high-quality low-rank compact basis for each dimension of the tensor, which facilitates the interpretation of frequent substructures of the original graphs. However, the high computational cost of tensor factorization makes it infeasible for conventional tensor factorization methods to handle large graphs that evolve frequently with time. To address this problem, in this paper we propose a novel iterative tensor factorization (ITF) method whose time complexity is linear in the cardinalities of all dimensions of a tensor. This low time complexity means that when using tensors to represent dynamic graphs, the computational cost of ITF is linear in the size (number of edges/vertices) of graphs and is also linear in the number of time periods over which the graph evolves. More importantly, an error estimation of ITF suggests that its factorization correctness is comparable to that of the standard factorization method. We empirically evaluate our method on publication networks and chemical compound graphs, and demonstrate that ITF is an order of magnitude faster than the conventional method and at the same time preserves factorization quality. To the best of our knowledge, this research is the first work that uses important frequent substructures to speed up tensor factorizations for mining dynamic graphs.

KM track: social networks

Predicting emerging social conventions in online social networks BIBAFull-Text 445-454
  Farshad Kooti; Winter A. Mason; Krishna P. Gummadi; Meeyoung Cha
The way in which social conventions emerge in communities has been of interest to social scientists for decades. Here we report on the emergence of a particular social convention on Twitter -- the way to indicate that a tweet is being reposted and to attribute the content to its source. Despite being invented at different times and having different adoption rates, only two variations became widely adopted. In this paper we describe this process in detail, highlighting the factors that come into play in deciding which variation individuals will adopt. Our classification analysis demonstrates that the date of adoption and the number of exposures are particularly important in the adoption process, while personal features (such as the number of followers and join date) and the number of adopter friends have less discriminative power in predicting adoptions. We discuss implications of these findings in the design of future Web applications and services.
Collective intelligence in the online social network of yahoo!answers and its implications BIBAFull-Text 455-464
  Ze Li; Haiying Shen; Joseph Edward Grant
Question and Answer (Q&A) websites such as Yahoo!Answers provide a platform where users can post questions and receive answers. These systems take advantage of the collective intelligence of users to find information. In this paper, we analyze the online social network (OSN) in Yahoo!Answers. Based on a large amount of collected data, we study the OSN's structural properties, which reveal strikingly distinct characteristics such as low link symmetry and weak correlation between indegree and outdegree. After studying the knowledge base and behaviors of the users, we find that a small number of top contributors answer most of the questions in the system. Also, each top contributor focuses on only a few knowledge categories. In addition, the knowledge categories of the users are highly clustered. We also study the knowledge base in a user's social network, which reveals that the members in a user's social network share only a few knowledge categories. Based on the findings, we provide guidance in the design of spammer detection algorithms and distributed Q&A systems. We also propose a friendship-knowledge oriented Q&A framework that synergistically combines current OSN-based Q&A and web Q&A. We believe that the results presented in this paper are crucial in understanding the collective intelligence in the web Q&A OSNs and lay a cornerstone for the evolution of next-generation Q&A systems.
From face-to-face gathering to social structure BIBAFull-Text 465-474
  Chunyan Wang; Mao Ye; Wang-chien Lee
The rapid development of on-line social networking sites has dramatically changed the way people live and communicate. One particularly interesting phenomenon that came along with this development is the prominent role that various on-line networking portals play in scheduling and organizing off-line group events and activities. In this paper, we focus on studying the face-to-face (f2f) groups formed through, or facilitated by, on-line portals. We first show the distinct characteristics of such f2f groups by analyzing datasets collected from Whrrl and Meetup. Next, we propose a dynamic model for group gathering based on the process of friend invitation to interpret how an f2f group is formed on-line. The results of our model are confirmed by empirical observations. Finally, we demonstrate that using such group information can effectively improve the accuracies of social tie inference and friend recommendation.
Delineating social network data anonymization via random edge perturbation BIBAFull-Text 475-484
  Mingqiang Xue; Panagiotis Karras; Raissi Chedy; Panos Kalnis; Hung Keng Pung
Social network data analysis raises concerns about the privacy of related entities or individuals. To address this issue, organizations can publish data after simply replacing the identities of individuals with pseudonyms, leaving the overall structure of the social network unchanged. However, it has been shown that attacks based on structural identification (e.g., a walk-based attack) enable an adversary to re-identify selected individuals in an anonymized network. In this paper we explore the capacity of techniques based on random edge perturbation to thwart such attacks. We theoretically establish that any kind of structural identification attack can effectively be prevented using random edge perturbation and show that, surprisingly, important properties of the whole network, as well as of subgraphs thereof, can be accurately calculated and hence data analysis tasks performed on the perturbed data, given that the legitimate data recipient knows the perturbation probability as well. Yet we also examine ways to enhance the walk-based attack, proposing a variant we call probabilistic attack. Nevertheless, we demonstrate that such probabilistic attacks can also be prevented under sufficient perturbation. Finally, we conduct a thorough theoretical study of the probability of success of any structural attack as a function of the perturbation probability. Our analysis provides a powerful tool for delineating the identification risk of perturbed social network data; our extensive experiments with synthetic and real datasets confirm our expectations.
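A minimal sketch of why a recipient who knows the perturbation probability can still recover aggregate properties. It assumes a simple perturbation model in which every unordered node pair is flipped (edge added or removed) independently with probability p; the abstract does not spell out the exact scheme, so this model and the helper names `perturb_edges` and `estimate_edge_count` are assumptions for illustration.

```python
import itertools
import random


def perturb_edges(nodes, edges, p, rng):
    """Flip every unordered node pair independently with probability p (assumed model)."""
    original = {frozenset(edge) for edge in edges}
    published = set()
    for u, v in itertools.combinations(nodes, 2):
        pair = frozenset((u, v))
        present = pair in original
        if rng.random() < p:
            present = not present  # an edge becomes a non-edge and vice versa
        if present:
            published.add(pair)
    return published


def estimate_edge_count(observed_edges, num_nodes, p):
    """Invert E[observed] = m(1 - p) + (N - m)p, where N is the number of node pairs.

    Requires p != 0.5, since at p = 0.5 the published graph carries no information.
    """
    total_pairs = num_nodes * (num_nodes - 1) // 2
    return (observed_edges - total_pairs * p) / (1 - 2 * p)


nodes = list(range(50))
edges = [(i, (i + 1) % 50) for i in range(50)]  # a 50-node ring
published = perturb_edges(nodes, edges, p=0.05, rng=random.Random(7))
estimate = estimate_edge_count(len(published), len(nodes), p=0.05)
```

The estimator is unbiased under the assumed model, but a single published graph yields a noisy value; the variance grows as p approaches 0.5.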

IR track: advertising

Multiview hierarchical Bayesian regression model and application to online advertising BIBAFull-Text 485-494
  Tianbing Xu; Ruofei Zhang; Zhen Guo
With the development of Web applications, large scale data are popular; and they are not only getting richer, but also ubiquitously interconnected with users and other objects in various ways, which brings about multi-view data with implicit structure. In this paper, we propose a novel hierarchical Bayesian mixture regression model, which discovers and then exploits the relationships among multiple views of the data to perform various machine learning tasks. A stochastic EM inference and learning algorithm is derived; and a parallel implementation in Hadoop MapReduce [9] paradigm is developed to scale up the learning. We apply the developed model and algorithm on click-through-rate (CTR) prediction and campaign targeting recommendation in online advertising to measure its effectiveness. The experiments on both synthetic data and large scale ads serving data from a real world online advertising exchange demonstrate the superior CTR prediction accuracy of our method compared to existing state-of-the-art methods. The results also show that our model can recommend high performance targeting features for online advertising campaigns.
Visual appearance of display ads and its effect on click through rate BIBAFull-Text 495-504
  Javad Azimi; Ruofei Zhang; Yang Zhou; Vidhya Navalpakkam; Jianchang Mao; Xiaoli Fern
One of the most important categories of online advertising is display advertising which provides publishers with significant revenue. Similar to other categories, the main goal in display advertising is to maximize user response rate for advertising campaigns, such as click through rates (CTR) or conversion rates. Previous studies have tried to optimize these parameters using objectives such as behavioral targeting. However, there is no published work so far to address the effect of the visual appearance of ads (creatives) on user response rate via a systematic data-driven approach. In this paper, we quantitatively study the relationship between the visual appearance and performance of creatives using large scale data in the world's largest display ads exchange system, RightMedia. We designed a set of 43 visual features, some of which are novel and others are inspired by related work. We extracted these features from real creatives served on RightMedia. We also designed and conducted a series of experiments to evaluate the effectiveness of visual features for CTR prediction, ranking and performance classification. Based on the evaluation results, we selected a subset of features that have the highest impact on CTR. We believe that the findings presented in this paper will be very useful for the online advertising industry in designing high-performance creatives. It also provides the research community with the first ever data set, initial insights into visual appearance's effect on user response propensity, and evaluation benchmarks for further study.
The wisdom of advertisers: mining subgoals via query clustering BIBAFull-Text 505-514
  Takehiro Yamamoto; Tetsuya Sakai; Mayu Iwata; Chen Yu; Ji-Rong Wen; Katsumi Tanaka
This paper tackles the problem of mining subgoals of a given search goal from data. For example, when a searcher wants to travel to London, she may need to accomplish several subtasks such as "book flights," "book a hotel," "find good restaurants" and "decide which sightseeing spots to visit." As another example, if a searcher wants to lose weight, there may exist several alternative solutions such as "do physical exercise," "take diet pills," and "control calorie intake." In this paper, we refer to such subtasks or solutions as subgoals, and propose to utilize sponsored search data for finding subgoals of a given query by means of query clustering. Advertisements (ads) reflect advertisers' tremendous efforts in trying to match a given query with implicit user needs. Moreover, ads are usually associated with a particular action or transaction. We therefore hypothesized that they are useful for subgoal mining. To our knowledge, our work is the first to use sponsored search data for this purpose. Our experimental results show that sponsored search data is a good resource for obtaining related queries and for identifying subgoals via query clustering. In particular, our method that combines ad impressions from sponsored search data and query co-occurrences from session data outperforms a state-of-the-art query clustering method that relies on document clicks rather than ad impressions in terms of purity, NMI, Rand Index, F1-measure and subgoal recall.
Sequential selection of correlated ads by POMDPs BIBAFull-Text 515-524
  Shuai Yuan; Jun Wang
Online advertising has become a key source of revenue for both web search engines and online publishers. For them, the ability to allocate the right ads to the right webpages is critical because any mismatched ads would not only harm web users' satisfaction but also lower the ad income. In this paper, we study how online publishers could optimally select ads to maximize their ad incomes over time. The conventional offline, content-based matching between webpages and ads is a fine start but cannot solve the problem completely because good matching does not necessarily lead to good payoff. Moreover, with the limited display impressions, we need to balance the need of selecting ads to learn true ad payoffs (exploration) with that of allocating ads to generate high immediate payoffs based on the current belief (exploitation). In this paper, we address the problem by employing partially observable Markov decision processes (POMDPs) and discuss how to utilize the correlation of ads to improve the efficiency of the exploration and increase ad incomes in the long run. Our mathematical derivation shows that the belief states of correlated ads can be naturally updated using a formula similar to collaborative filtering. To test our model, a real world ad dataset from a major search engine is collected and categorized. Experimenting over the data, we provide an analysis of the effect of the underlying parameters, and demonstrate that our algorithms significantly outperform other strong baselines.

IR track: system architecture, distributed IR, scalability

Diversity in blog feed retrieval BIBAFull-Text 525-534
  Mostafa Keikha; Fabio Crestani; W. Bruce Croft
Blog distillation (blog feed retrieval) is a task in blog retrieval where the goal is to rank blogs according to their recurrent relevance to a query topic. One of the main properties of blog feed retrieval is that the unit of retrieval is a collection of documents as opposed to a single document as in other IR tasks. This collection retrieval nature of blog distillation introduces new challenges and requires new investigations specific to this problem.
   Researchers have addressed this problem by considering a wide range of evidence and information resources. However, previous work has not studied the effect of on-topic diversity of blog posts in blog relevance. By on-topic diversity of blog posts we mean that those posts that are about the query topic need to have high diversity and cover different sub-topics of the query.
   In this study, we investigate three types of on-topic diversity and their effect on retrieval performance: topical diversity, temporal diversity and hybrid diversity. Our experiments over different blog collections and different baseline methods show that on-topic diversity can improve the performance of the retrieval system. Among the three types of diversity, hybrid diversity, which considers both topical and temporal diversities, achieves the best performance.
Efficient retrieval of recommendations in a matrix factorization framework BIBAFull-Text 535-544
  Noam Koenigstein; Parikshit Ram; Yuval Shavitt
Low-rank Matrix Factorization (MF) methods provide one of the simplest and most effective approaches to collaborative filtering. This paper is the first to investigate the problem of efficient retrieval of recommendations in an MF framework. We reduce the retrieval in an MF model to an apparently simple task of finding the maximum dot-product for the user vector over the set of item vectors. However, to the best of our knowledge the problem of efficiently finding the maximum dot-product in the general case has never been studied. To this end, we propose two techniques for efficient search -- (i) We index the item vectors in a binary spatial-partitioning metric tree and use a simple branch-and-bound algorithm with a novel bounding scheme to efficiently obtain exact solutions. (ii) We use spherical clustering to index the users on the basis of their preferences and pre-compute recommendations only for the representative user of each cluster to obtain extremely efficient approximate solutions. We obtain a theoretical error bound which determines the quality of any approximate result and use it to control the approximation. Both these simple techniques are fairly independent of each other and hence are easily combined to further improve recommendation retrieval efficiency. We evaluate our algorithms on real-world collaborative-filtering datasets, demonstrating more than ×7 speedup (with respect to the naive linear search) for the exact solution and over ×250 speedup for approximate solutions by combining both techniques.
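For orientation, the naive linear search that the abstract uses as its baseline can be sketched as follows. This is only the brute-force reference point, not the tree-based or clustering-based techniques the paper proposes; the function name `top_k_items` and the toy vectors are hypothetical.

```python
import heapq


def top_k_items(user_vec, item_vecs, k):
    """Score every item by its dot product with the user vector and keep the k best."""
    scored = (
        (sum(u * x for u, x in zip(user_vec, vec)), item_id)
        for item_id, vec in item_vecs.items()
    )
    # nlargest scans all items once: O(n * d) work for n items of dimension d.
    return heapq.nlargest(k, scored)


user = [1.0, 2.0]
items = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, -1.0],
}
best = top_k_items(user, items, k=2)  # list of (score, item_id) pairs, best first
```

Because the cost grows linearly with the number of items for every user, this scan is the figure against which the reported speedups are measured.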
KORE: keyphrase overlap relatedness for entity disambiguation BIBAFull-Text 545-554
  Johannes Hoffart; Stephan Seufert; Dat Ba Nguyen; Martin Theobald; Gerhard Weikum
Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
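The min-hash sketches mentioned above can be illustrated with a generic estimator of Jaccard overlap between two token sets. This is a textbook min-hash sketch under assumed choices (seeded BLAKE2b hashing, 64 hash functions), not the paper's specific approximation scheme; `minhash_signature` and `estimated_jaccard` are hypothetical names.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """Return one minimum hash value per seeded hash function (tokens must be non-empty)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest(),
                    "big",
                )
                for token in tokens
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates |A & B| / |A | B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


phrases_a = {"english", "rock", "guitarist", "blues", "band"}
phrases_b = {"english", "rock", "singer", "band", "songwriter"}
similarity = estimated_jaccard(minhash_signature(phrases_a), minhash_signature(phrases_b))
```

Signatures have a fixed length regardless of set size, which is what makes them suitable as keys for locality-sensitive hashing buckets.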
Shard ranking and cutoff estimation for topically partitioned collections BIBAFull-Text 555-564
  Anagha Kulkarni; Almer S. Tigelaar; Djoerd Hiemstra; Jamie Callan
Large document collections can be partitioned into 'topical shards' to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.

KM track: advertisement and products

Daily-deal selection for revenue maximization BIBAFull-Text 565-574
  Theodoros Lappas; Evimaria Terzi
Daily-Deal Sites (DDS) like Groupon, LivingSocial, Amazon's Goldbox, and many more, have become particularly popular over the last three years, providing discounted offers to customers for restaurants, ticketed events, services etc. In this paper, we study the following problem: among a set of candidate deals, which are the ones that a DDS should feature as daily-deals in order to maximize its revenue? Our first contribution lies in providing two combinatorial formulations of this problem. Both formulations take into account factors like the diversification of daily deals and the limited consuming capacity of the userbase. We prove that our problems are NP-hard and devise pseudopolynomial-time approximation algorithms for their solution. We also propose a set of heuristics, and demonstrate their efficiency in our experiments. In the context of deal selection and scheduling, we acknowledge the importance of the ability to estimate the expected revenue of a candidate deal. We explore the nature of this task in the context of real data, and propose a framework for revenue-estimation. We demonstrate the effectiveness of our entire methodology in an experimental evaluation on a large dataset of daily-deals from Groupon.
Enabling direct interest-aware audience selection BIBAFull-Text 575-584
  Ariel Fuxman; Anitha Kannan; Zhenhui Li; Panayiotis Tsaparas
Advertisers typically have a fairly accurate idea of the interests of their target audience. However, today's online advertising systems are unable to leverage this information. The reasons are two-fold. First, there is no agreed upon vocabulary of interests for advertisers and advertising systems to communicate. More importantly, advertising systems lack a mechanism for mapping users to the interest vocabulary.
   In this paper, we tackle both problems. We present a system for direct interest-aware audience selection. This system takes the query histories of search engine users as input, extracts their interests, and describes them with interpretable labels. The labels are not drawn from a predefined taxonomy, but rather dynamically generated from the query histories, and are thus easy for the advertisers to interpret and use for targeting users. In addition, the system enables seamless addition of interest labels that may be provided by the advertiser.
Influence propagation in adversarial setting: how to defeat competition with least amount of investment BIBAFull-Text 585-594
  Shahrzad Shirazipourazad; Brian Bogard; Harsh Vachhani; Arunabha Sen; Paul Horn
It has been observed that individuals' decisions to adopt a product or innovation are often influenced by the recommendations of their friends and acquaintances. Motivated by this observation, the last few years have seen a number of studies on influence maximization in social networks. The primary goal of these studies is identification of k most influential nodes in a network. A major limitation of these studies is that they focus on a non-adversarial environment, where only one player is engaged in influencing the nodes. However, in a realistic scenario multiple players attempt to influence the nodes in a competitive fashion. The proposed model considers a competitive environment where a node that has not yet adopted an innovation, can adopt only one of the several competing innovations and once it adopts an innovation, it does not switch. The paper studies the scenario where the first player has already chosen a set of k nodes and the second player, with the knowledge of the choice of the first, attempts to identify a smallest set of nodes (excluding the ones already chosen by the first) so that when the influence propagation process ends, the number of nodes influenced by the second player is larger than the number of nodes influenced by the first.
   The paper studies two propagation models and shows that in both the models, the identification of the smallest set of nodes to defeat the adversary is NP-Hard. It provides an approximation algorithm and proves that the performance bound is tight. It also presents the results of extensive experimentation using the collaboration network data. Experimental results show that the second player can easily defeat the first with this algorithm, if the first utilizes the node degree or closeness centrality based algorithms for the selection of influential nodes. The proposed algorithm also provides better performance if the second player utilizes it instead of the greedy algorithm to maximize its influence.
Large-scale item categorization for e-commerce BIBAFull-Text 595-604
  Dan Shen; Jean-David Ruvini; Badrul Sarwar
This paper studies the problem of leveraging computationally intensive classification algorithms for large scale text categorization problems. We propose a hierarchical approach which decomposes the classification problem into a coarse level task and a fine level task. A simple yet scalable classifier is applied to perform the coarse level classification while a more sophisticated model is used to separate classes at the fine level. However, instead of relying on a human-defined hierarchy to decompose the problem, we use a graph algorithm to automatically discover groups of highly similar classes. As an illustrative example, we apply our approach to real-world industrial data from eBay, a major e-commerce site where the goal is to classify live items into a large taxonomy of categories. In such an industrial setting, classification is very challenging due to the number of classes, the amount of training data, the size of the feature space and the real-world requirements on the response time. We demonstrate through extensive experimental evaluation that (1) the proposed hierarchical approach is superior to flat models, and (2) the data-driven extraction of latent groups works significantly better than the existing human-defined hierarchy.
Matching product titles using web-based enrichment BIBAFull-Text 605-614
  Vishrawas Gopalakrishnan; Suresh Parthasarathy Iyengar; Amit Madaan; Rajeev Rastogi; Srinivasan Sengamedu
Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the Cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
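The final matching step, cosine similarity over importance-weighted tokens, can be sketched as below. The importance scores here are made-up numbers standing in for the search-derived scores the abstract describes, titles are treated as binary token sets, and the name `weighted_cosine` is hypothetical.

```python
import math


def weighted_cosine(tokens_a, tokens_b, importance):
    """Cosine similarity of two token sets, each token scaled by its importance score."""
    set_a, set_b = set(tokens_a), set(tokens_b)

    def weight(token):
        # Tokens without a score fall back to a neutral weight of 1.0 (an assumption).
        return importance.get(token, 1.0)

    dot = sum(weight(t) ** 2 for t in set_a & set_b)
    norm_a = math.sqrt(sum(weight(t) ** 2 for t in set_a))
    norm_b = math.sqrt(sum(weight(t) ** 2 for t in set_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical titles and scores: brand and model tokens count more than descriptors.
title_a = ["acme", "x100", "digital", "camera"]
title_b = ["acme", "x100", "kit", "black"]
importance = {"acme": 2.0, "x100": 3.0}
similarity = weighted_cosine(title_a, title_b, importance)
```

With uniform weights the two titles would score 0.5; weighting the shared brand and model tokens raises the similarity, which is the effect the importance scores are meant to produce.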

KM track: clustering

Scalable clustering of signed networks using balance normalized cut BIBAFull-Text 615-624
  Kai-Yang Chiang; Joyce Jiyoung Whang; Inderjit S. Dhillon
We consider the general k-way clustering problem in signed social networks where relationships between entities can be either positive or negative. Motivated by social balance theory, the clustering problem in signed networks aims to find mutually antagonistic groups such that entities within the same group are friends with each other. A recent method proposed in [13] extended the spectral clustering algorithm to the signed network setting by considering the signed graph Laplacian. This has been shown to be equivalent to finding clusters that minimize the 2-way signed ratio cut. In this paper, we show that there is a fundamental weakness when we directly extend the signed Laplacian to the k-way clustering problem. To overcome this weakness, we formulate new k-way objectives for signed networks. In particular, we propose a criterion that is analogous to the normalized cut, called balance normalized cut, which is not only theoretically sound but also experimentally effective in k-way clustering. In addition, we prove that these objectives are equivalent to weighted kernel k-means objectives by choosing an appropriate kernel matrix. Employing this equivalence, we develop a multilevel clustering framework for signed networks. In this framework, we coarsen the graph level by level and refine the clustering results at each level via a k-means based algorithm so that the signed clustering objectives are optimized. This approach gives good quality clustering results, and is also highly efficient and scalable. In experiments, we see that our multilevel approach is competitive with other state-of-the-art methods, while it is much faster and more scalable. In particular, the largest graph we have considered in our experiments contains 1 million nodes and 100 million edges -- this graph can be clustered in less than four hundred seconds using our algorithm.
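A small sketch of the signed graph Laplacian that the abstract builds on, assuming the common definition L = D - A in which D holds the sums of absolute edge weights. It covers only the 2-way case and does not implement the balance normalized cut; the helper names and the toy graph are hypothetical.

```python
def signed_laplacian(adj):
    """Return D - A, where D[i][i] is the sum of absolute edge weights at node i."""
    n = len(adj)
    return [
        [(sum(abs(w) for w in adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(matrix, x):
    """Compute x^T M x, which for the signed Laplacian penalizes unbalanced assignments."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))


# Nodes 0-1 and 2-3 are friends (+1); pairs 0-2 and 1-3 are enemies (-1).
adj = [
    [0, 1, -1, 0],
    [1, 0, 0, -1],
    [-1, 0, 0, 1],
    [0, -1, 1, 0],
]
lap = signed_laplacian(adj)
balanced = quadratic_form(lap, [1, 1, -1, -1])    # friends together, enemies apart
unbalanced = quadratic_form(lap, [1, -1, 1, -1])  # friends split, enemies together
```

The quadratic form equals the sum over edges of |a_ij| (x_i - sgn(a_ij) x_j)^2, so it is zero for the split that respects every sign and grows with each violated edge.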
Maximum margin clustering on evolutionary data BIBAFull-Text 625-634
  Xuhui Fan; Lin Zhu; Longbing Cao; Xia Cui; Yew-Soon Ong
Evolutionary data, such as topic-changing blogs and evolving trading behaviors in capital markets, is widely seen in business and social applications. The time factor and intrinsic change embedded in evolutionary data greatly challenge evolutionary clustering. To incorporate the time factor, existing methods mainly regard the evolutionary clustering problem as a linear combination of snapshot cost and temporal cost, and reflect the time factor through the temporal cost. Although these methods have produced promising results, they still face accuracy and scalability challenges. This paper proposes a novel evolutionary clustering approach, evolutionary maximum margin clustering (e-MMC), to cluster large-scale evolutionary data from the maximum margin perspective. e-MMC incorporates two frameworks: Data Integration from the data changing perspective and Model Integration corresponding to model adjustment to tackle the time factor and change, with an adaptive label allocation mechanism. Three e-MMC clustering algorithms are proposed based on the two frameworks. Extensive experiments are performed on synthetic data, UCI data and real-world blog data, which confirm that e-MMC outperforms the state-of-the-art clustering algorithms in terms of accuracy, computational cost and scalability. The results show that e-MMC is particularly suitable for clustering large-scale evolving data.
Document-topic hierarchies from document graphs BIBAFull-Text 635-644
  Tim Weninger; Yonatan Bisk; Jiawei Han
Topic taxonomies present a multi-level view of a document collection, where general topics live towards the top of the taxonomy and more specific topics live towards the bottom. Topic taxonomies allow users to quickly drill down into their topic of interest to find documents. We show that hierarchies of documents, where documents live at the inner nodes of the hierarchy-tree, can also be inferred by combining document text with inter-document links. We present a Bayesian generative model by which an explicit hierarchy of documents is created. Experiments on three document-graph data sets show that the generated document hierarchies are able to fit the observed data, and that the levels in the constructed document hierarchy represent practical groupings.
Improving document clustering using automated machine translation BIBAFull-Text 645-653
  Xiang Wang; Buyue Qian; Ian Davidson
With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language to many other languages. These translations provide different yet correlated views of the same set of documents. This gives rise to an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering provided positive answers to this question. In this work, we propose an alternative approach to address this problem using the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and can consistently improve the clustering of real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique does not need the compatibility or the conditional independence assumption, nor does it involve subtle parameter tuning.
Right-protected data publishing with hierarchical clustering preservation BIBAFull-Text 654-663
  Michail Vlachos; Aleksander Wieczorek; Johannes Schneider
The emergence of cloud-based storage services is opening up new avenues in data exchange and data dissemination. This has amplified the interest in right-protection mechanisms for establishing ownership in case of data leakage. Current right-protection technologies, however, rarely provide strong guarantees on the dataset utility after the protection process. This work presents techniques that explicitly address this shortcoming and provably preserve the outcome of certain mining operations. In particular, we take special care to guarantee that the outcome of hierarchical clustering operations remains the same before and after right protection. We encode data ownership using watermarking principles. In the process, we derive fundamental bounds on the distortion incurred by the watermarking. We leverage our theoretical analysis to design fast algorithms for right protection without exhaustively searching the vast design space.

IR track: recommendation systems

Metaphor: a system for related search recommendations BIBAFull-Text 664-673
  Azarias Reda; Yubin Park; Mitul Tiwari; Christian Posse; Sam Shah
Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members' search experience in finding relevant results to their queries. This paper describes the design, implementation, and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 175 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity. The system, which has been in live operation for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation click behavior. We also discuss some of the practical concerns in deploying related search recommendations.
Exploring personal impact for group recommendation BIBAFull-Text 674-683
  Xingjie Liu; Yuan Tian; Mao Ye; Wang-Chien Lee
Group activities are essential ingredients of people's social life. The rapid growth of online social networking services has greatly boosted group activities by providing a convenient platform for users to organize and participate in such activities. Therefore, recommender systems, as a critical component in social networking services, now face new challenges in supporting group activities. In this paper, we study the group recommendation problem, i.e., making recommendations to a group of people in social networking services. We analyze the decision making process in a group to propose a personal impact topic (PIT) model for group recommendations. The PIT model effectively identifies the group preference profile for a given group by considering the personal preferences and personal impacts of group members. Moreover, we further enhance the discovery of personal impact with social network information to obtain an extended personal impact topic (E-PIT) model. We have conducted comprehensive data analysis and evaluations on three real datasets. The results show that our proposed group recommendation techniques outperform baseline approaches.
The efficient imputation method for neighborhood-based collaborative filtering BIBAFull-Text 684-693
  Yongli Ren; Gang Li; Jun Zhang; Wanlei Zhou
As each user tends to rate a small proportion of available items, the resulting Data Sparsity issue brings significant challenges to the research of recommender systems. This issue becomes even more severe for neighborhood-based collaborative filtering methods, as there are even fewer ratings available in the neighborhood of the query item. In this paper, we aim to address the Data Sparsity issue in the context of neighborhood-based collaborative filtering. Given the (user, item) query, a set of key ratings is identified, and an auto-adaptive imputation method is proposed to fill the missing values in the set of key ratings. The proposed method can be used with any similarity metric, such as the Pearson Correlation Coefficient and Cosine-based similarity, and it is theoretically guaranteed to outperform the neighborhood-based collaborative filtering approaches. Experimental results show that the proposed method can significantly improve the accuracy of recommendations for neighborhood-based Collaborative Filtering algorithms.
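The two similarity metrics named in this abstract can be sketched as follows. This is only an illustration of the standard Pearson and cosine formulas restricted to co-rated items; it is not the paper's imputation method, and the function names and the dict-based rating layout are assumptions made for the example.

```python
import math


def co_rated(ratings_a, ratings_b):
    """Return the items present in both rating dicts, in sorted order."""
    return sorted(set(ratings_a) & set(ratings_b))


def pearson(ratings_a, ratings_b):
    """Pearson correlation over co-rated items (0.0 if undefined)."""
    items = co_rated(ratings_a, ratings_b)
    if len(items) < 2:
        return 0.0
    xs = [ratings_a[i] for i in items]
    ys = [ratings_b[i] for i in items]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den if den else 0.0


def cosine(ratings_a, ratings_b):
    """Cosine similarity over co-rated items (0.0 if undefined)."""
    items = co_rated(ratings_a, ratings_b)
    num = sum(ratings_a[i] * ratings_b[i] for i in items)
    den = math.sqrt(sum(ratings_a[i] ** 2 for i in items)) * math.sqrt(
        sum(ratings_b[i] ** 2 for i in items)
    )
    return num / den if den else 0.0
```

With sparse data the co-rated set is often tiny or empty, which is the situation an imputation step is meant to relieve; the sketch simply returns 0.0 in that case.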
Multi-faceted ranking of news articles using post-read actions BIBAFull-Text 694-703
  Deepak Agarwal; Bee-Chung Chen; Xuanhui Wang
Personalized article recommendation is important for news portals to improve user engagement. Existing work quantifies engagement primarily through click rates. We suggest that quality of recommendations may be improved by exploiting different types of "post-read" engagement signals like sharing, commenting, printing and e-mailing article links. Specifically, we propose a multi-faceted ranking problem for recommending articles, where each facet corresponds to a ranking task that seeks to maximize actions of a particular post-read type (e.g., ranking articles to maximize sharing actions). Our approach is to predict the probability that a user would take a post-read action on an article, so that articles can be ranked according to such probabilities. However, post-read actions are rare events -- enormous data sparsity makes the problem challenging. We meet the challenge by exploiting correlations across different post-read action types through a novel locally augmented tensor (LAT) model, so that the ranking performance of a particular action type can be improved by leveraging data from all other action types. Through extensive experiments, we show that our LAT model significantly outperforms a variety of state-of-the-art factor models, logistic regression and IR models.
A decentralized recommender system for effective web credibility assessment BIBAFull-Text 704-713
  Thanasis G. Papaioannou; Jean-Eudes Ranvier; Alexandra Olteanu; Karl Aberer
An overwhelming and growing amount of data is available online. The problem of untrustworthy online information is augmented by its high economic potential and its dynamic nature, e.g. transient domain names, dynamic content, etc. In this paper, we address the problem of assessing the credibility of web pages by a decentralized social recommender system. Specifically, we concurrently employ i) item-based collaborative filtering (CF) based on specific web page features, ii) user-based CF based on friend ratings and iii) the ranking of the page in search results. These factors are appropriately combined into a single assessment based on adaptive weights that depend on their effectiveness for different topics and different fractions of malicious ratings. Simulation experiments with real traces of web page credibility evaluations suggest that our hybrid approach outperforms both its constituent components and classical content-based classification approaches.

IR track: digital libraries and citation analysis

Towards an effective and unbiased ranking of scientific literature through mutual reinforcement BIBAFull-Text 714-723
  Xiaorui Jiang; Xiaoping Sun; Hai Zhuge
It is important to help researchers find valuable scientific papers from a large literature collection containing information of authors, papers and venues. Graph-based algorithms have been proposed to rank papers based on networks formed by citation and co-author relationships. This paper proposes a new graph-based ranking framework MutualRank that integrates mutual reinforcement relationships among networks of papers, researchers and venues to achieve a more synthetic, accurate and fair ranking result than previous graph-based methods. MutualRank leverages the network structure information among papers, authors, and their venues available from a literature collection dataset and sets up a unified mutual reinforcement model that involves both intra- and inter-network information for ranking papers, authors and venues simultaneously. To evaluate, we collect a set of recommended papers from websites of graduate-level computational linguistics courses of 15 top universities as the benchmark and apply different methods to estimate paper importance. The results show that MutualRank greatly outperforms the competitors including PageRank, HITS and CoRank in ranking papers as well as researchers. The experimental results also demonstrate that venues ranked by MutualRank are reasonable.
A math-aware search engine for math question answering system BIBAFull-Text 724-733
  Tam T. Nguyen; Kuiyu Chang; Siu Cheung Hui
We propose a math-aware search engine that is capable of handling both textual keywords as well as mathematical expressions. Our math feature extraction and representation framework captures the semantics of math expressions via a Finite State Machine model. We adapt the passive aggressive online learning binary classifier as the ranking model. We benchmarked our approach against three classical information retrieval (IR) strategies on math documents crawled from Math Overflow, a well-known online math question answering system. Experimental results show that our proposed approach can perform better than other methods by more than 9%.
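For readers unfamiliar with the passive aggressive learner mentioned above, a minimal sketch of the classic binary update follows. The abstract does not say how the authors adapt it for ranking, so this shows only the textbook hinge-loss update; `pa_update` and the plain-list vectors are assumptions made for the example.

```python
def pa_update(weights, features, label):
    """One passive-aggressive step for a binary label in {-1, +1}.

    Passive: if the example is classified with margin >= 1, do nothing.
    Aggressive: otherwise move the weights just far enough that the
    example reaches margin 1.
    """
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    norm_sq = sum(x * x for x in features)
    if loss == 0.0 or norm_sq == 0.0:
        return list(weights)
    tau = loss / norm_sq  # step size of the basic (unregularized) variant
    return [w + tau * label * x for w, x in zip(weights, features)]
```

Starting from `[0.0, 0.0]`, one update on features `[1.0, 2.0]` with label `+1` moves the weights to roughly `[0.2, 0.4]`, at which point the same example has margin 1 and triggers no further change.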
Contextualization using hyperlinks and internal hierarchical structure of Wikipedia documents BIBAFull-Text 734-743
  Muhammad Ali Norozi; Paavo Arvola; Arjen P. de Vries
Context surrounding hyperlinked semi-structured documents, externally in the form of citations and internally in the form of hierarchical structure, contains a wealth of useful but implicit evidence about a document's relevance. These rich sources of information should be exploited as contextual evidence. This paper proposes various methods of accumulating evidence from the context, and measures the effect of contextual evidence on retrieval effectiveness for document and focused retrieval of hyperlinked semi-structured documents.
   We propose a re-weighting model to contextualize (a) evidence from citations in a query-independent and query-dependent fashion (based on Markovian random walks) and (b) evidence accumulated from the internal tree structure of documents. The in-links and out-links of a node in the citation graph are used as external context, while the internal document structure provides internal, within-document context. We hypothesize that documents in a good context (having strong contextual evidence) should be good candidates to be relevant to the posed query, and vice versa.
   We tested several variants of contextualization and verified notable improvements in comparison with the baseline system and gold standards in the retrieval of full documents and focused elements.
Understanding book search behavior on the web BIBAFull-Text 744-753
  Jin Young Kim; Henry Feild; Marc Cartright
With the increased availability of e-books and digitized book collections, more users are searching the web for information about books. There are many online digital libraries containing book, author and subject data, which are accessed via internal search services as well as external web sites, such as Google. Although this is a common yet complex information-seeking behavior involving multiple search systems with different characteristics, little is known about how users find information in this scenario.
   In this work, we analyze web-based book search behavior using three months of logs from the Open Library, a globally accessible digital library. Our study encompasses the user behavior on web search engines and the digital library, unlike previous work which focused on institution-level digital libraries. Among our findings are (1) query characteristics and session-level behaviors are drastically different between internal and external searchers; (2) the field usage is different based on the modes of interaction -- keyword search, advanced search interface and faceted filtering; (3) users go through more iterations of faceted filtering than query reformulation. To facilitate future research on book search, we also create a book search test collection based on the log data. We then perform an evaluation of several retrieval methods, finding that field-based retrieval models have advantages over document-based models.
Temporal corpus summarization using submodular word coverage BIBAFull-Text 754-763
  Ruben Sipos; Adith Swaminathan; Pannaga Shivaswamy; Thorsten Joachims
In many areas of life, we now have almost complete electronic archives reaching back for well over two decades. This includes, for example, the body of research papers in computer science, all news articles written in the US, and most people's personal email. However, we have only rather limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in archives, we still lack methods for understanding a corpus as a whole. In this paper, we explore methods that provide a temporal summary of such corpora in terms of landmark documents, authors, and topics. In particular, we explicitly model the temporal nature of influence between documents and re-interpret summarization as a coverage problem over words anchored in time. The resulting models provide monotone sub-modular objectives for computing informative and non-redundant summaries over time, which can be efficiently optimized with greedy algorithms. Our empirical study shows the effectiveness of our approach over several baselines.
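The greedy optimization of a monotone submodular coverage objective mentioned above can be sketched as follows. This uses plain unweighted word coverage as a stand-in; the paper's objective over words anchored in time is not specified in the abstract, and `greedy_cover` and the toy data layout are assumptions made for the example.

```python
def greedy_cover(docs, budget):
    """Pick up to `budget` documents, each time taking the one that adds
    the most not-yet-covered words (ties broken by document id).

    docs: dict mapping a document id to the set of words it contains.
    Returns the chosen ids in selection order and the covered word set.
    """
    covered = set()
    chosen = []
    remaining = dict(docs)
    for _ in range(budget):
        best_id, best_gain = None, 0
        for doc_id in sorted(remaining):
            gain = len(remaining[doc_id] - covered)  # marginal coverage
            if gain > best_gain:
                best_id, best_gain = doc_id, gain
        if best_id is None:  # no document adds anything new
            break
        chosen.append(best_id)
        covered |= remaining.pop(best_id)
    return chosen, covered
```

Because coverage has diminishing returns, a document that repeats already-covered words has zero marginal gain and is never selected, which is what keeps the summary non-redundant.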

KM track: text mining

TCSST: transfer classification of short & sparse text using external data BIBAFull-Text 764-772
  Guodong Long; Ling Chen; Xingquan Zhu; Chengqi Zhang
Short & sparse text, such as search snippets, micro-blogs and product reviews, is becoming more prevalent on the web. Accurately classifying short & sparse text has emerged as an important yet challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit the external data to tackle the data sparsity issue. The transfer classifier will be learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficient, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark as well as real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods.
The generalized dirichlet distribution in enhanced topic detection BIBAFull-Text 773-782
  Karla L. Caballero; Joel Barajas; Ram Akella
We present a new, robust and computationally efficient Hierarchical Bayesian model for effective topic correlation modeling. We model the prior distribution of topics by a Generalized Dirichlet distribution (GD) rather than a Dirichlet distribution as in Latent Dirichlet Allocation (LDA). We define this model as GD-LDA. This framework captures correlations between topics, as in the Correlated Topic Model (CTM) and Pachinko Allocation Model (PAM), and is faster to infer than CTM and PAM. GD-LDA is effective at avoiding over-fitting as the number of topics is increased. As a tree model, it accommodates the most important set of topics in the upper part of the tree based on their probability mass. Thus, GD-LDA provides the ability to choose significant topics effectively. To discover topic relationships, we perform hyper-parameter estimation based on Monte Carlo EM Estimation. We provide results using Empirical Likelihood (EL) on 4 public datasets from TREC and NIPS. Then, we present the performance of GD-LDA in ad hoc information retrieval (IR) based on MAP, P@10, and Discounted Gain. We discuss an empirical comparison of the fitting time. We demonstrate significant improvement over CTM, LDA, and PAM for EL estimation. For all the IR measures, GD-LDA shows higher performance than LDA, the dominant topic model in IR. All these improvements come with only a small increase in fitting time over LDA, as opposed to CTM and PAM.
Modeling topic hierarchies with the recursive Chinese restaurant process BIBAFull-Text 783-792
  Joon Hee Kim; Dongwoo Kim; Suin Kim; Alice Oh
Topic models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet processes (HDP) are simple solutions to discover topics from a set of unannotated documents. While they are simple and popular, a major shortcoming of LDA and HDP is that they do not organize the topics into a hierarchical structure which is naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows the documents to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles and show the discovered topic hierarchies. We compare the predictive power of rCRP with LDA, HDP, and nested Chinese restaurant process (nCRP) using heldout likelihood to show that rCRP outperforms the others. We suggest two metrics that quantify the characteristics of a topic hierarchy to compare the discovered topic hierarchies of rCRP and nCRP. The results show that rCRP discovers a hierarchy in which the topics become more specialized toward the leaves, and topics in the immediate family exhibit more affinity than topics beyond the immediate family.
Two-part segmentation of text documents BIBAFull-Text 793-802
  A Deepak P.; Karthik Visweswariah; Nirmalie Wiratunga; Sadiq Sani
We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.
On the design of LDA models for aspect-based opinion mining BIBAFull-Text 803-812
  Samaneh Moghaddam; Martin Ester
Aspect-based opinion mining, which aims to extract aspects and their corresponding ratings from customer reviews, provides very useful information for customers to make purchase decisions. In the past few years several probabilistic graphical models have been proposed to address this problem, most of them based on Latent Dirichlet Allocation (LDA). While these models have a lot in common, there are some characteristics that distinguish them from each other. These fundamental differences correspond to major decisions that have been made in the design of the LDA models. While research papers typically claim that a new model outperforms the existing ones, there is normally no "one-size-fits-all" model. In this paper, we present a set of design guidelines for aspect-based opinion mining by discussing a series of increasingly sophisticated LDA models. We argue that these models represent the essence of the major published methods and allow us to distinguish the impact of various design decisions. We conduct extensive experiments on a very large real life dataset from Epinions.com (500K reviews) and compare the performance of different models in terms of the likelihood of the held-out test set and in terms of the accuracy of aspect identification and rating prediction.

IR track: formal retrieval models and learning to rank

Predicting query performance for fusion-based retrieval BIBAFull-Text 813-822
  Gad Markovits; Anna Shtok; Oren Kurland; David Carmel
Estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments is the goal of query-performance prediction methods. Post-retrieval predictors analyze the result list of the most highly ranked documents. We address the prediction challenge for retrieval approaches wherein the final result list is produced by fusing document lists that were retrieved in response to a query. To that end, we present a novel fundamental prediction framework that accounts for the special characteristics of the fusion setting; i.e., the use of intermediate retrieved lists. The framework is based on integrating prediction performed upon the final result list with that performed upon the lists that were fused to create it; prediction integration is controlled based on inter-list similarities. We empirically demonstrate the merits of various predictors instantiated from the framework. As a case in point, their prediction quality substantially transcends that of applying state-of-the-art predictors upon the final result list.
Back to the roots: a probabilistic framework for query-performance prediction BIBAFull-Text 823-832
  Oren Kurland; Anna Shtok; Shay Hummel; Fiana Raiber; David Carmel; Ofri Rom
The query-performance prediction task is estimating the effectiveness of a search performed in response to a query when no relevance judgments are available. Although there exist many effective prediction methods, these differ substantially in their basic principles, and rely on diverse hypotheses about the characteristics of effective retrieval. We present a novel fundamental probabilistic prediction framework. Using the framework, we derive and explain various previously proposed prediction methods that might seem completely different, but turn out to share the same formal basis. The derivations provide new perspectives on several predictors (e.g., Clarity). The framework is also used to devise new prediction approaches that outperform the state-of-the-art.
Learning to rank for robust question answering BIBAFull-Text 833-842
  Arvind Agarwal; Hema Raghavan; Karthik Subbian; Prem Melville; Richard D. Lawrence; David C. Gondek; James Fan
This paper aims to solve the problem of improving the ranking of answer candidates for factoid based questions in a state-of-the-art Question Answering system. We first provide an extensive comparison of 5 ranking algorithms on two datasets -- from the Jeopardy quiz show and a medical domain. We then show the effectiveness of a cascading approach, where the ranking produced by one ranker is used as input to the next stage. The cascading approach shows sizeable gains on both datasets. We finally evaluate several rank aggregation techniques to combine these algorithms, and find that Supervised Kemeny aggregation is a robust technique that always beats the baseline ranking approach used by Watson for the Jeopardy competition. We further corroborate our results on TREC Question Answering datasets.
Learning to rank by aggregating expert preferences BIBAFull-Text 843-851
  Maksims N. Volkovs; Hugo Larochelle; Richard S. Zemel
We present a general treatment of the problem of aggregating preferences from several experts into a consensus ranking, in the context where information about a target ranking is available. Specifically, we describe how such problems can be converted into a standard learning-to-rank one on which existing learning solutions can be invoked. This transformation allows us to optimize the aggregating function for any target IR metric, such as Normalized Discounted Cumulative Gain, or Expected Reciprocal Rank. When applied to crowdsourcing and meta-search benchmarks, our new algorithm improves on state-of-the-art preference aggregation methods.
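As an illustration of one target metric named above, here is a minimal sketch of Normalized Discounted Cumulative Gain. It uses one common formulation (exponential gain, log2 rank discount); the abstract does not state which variant the authors use, and the function names are assumptions made for the example.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the top-k graded relevance labels,
    given in ranked order, with gain 2^rel - 1 and discount log2(rank + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based here
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking already sorted by relevance, such as `[3, 2, 1, 0]`, scores 1.0; pushing the only relevant item from rank 1 to rank 2, as in `[0, 1]`, lowers the score to `1 / log2(3)`.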
Learning to rank duplicate bug reports BIBAFull-Text 852-861
  Jian Zhou; Hongyu Zhang
For a large and complex software system, the project team could receive a large number of bug reports. Some bug reports could be duplicates as they essentially report the same problem. It is often tedious and costly to manually check if a newly reported bug is a duplicate of an already reported bug. In this paper, we propose BugSim, a method that can automatically retrieve duplicate bug reports given a new bug report. BugSim is based on learning-to-rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on the features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports. We train the weights of features by applying the stochastic gradient descent algorithm over the training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim using more than 45,100 real bug reports of twelve Eclipse projects. The evaluation results show that the proposed method is effective. On average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms the previous state-of-the-art methods that are implemented using SVM and BM25Fext.

DB track: probabilistic and uncertain data

A model-based approach for RFID data stream cleansing BIBAFull-Text 862-871
  Zhou Zhao; Wilfred Ng
In recent years, RFID technologies have been used in many applications, such as inventory checking and object tracking. However, raw RFID data are inherently unreliable due to physical device limitations and different kinds of environmental noise. Existing work mainly focuses on RFID data cleansing in a static environment (e.g. inventory checking). It is therefore difficult to cleanse RFID data streams in a mobile environment (e.g. object tracking) using the existing solutions, which do not address the missing-data issue effectively.
   In this paper, we study how to cleanse RFID data streams for object tracking, which is a challenging problem, since a significant percentage of readings are routinely dropped. We propose a probabilistic model for object tracking in a mobile environment. We develop a Bayesian inference based approach for cleansing RFID data using the model. In order to sample data from the movement distribution, we devise a sequential sampler that cleans RFID data with high accuracy and efficiency. We validate the effectiveness and robustness of our solution through extensive simulations and demonstrate its performance by using two real RFID applications of human tracking and conveyor belt monitoring.
What is the IQ of your data transformation system? BIBAFull-Text 872-881
  Giansalvatore Mecca; Paolo Papotti; Salvatore Raunich; Donatello Santoro
Mapping and translating data across different representations is a crucial problem in information systems. Many formalisms and tools are currently used for this purpose, to the point that developers typically face a difficult question: "what is the right tool for my translation task?" In this paper, we introduce several techniques that contribute to answering this question. Among these are a fairly general definition of a data transformation system, a new and very efficient similarity measure to evaluate the outputs produced by such a system, and a metric to estimate user effort. Based on these techniques, we are able to compare a wide range of systems on many translation tasks, to gain interesting insights about their effectiveness, and, ultimately, about their "intelligence".
On the foundations of probabilistic information integration BIBAFull-Text 882-891
  Fereidoon Sadri
Information integration has been a subject of research for several decades and still remains a very active research area. Many new applications depend on or benefit from large scale integration. Examples include large research projects in life sciences, the need for data sharing among government agencies, reliance of corporations on business intelligence (which requires data integration from many heterogeneous sources), and integration of information on the web. The importance of information integration with uncertainty has been observed in recent years. Frequently, information from multiple sources is uncertain and possibly inconsistent. Further, the process of integration often depends on approximate schema mappings, another source of uncertainty. An integration system is useful only to the extent that the information it produces can be trusted. Hence, providing a measure of certainty for integrated information is of crucial importance in many applications.
   In this paper we study the problem of integration of uncertain information. We present a simple and intuitive approach to the representation and integration of uncertain information from multiple sources, and show that our integration approach coincides with a recent formalism for uncertain information integration. We extend the model to probabilistic possible-worlds, and show certain unintuitive constraints are imposed upon probabilities of possible-worlds of sources. In particular, we show the probabilities of possible worlds of a source are not independent, rather, they are dependent on probabilities of other sources. We study the problem of determining the probabilities for the result of integration. Finally, we present a practical approach to relaxing probabilistic constraints in integration.
GPU acceleration of probabilistic frequent itemset mining from uncertain databases BIBAFull-Text 892-901
  Yusuke Kozawa; Toshiyuki Amagasa; Hiroyuki Kitagawa
Uncertain databases have been widely developed to deal with the vast amount of data that contain uncertainty. To extract valuable information from uncertain databases, several methods of frequent itemset mining, one of the major data mining techniques, have been proposed. However, their performance is not satisfactory because handling uncertainty incurs high processing costs. In order to address this problem, we utilize GPGPU (General-Purpose computation on GPU). GPGPU means using a GPU (Graphics Processing Unit), which was originally designed for processing graphics, to accelerate general-purpose computation. In this paper, we propose a method of frequent itemset mining from uncertain databases using GPGPU. The main idea is to speed up probability computations by making the best use of the GPU's high parallelism and low-latency memory. We also employ an algorithm to manipulate a bitstring and data-parallel primitives to improve performance in the other parts of the method. Extensive experiments show that our proposed method is up to two orders of magnitude faster than existing methods.
Completeness of queries over SQL databases BIBAFull-Text 902-911
  Werner Nutt; Simon Razniewski
Data completeness is an important aspect of data quality. We consider a setting where databases can be incomplete in two ways: records may be missing and records may contain null values. We (i) formalize when the answer set of a query is complete in spite of such incompleteness, and (ii) introduce table completeness statements, by which one can express that certain parts of a database are complete. We then study how to deduce from a set of table completeness statements that a query can be answered completely.
   Null values as used in SQL are ambiguous. They can indicate either that no attribute value exists or that a value exists, but is unknown. We study completeness reasoning for the different interpretations. We show that in the combined case it is necessary to syntactically distinguish between different kinds of null values and present an encoding for doing that in standard SQL databases. With this technique, any SQL DBMS evaluates complete queries correctly with respect to the different meanings that nulls can carry. We study the complexity of completeness reasoning and provide algorithms that in most cases agree with the worst-case lower bounds.

DB track: top-k and nearest neighbor queries

Being picky: processing top-k queries with set-defined selections BIBAFull-Text 912-921
  Aleksandar Stupar; Sebastian Michel
Focusing on the top-K items according to a ranking criterion constitutes an important functionality in many different query answering scenarios. The idea is to read only the necessary information -- mostly from secondary storage -- with the ultimate goal of achieving low latency. In this work, we consider processing such top-K queries under the constraint that the result items are members of a specific set, which is provided at query time. We call this restriction a set-defined selection criterion. Set-defined selections drastically influence the pros and cons of an id-ordered index vs. a score-ordered index. We present a mathematical model that makes it possible to decide at runtime which index to choose, leading to a combined index. To improve the latency around the break-even point of the two indices, we show how to benefit from a partitioned score-ordered index and present an algorithm to create such partitions based on analyzing query logs. Further performance gains can be obtained using approximate top-K results, with tunable result quality. The presented approaches are evaluated using both real-world and synthetic data.
Finding top k most influential spatial facilities over uncertain objects BIBAFull-Text 922-931
  Liming Zhan; Ying Zhang; Wenjie Zhang; Xuemin Lin
Uncertainty is inherent in many important applications, such as location-based services (LBS), sensor monitoring and radio-frequency identification (RFID). Recently, considerable research efforts have been put into the field of uncertainty-aware spatial query processing. In this paper, we study the problem of finding top k most influential facilities over a set of uncertain objects, which is an important spatial query in the above applications. Based on the maximal utility principle, we propose a new ranking model to identify the top k most influential facilities, which carefully captures influence of facilities on the uncertain objects. By utilizing two uncertain object indexing techniques, R-tree and U-Quadtree, effective and efficient algorithms are proposed following the filtering and verification paradigm, which significantly improves the performance of the algorithms in terms of CPU and I/O costs. Comprehensive experiments on real datasets demonstrate the effectiveness and efficiency of our techniques.
Efficient safe-region construction for moving top-K spatial keyword queries BIBAFull-Text 932-941
  Weihuang Huang; Guoliang Li; Kian-Lee Tan; Jianhua Feng
Many real-world applications need to support moving spatial keyword queries. For example, a tourist looks for the top-k "seafood restaurants" while walking in a city. She will continuously issue moving queries. However, existing spatial keyword search methods focus on static queries, which calls for new effective techniques to support moving queries efficiently. In this paper we propose an effective method to support moving top-k spatial keyword queries. In addition to finding the top-k answers of a moving query, we also calculate a safe region such that if a new query has a location falling in the safe region, we can directly use the answer set to answer the query. To this end, we propose an effective model to represent the safe region and devise efficient search algorithms to compute the safe region. We have implemented our method, and experimental results on real datasets show that our method achieves high efficiency and outperforms existing methods significantly.
Monochromatic and bichromatic reverse nearest neighbor queries on land surfaces BIBAFull-Text 942-951
  Da Yan; Zhou Zhao; Wilfred Ng
Finding reverse nearest neighbors (RNNs) is an important operation in spatial databases. The problem of evaluating RNN queries has already received considerable attention due to its importance in many real-world applications, such as resource allocation and disaster response. While RNN query processing has been extensively studied in Euclidean space, no prior work has studied this problem on land surfaces. However, practical applications of RNN queries involve terrain surfaces that constrain object movements, which renders the existing algorithms inapplicable.
   In this paper, we investigate the evaluation of two types of RNN queries on land surfaces: monochromatic RNN (MRNN) queries and bichromatic RNN (BRNN) queries. On a land surface, the distance between two points is calculated as the length of the shortest path along the surface. However, the computational cost of the state-of-the-art shortest path algorithm on a land surface is quadratic in the size of the surface model, which is usually very large. As a result, surface RNN query processing is a challenging problem.
   Leveraging some newly-discovered properties of Voronoi cell approximation structures, we make use of standard index structures such as an R-tree to design efficient algorithms that accelerate the evaluation of MRNN and BRNN queries on land surfaces. Our proposed algorithms are able to localize query evaluation by accessing just a small fraction of the surface data near the query point, which helps avoid shortest path evaluation on a large surface. Extensive experiments are conducted on large real-world datasets to demonstrate the efficiency of our algorithms.
Pay-as-you-go maintenance of precomputed nearest neighbors in large graphs BIBAFull-Text 952-961
  Tom Crecelius; Ralf Schenkel
An important building block of many graph applications such as searching in social networks, keyword search in graphs, and retrieval of linked documents is retrieving the transitive neighbors of a node in ascending order of their distances. Since large graphs cannot be kept in memory and graph traversals at query time would be prohibitively expensive, the list of neighbors for each node is usually precomputed and stored in a compact form. While the problem of precomputing all-pairs shortest distances has been well studied for decades, efficiently maintaining this information when the graph changes is not as well understood. This paper presents an algorithm for maintaining nearest neighbor lists in weighted graphs under node insertions and decreasing edge weights. It considers the important case where queries are a lot more frequent than updates, and presents two approaches for transparently performing necessary index updates while executing queries. Extensive experiments with large graphs, including a subset of Twitter's user graph, demonstrate that the overhead for this maintenance is small.

KM track: spatial and temporal methods

Spatial influence vs. community influence: modeling the global spread of social media BIBAFull-Text 962-971
  Krishna Y. Kamath; James Caverlee; Zhiyuan Cheng; Daniel Z. Sui
In this paper we seek to understand and model the global spread of social media. How does social media spread from location to location across the globe? Can we model this spread and predict where social media will be popular in the future? Toward answering these questions, we develop a probabilistic model that synthesizes two conflicting hypotheses about the nature of online information spread: (i) the spatial influence model, which asserts that social media spreads to locations that are close by; and (ii) the community affinity influence model, which asserts that social media spreads between locations that are culturally connected, even if they are distant. Based on the geospatial footprint of 755 million geo-tagged hashtags spread through Twitter, we evaluate these models at predicting locations that will adopt hashtags in the future. We find that distance is the single most important explanation of future hashtag adoption since hashtags are fundamentally local. We also find that community affinities (like culture, language, and common interests) enhance the quality of purely spatial models, indicating the necessity of incorporating non-spatial features into models of global social media spread.
TUT: a statistical model for detecting trends, topics and user interests in social media BIBAFull-Text 972-981
  Xuning Tang; Christopher C. Yang
The rapid development of online social media sites is accompanied by the generation of tremendous amounts of web content. Web users are shifting from data consumers to data producers. As a result, topic detection and tracking without taking users' interests into account is not enough. This paper presents a statistical model that can detect interpretable trends and topics from document streams, where each trend (short for trending story) corresponds to a series of continuing events or a storyline. A topic is represented by a cluster of frequently co-occurring words. A trend can contain multiple topics and a topic can be shared by different trends. In addition, by leveraging a Recurrent Chinese Restaurant Process (RCRP), the number of trends in our model can be determined automatically without human intervention, so that our model can better generalize to unseen data. Furthermore, our proposed model incorporates user interest to fully simulate the generation process of web content, which offers the opportunity for personalized recommendation in online social media. Experiments on three different datasets indicated that our proposed model can capture meaningful topics and trends, monitor the rise and fall of detected trends, outperform a baseline approach in terms of perplexity on a held-out dataset, and improve the result of user participation prediction by leveraging users' interests in different trends.
Predicting aggregate social activities using continuous-time stochastic process BIBAFull-Text 982-991
  Shu Huang; Min Chen; Bo Luo; Dongwon Lee
How to accurately model and predict the future status of social networks has become an important problem in recent years. Conventional solutions to such a problem often employ the topological structure of the sociogram, i.e., friendship links. However, they often disregard the different levels of activeness of social actors and are insufficient to deal with the complex dynamics of user behaviors. In this paper, to address this issue, we first refine the notion of social activity to better describe dynamic user behaviors in social networks. We then propose a Parameterized Social Activity Model (PSAM) using a continuous-time stochastic process for predicting aggregate social activities. With social activities evolving over time, PSAM itself also evolves and therefore dynamically captures the real-time characteristics of the current active population. Our experiments using two real social networks (Facebook and CiteSeer) reveal that the proposed PSAM model is effective in simulating social activity evolution and predicting aggregate social activities accurately at different time scales.
Acquiring temporal constraints between relations BIBAFull-Text 992-1001
  Partha Pratim Talukdar; Derry Wijaya; Tom Mitchell
We consider the problem of automatically acquiring knowledge about the typical temporal orderings among relations (e.g., actedIn (person, film) typically occurs before wonPrize (film, award)), given only a database of known facts (relation instances) without time information, and a large document collection. Our approach is based on the conjecture that the narrative order of verb mentions within documents correlates with the temporal order of the relations they represent. We propose a family of algorithms based on this conjecture, utilizing a corpus of 890m dependency parsed sentences to obtain verbs that represent relations of interest, and utilizing Wikipedia documents to gather statistics on narrative order of verb mentions. Our proposed algorithm, GraphOrder, is a novel and scalable graph-based label propagation algorithm that takes transitivity of temporal order into account, as well as these statistics on narrative order of verb mentions. This algorithm achieves as high as 38.4% absolute improvement in F1 over a random baseline. Finally, we demonstrate the utility of this learned general knowledge about typical temporal orderings among relations, by showing that these temporal constraints can be successfully used by a joint inference framework to assign specific temporal scopes to individual facts.

IR track: web search

Towards optimum query segmentation: in doubt without BIBAFull-Text 1015-1024
  Matthias Hagen; Martin Potthast; Anna Beyer; Benno Stein
Query segmentation is the problem of identifying those keywords in a query that together form compound concepts or phrases like "new york times". Such segments can help a search engine to better interpret a user's intents and to tailor the search results more appropriately. Our contributions to this problem are threefold. (1) We conduct the first large-scale study of human segmentation behavior based on more than 500000 segmentations. (2) We show that the traditionally applied segmentation accuracy measures are not appropriate for such large-scale corpora and introduce new, more robust measures. (3) We develop a new query segmentation approach with the basic idea that, in cases of doubt, it is often better to (partially) leave queries without any segmentation.
   This new in-doubt-without approach chooses different segmentation strategies depending on query types. A large-scale evaluation shows substantial improvement upon the state of the art in terms of segmentation accuracy. To draw a complete picture, we also evaluate the impact of segmentation strategies on retrieval performance in a TREC setting. It turns out that more accurate segmentation does not necessarily yield better retrieval performance. Based on this insight, we propose an in-doubt-without variant which achieves the best retrieval performance despite leaving many queries unsegmented. But there is still room for improvement: the optimum segmentation strategy, which always chooses the segmentation that maximizes retrieval performance, significantly outperforms all other tested approaches.
Leaving so soon?: understanding and predicting web search abandonment rationales BIBAFull-Text 1025-1034
  Abdigani Diriye; Ryen White; Georg Buscher; Susan Dumais
Users of search engines often abandon their searches. Despite the high frequency of Web search abandonment and its importance to Web search engines, little is known about why searchers abandon beyond that it can be for good or bad reasons. In this paper, we extend previous work by studying search abandonment using both a retrospective survey and an in-situ method that captures abandonment rationales at abandonment time. We show that although satisfaction is a common motivator for abandonment, one in five abandonment instances does not relate to satisfaction. We also studied the automatic prediction of the underlying reason for observed abandonment. We used features of the query and the results, interaction with the result page (e.g., cursor movements, scrolling, clicks), and the full search session. We show that our classifiers can learn to accurately predict the reasons for observed search abandonment. Such accurate predictions help search providers estimate user satisfaction for queries without clicks, affording a more complete understanding of search engine performance.
Click patterns: an empirical representation of complex query intents BIBAFull-Text 1035-1044
  Huizhong Duan; Emre Kiciman; ChengXiang Zhai
Understanding users' search intents is a critical component of modern search engines. A key limitation of most query log analyses is the assumption that each clicked web result represents one unique intent. However, there are many search tasks, such as comparison shopping or in-depth research, where a user's intent is to explore many documents. In these cases, the assumption of a one-to-one correspondence between clicked documents and user intent breaks down.
   To capture and understand such behaviors, we propose the use of click patterns. Click patterns capture the relationship among clicks on search results by treating the set of clicks made by a user as a single unit. We aggregate click patterns together using a hierarchical clustering algorithm to discover the common click patterns. By using click patterns as an empirical representation of user intent, we are able to create a rich representation of mixtures of multiple navigational and informational intents. We analyze real search logs and demonstrate that such complex mixtures of intents do occur in the wild and can be identified using click patterns.
   We further demonstrate the usefulness of click patterns by integrating them into a measure of query ambiguity and into a query recommendation task. We show that calculating query ambiguity as the entropy over the distribution of click patterns provides a measure of ambiguity with improved discriminative power, consistency and temporal stability as compared to previous measures of ambiguity. We explore the use of click pattern similarity and click pattern entropy in generating query recommendations and show promising results.
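   As a rough illustration of the entropy-based ambiguity measure described in this abstract, the sketch below treats the set of results clicked in one session as a single click pattern and computes the Shannon entropy over pattern frequencies. The function name, the frozenset representation, and the toy logs are assumptions made for illustration; the hierarchical clustering step is omitted, and this is not the authors' implementation.

```python
import math
from collections import Counter

def click_pattern_entropy(sessions):
    """Shannon entropy (in bits) over the distribution of click patterns.

    Each session is an iterable of clicked result identifiers. The set of
    clicks in a session is treated as one unit (a click pattern), so click
    order within a session is ignored.
    """
    patterns = Counter(frozenset(s) for s in sessions)
    total = sum(patterns.values())
    # Sum of p * log2(1/p) over observed patterns; 0 for an empty log.
    return sum((c / total) * math.log2(total / c) for c in patterns.values())

# Hypothetical toy logs: clicked URLs per search session for two queries.
navigational = [["a.com"], ["a.com"], ["a.com"], ["a.com"]]
mixed = [["a.com"], ["b.com", "c.com"], ["a.com"], ["c.com", "b.com"]]

low = click_pattern_entropy(navigational)   # every session shows the same pattern
high = click_pattern_entropy(mixed)         # two patterns, equally frequent
```

   Under this simplification, a query whose sessions all show the same pattern gets the minimum score, while a query split across several patterns scores higher.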
Domain dependent query reformulation for web search BIBAFull-Text 1045-1054
  Van Dang; Giridhar Kumaran; Adam Troy
Query reformulation has been studied as a domain independent task. Existing work attempts to expand a query or substitute its terms with the same set of candidates regardless of the domain of this query. Since terms might be semantically related in one domain but not in others, it is more effective to provide candidates for queries with respect to their domain. This paper demonstrates the advantage of this domain dependent query reformulation approach, which learns its candidates, using a standard technique, for each domain from a separate sample of data derived automatically from a generic query log. Our results show that our approach statistically significantly outperforms the domain independent approach, which learns to reformulate from the same log using the same technique, on a large query set consisting of both health and commerce queries. Our results have a very practical interpretation: while building different reformulation systems to handle queries from different domains does not require additional manual effort, it provides substantially better retrieval effectiveness than having a single system handle all queries. Additionally, we show that leveraging domain specific manually labelled data leads to further improvement.

DB track: web data management

An automatic blocking mechanism for large-scale de-duplication tasks BIBAFull-Text 1055-1064
  Anish Das Sarma; Ankur Jain; Ashwin Machanavajjhala; Philip Bohannon
De-duplication -- identification of distinct records referring to the same real-world entity -- is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a hash function, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges.
   CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), imposing disjointness of blocks (in a grid environment), or specifying a particular objective function trading off recall for efficiency. As a post-processing step on automatically generated blocks, CBLOCK rolls up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets from a commercial search engine -- consisting of over 140K movies and 40K restaurants respectively -- and demonstrate the utility of CBLOCK.
Processing continuous text queries featuring non-homogeneous scoring functions BIBAFull-Text 1065-1074
  Nelly Vouzoukidou; Bernd Amann; Vassilis Christophides
In this work we are interested in the scalable processing of content filtering queries over text item streams. In particular, we aim to generalize state-of-the-art solutions with non-homogeneous scoring functions combining query-independent item importance with query-dependent content relevance. While such complex ranking functions are widely used in web search engines, this is, to our knowledge, the first scientific work studying their usage in a continuous query scenario. Our main contribution consists in the definition and evaluation of new efficient in-memory data structures for indexing continuous top-k queries based on an original two-dimensional representation of text queries. We explore locally-optimal score bounds and heuristics that efficiently prune the search space of candidate top-k query results which have to be updated at the arrival of new stream items. Finally, we experimentally evaluate memory/matching time trade-offs of these index structures. In particular, we experimentally illustrate their linear scaling behavior with respect to the number of indexed queries.
Comprehension-based result snippets BIBAFull-Text 1075-1084
  Abhijith Kashyap; Vagelis Hristidis
Result snippets are used by most search interfaces to preview query results. Snippets help users quickly decide the relevance of the results, thereby reducing the overall search time and effort. Most work on snippets has focused on text snippets for Web pages in Web search. However, little work has studied the problem of snippets for structured data, e.g., product catalogs. Furthermore, all existing work has focused on the important goal of creating informative snippets, but has ignored the amount of user effort required to comprehend, i.e., read and digest, the displayed snippets. In particular, it implicitly assumes that the comprehension effort or cost only depends on the length of the snippet, which we show is incorrect for structured data. We propose novel techniques to construct snippets of structured heterogeneous results, which not only select the most informative attributes for each result, but also minimize the expected user effort (time) to comprehend these snippets. We create a comprehension model to quantify the effort incurred by users in comprehending a list of result snippets. Our model is supported by an extensive user study. A key observation is that the user effort for comprehending an attribute across multiple snippets only depends on the number of unique positions (e.g., indentations) where this attribute is displayed and not on the number of occurrences. We analyze the complexity of the snippet construction problem and show that the problem is NP-hard, even when we only consider the comprehension cost. We present efficient approximate algorithms, and experimentally demonstrate their effectiveness and efficiency.
An effective rule miner for instance matching in a web of data BIBAFull-Text 1085-1094
  Xing Niu; Shu Rong; Haofen Wang; Yong Yu
Publishing structured data and linking them to Linking Open Data (LOD) is an ongoing effort to create a Web of data. Each newly involved data source may contain duplicated instances (entities) whose descriptions or schemata differ from those of the existing sources in LOD. To tackle this heterogeneity issue, several matching methods have been developed to link equivalent entities together. Many general-purpose matching methods which focus on similarity metrics suffer from very diverse matching results for different data source pairs. On the other hand, the dataset-specific ones leverage heuristic rules or even manual efforts to ensure the quality, which makes it impossible to apply them to other sources or domains. In this paper, we offer a third choice, a general method of automatically discovering dataset-specific matching rules. In particular, we propose a semi-supervised learning algorithm to iteratively refine matching rules and find new matches of high confidence based on these rules. This dramatically relieves the burden on users of defining rules but still gives high-quality matching results. We carry out experiments on real-world large scale data sources in LOD; the results show the effectiveness of our approach in terms of the precision of discovered matches and the number of missing matches found. Furthermore, we discuss several extensions (like similarity embedded rules, class restriction and SPARQL rewriting) to fit various applications with different requirements.

KM track: information extraction

Non-stationary Bayesian networks based on perfect simulation BIBAFull-Text 1095-1104
  Yi Jia; Wenrong Zeng; Jun Huan
Non-stationary Dynamic Bayesian Networks (Non-stationary DBNs) are widely used to model the temporal changes of directed dependency structures from multivariate time series data. However, the existing change-point-based non-stationary DBN methods have several drawbacks, including excessive computational cost and low convergence speed. In this paper we propose a novel non-stationary DBN method. Our method is based on the perfect simulation model. We applied this approach to network structure inference from synthetic data and biological microarray gene expression data and compared it with two other state-of-the-art non-stationary DBN methods. The experimental results demonstrated that our method outperformed the two other state-of-the-art methods in both computational cost and structure prediction accuracy. Further sensitivity analysis showed that, once converged, our model is robust to large parameter ranges, which reduces the uncertainty of the model behavior.
Active learning for relation type extension with local and global data views BIBAFull-Text 1105-1112
  Ang Sun; Ralph Grishman
Relation extraction is the process of identifying instances of specified types of semantic relations in text; relation type extension involves extending a relation extraction system to recognize a new type of relation. We present LGCo-Testing, an active learning system for relation type extension based on local and global views of relation instances. Locally, we extract features from the sentence that contains the instance. Globally, we measure the distributional similarity between instances from a 2 billion token corpus. Evaluation on the ACE 2004 corpus shows that LGCo-Testing can reduce annotation cost by 97% while maintaining the performance level of supervised learning.
Segmenting web-domains and hashtags using length specific models BIBAFull-Text 1113-1122
  Sriram Srinivasan; Sourangshu Bhattacharya; Rudrasis Chakraborty
Segmentation of a string of English language characters into a sequence of words has many applications. Here, we study two applications in the internet domain. The first application is web domain segmentation, which is crucial for monetization of broken URLs. Second, we propose and study a novel application of Twitter hashtag segmentation for increasing recall on Twitter searches. Existing methods for word segmentation use unsupervised language models. We find that when using multiple corpora, the joint probability model from multiple corpora performs significantly better than the individual corpora. Motivated by this, we propose a weighted joint probability model, with weights specific to each corpus. We propose to train the weights in a supervised manner using max-margin methods. The supervised probability models improve segmentation accuracy over joint probability models. Finally, we observe that the length of segments is an important parameter for word segmentation, and incorporate length-specific weights into our model. The length-specific models further improve segmentation accuracy over supervised probability models. For all models proposed here, the inference problem can be solved using a dynamic programming algorithm. We test our methods on five different datasets, two from web domains data, and three from news headlines data from an LDC dataset. The supervised length-specific models show significant improvements over unsupervised single corpus and joint probability models. Cross-testing between the datasets confirms that supervised probability models trained on all datasets, and length-specific models trained on news headlines data, generalize well. Segmentation of hashtags results in significant improvement in recall on searches for Twitter trends.
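   The dynamic programming inference mentioned in this abstract can be sketched for the simplest case, a single unigram model. The sketch below is an assumption-laden illustration: the function name, the `max_len` cutoff, and the toy probabilities are hypothetical, and it does not implement the weighted joint or length-specific models the paper proposes.

```python
import math

def segment(text, word_logprob, max_len=20):
    """Return the highest-scoring split of `text` into known words, or None.

    best[i] holds (score, words) for the best segmentation of text[:i],
    where score is the sum of the log-probabilities of the chosen words.
    """
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Skip if the prefix is unreachable or the word is unknown.
            if best[start] is None or word not in word_logprob:
                continue
            score = best[start][0] + word_logprob[word]
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] is not None else None

# Hypothetical unigram probabilities; a real model would be estimated from corpora.
toy_probs = {"new": 0.02, "york": 0.01, "times": 0.01,
             "time": 0.015, "s": 0.001, "newyork": 0.0001}
toy_logprob = {w: math.log(p) for w, p in toy_probs.items()}

words = segment("newyorktimes", toy_logprob)
```

   A corpus-weighted or length-specific variant would change only how the per-word score is computed; the table-filling loop stays the same.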
Crosslingual distant supervision for extracting relations of different complexity 1123-1132
  Andre Blessing; Hinrich Schütze
We propose crosslingual distant supervision (crosslingual DS) for relation extraction, an approach that automatically extracts labels from a pivot language for labeling one or more target languages. The approach has two benefits compared to standard DS: (i) increased coverage if target language labels are not available; and (ii) higher accuracy of automatically generated labels because noisy labels are eliminated in crosslingual filtering. An evaluation for two relations of different complexity shows that crosslingual DS increases the accuracy of relation extraction. Our approach is language independent; we successfully apply it to four different languages: Chinese, English, French and German.
Labeling by landscaping: classifying tokens in context by pruning and decorating trees 1133-1142
  Siddharth Patwardhan; Branimir Boguraev; Apoorv Agarwal; Alessandro Moschitti; Jennifer Chu-Carroll
State-of-the-art approaches to token labeling within text documents typically cast the problem either as a classification task, without using complex structural characteristics of the input, or as a sequential labeling task, carried out by a Conditional Random Field (CRF) classifier. Here we explore principled ways for structure to be brought to bear on the task. In line with recent trends in statistical learning of structured natural language input, we use a Support Vector Machine (SVM) classification framework deploying tree kernels. We then propose tree transformations and decorations, as a methodology for modeling complex linguistic phenomena in highly multi-dimensional feature spaces. We develop a general purpose tree engineering framework, which enables us to transcend the typically complex and laborious process of feature engineering. We build kernel based classifiers for two token labeling tasks: fine-grained event recognition, and lexical answer type detection in questions. For both, we show that in comparison with a corresponding linear kernel SVM, our method of using tree kernels improves recognition, thanks to appropriately engineering tree structures for use by the tree kernel. We also observe significant improvements when comparing with a CRF-based realization of structured prediction, itself performing at levels comparable to state-of-the-art.

IR track: topic modeling and content and sentiment analysis

G-WSTD: a framework for geographic web search topic discovery 1143-1152
  Di Jiang; Jan Vosecky; Kenneth Wai-Ting Leung; Wilfred Ng
Search engine query logs are an important information source that captures the interests and information needs of millions of users. In this paper, we tackle the problem of discovering latent geographic search topics by mining search engine query logs. A novel framework, G-WSTD, which comprises search session derivation, geographic information extraction and geographic search topic discovery, is developed to support a variety of downstream web applications. The core components of the framework are two topic models, which discover geographic search topics from two different perspectives. The first one is the Discrete Search Topic Model (DSTM), which aims to capture the semantic commonalities across discrete geographic locations. The second one is the Regional Search Topic Model (RSTM), which focuses on a specific region on the map and discovers web search topics that demonstrate geographic locality. We evaluate our framework against several strong baselines on a real-life query log. In the experiments, the framework demonstrates improved data interpretability, better prediction performance and higher topic distinctiveness. The effectiveness of the framework is also verified by applications such as user profiling and URL annotation.
Supporting factual statements with evidence from the web 1153-1162
  Chee Wee Leong; Silviu Cucerzan
Fact verification has become an important task due to the increased popularity of blogs, discussion groups, and social sites, as well as of encyclopedic collections that aggregate content from many contributors. We investigate the task of automatically retrieving supporting evidence from the Web for factual statements. Using Wikipedia as a starting point, we derive a large corpus of statements paired with supporting Web documents, which we employ further as training and test data under the assumption that the contributed references to Wikipedia represent some of the most relevant Web documents for supporting the corresponding statements. Given a factual statement, the proposed system first transforms it into a set of semantic terms by using machine learning techniques. It then employs a quasi-random strategy for selecting subsets of the semantic terms according to topical likelihood. These semantic terms are used to construct queries for retrieving Web documents via a Web search API. Finally, the retrieved documents are aggregated and re-ranked by employing additional measures of their suitability to support the factual statement. To gauge the quality of the retrieved evidence, we conduct a user study through Amazon Mechanical Turk, which shows that our system is capable of retrieving supporting Web documents comparable to those chosen by Wikipedia contributors.
Role-explicit query identification and intent role annotation 1163-1172
  Haitao Yu; Fuji Ren
Understanding the information need or intent encoded within a query has long been regarded as an essential factor of effective information retrieval. For better query representation and understanding, two intent roles (kernel-object and modifier) are introduced to structurally parse a class of role-explicit queries, which constitute a majority of common user queries. Furthermore, we focus on two research problems: RP-1: Given a role-explicit query, how to identify the kernel-object and modifier, namely intent role annotation; RP-2: How to determine whether an arbitrary query is role-explicit or not. To solve RP-1, we propose a simplified word n-gram role model (SWNR), which quantifies the generating probability of a role-explicit query and performs intent role annotation effectively. Using a set of discriminative features, we build classifiers to address RP-2 in a supervised manner. The experimental results show that: (1) SWNR achieves satisfactory performance, more than 73% in terms of different metrics; (2) the classifiers achieve more than 90% precision in identifying role-explicit queries; (3) compared with traditional techniques for query representation and understanding, e.g., named entity recognition in queries and class-level query intent inference, intent role annotation provides a more flexible framework, and a number of applications, such as intent mining and diversified document ranking, can benefit from annotating role-explicit queries.
Joint topic modeling for event summarization across news and social media streams 1173-1182
  Wei Gao; Peng Li; Kareem Darwish
Social media streams such as Twitter are regarded as faster first-hand sources of information generated by large numbers of users. The content diffused through this channel, although noisy, provides an important complement to, and sometimes even a substitute for, traditional news media reporting. In this paper, we propose a novel unsupervised approach based on topic modeling to summarize trending subjects by jointly discovering the representative and complementary information from news and tweets. Our method captures the content that enriches the subject matter by reinforcing the identification of complementary sentence-tweet pairs. To evaluate the complementarity of a pair, we leverage topic modeling formalism by combining a two-dimensional topic-aspect model and a cross-collection approach from the multi-document summarization literature. The final summaries are generated by co-ranking the news sentences and tweets on both sides simultaneously. Experiments give promising results compared to state-of-the-art baselines.

DB track: query processing, optimization and performance

CGStream: continuous correlated graph query for data streams 1183-1192
  Shirui Pan; Xingquan Zhu
In this paper, we propose to query correlated graphs in a data stream scenario, where, given a query graph q, an algorithm is required to retrieve all the subgraphs whose Pearson's correlation coefficients with q are greater than a threshold θ over graph data arriving as a stream. Due to the dynamically changing nature of stream data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. We propose a novel algorithm, CGStream, to identify correlated graphs from a data stream by using a sliding window that covers a number of consecutive batches of stream data records. Our approach is to regard a stream query as a traversal along the data stream, with the query evaluated at a number of outlooks over the stream. For each outlook, we derive a lower frequency bound to mine a set of frequent subgraph candidates, where the lower bound guarantees that no pattern is missed from the current outlook to the next outlook. On top of that, we derive an upper correlation bound and a heuristic rule to prune the candidate set, which helps reduce the computation cost at each outlook. Experimental results demonstrate that the proposed algorithm is several times, or even an order of magnitude, more efficient than the straightforward algorithm. Meanwhile, our algorithm achieves good performance in terms of query precision.
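To make the correlation criterion in the abstract above concrete, here is a minimal sketch of Pearson's correlation between two graph patterns treated as binary occurrence variables over the records of one window. It assumes the occurrence sets (the IDs of the window records that contain each pattern) have already been computed by some subgraph-matching step, which is not shown; the function and parameter names are hypothetical, and none of the paper's frequency or correlation bounds is reproduced.

```python
import math


def pearson_correlation(occ_q, occ_g, window_size):
    """Pearson's correlation of two patterns seen as binary occurrence
    variables over `window_size` graph records.

    occ_q, occ_g: sets of record IDs (assumed precomputed) that contain
    the query graph q and the candidate subgraph g respectively.
    """
    sup_q = len(occ_q) / window_size
    sup_g = len(occ_g) / window_size
    sup_joint = len(occ_q & occ_g) / window_size
    denom = math.sqrt(sup_q * sup_g * (1 - sup_q) * (1 - sup_g))
    if denom == 0:
        # A pattern that occurs in no record or in every record has no
        # variance; this sketch reports such pairs as uncorrelated.
        return 0.0
    return (sup_joint - sup_q * sup_g) / denom


def correlated_candidates(candidates, occ_q, window_size, theta):
    """Names of candidates whose correlation with q exceeds theta."""
    return sorted(
        name
        for name, occ in candidates.items()
        if pearson_correlation(occ_q, occ, window_size) > theta
    )
```

A sliding-window query would recompute the occurrence sets as batches enter and leave the window and then call `correlated_candidates` at each evaluation point.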
Efficient influence-based processing of market research queries 1193-1202
  Anastasios Arvanitis; Antonios Deligiannakis; Yannis Vassiliou
The rapid growth of the social web has contributed vast amounts of user preference data. Analyzing this data and its relationships with products could have several practical applications, such as personalized advertising, market segmentation, and product feature promotion. In this work we develop novel algorithms for efficiently processing two important classes of queries involving user preferences, i.e., potential customer identification and product positioning. With regard to the first problem, we formulate product attractiveness based on the notion of reverse skyline queries. We then present a new algorithm, termed RSA, that significantly reduces the I/O cost, as well as the computation cost, when compared to the state-of-the-art reverse skyline algorithm, while at the same time being able to quickly report the first results. Several real-world applications require processing of a large number of queries, in order to identify the product characteristics that maximize the number of potential customers. Motivated by this problem, we also develop a batched extension of our RSA algorithm that significantly improves upon processing multiple queries individually, by grouping contiguous candidates, exploiting I/O commonalities and enabling shared processing. Our experimental study using both real and synthetic data sets demonstrates the superiority of our proposed algorithms for the studied classes of queries.
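The notion of a reverse skyline used in the abstract above can be sketched with a naive quadratic check. This is not the RSA algorithm and it ignores I/O cost entirely. It assumes products and customer preferences are numeric tuples of equal length, and that a customer counts as a potential customer of a new product when no existing product is at least as close to the customer's preference in every dimension and strictly closer in at least one. All names are hypothetical.

```python
def dynamically_dominates(a, b, ref):
    """True if point a is at least as close to ref as b in every
    dimension and strictly closer in at least one."""
    dist_a = [abs(x - r) for x, r in zip(a, ref)]
    dist_b = [abs(x - r) for x, r in zip(b, ref)]
    no_worse = all(p <= q for p, q in zip(dist_a, dist_b))
    strictly_better = any(p < q for p, q in zip(dist_a, dist_b))
    return no_worse and strictly_better


def potential_customers(query, products, customers):
    """Customers for whom `query` lies in their dynamic skyline, i.e. no
    existing product dynamically dominates it (naive O(|C| * |P|) scan)."""
    return [
        c
        for c in customers
        if not any(dynamically_dominates(p, query, c) for p in products)
    ]
```

For example, with one existing product at `(10, 10)` and a new product at `(2, 2)`, a customer whose preference is `(1, 1)` is returned, while a customer at `(9, 9)` is not, because the existing product is closer to that preference in both dimensions.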
Deco: declarative crowdsourcing 1203-1212
  Aditya Ganesh Parameswaran; Hyunjung Park; Hector Garcia-Molina; Neoklis Polyzotis; Jennifer Widom
Crowdsourcing enables programmers to incorporate "human computation" as a building block in algorithms that cannot be fully automated, such as text analysis and image recognition. Similarly, humans can be used as a building block in data-intensive applications -- providing, comparing, and verifying data used by applications. Building upon the decades-long success of declarative approaches to conventional data management, we use a similar approach for data-intensive applications that incorporate humans. Specifically, declarative queries are posed over stored relational data as well as data computed on-demand from the crowd, and the underlying system orchestrates the computation of query answers.
   We present Deco, a database system for declarative crowdsourcing. We describe Deco's data model, query language, and our prototype. Deco's data model was designed to be general (it can be instantiated to other proposed models), flexible (it allows methods for data cleansing and external access to be plugged in), and principled (it has a precisely-defined semantics). Syntactically, Deco's query language is a simple extension to SQL. Based on Deco's data model, we define a precise semantics for arbitrary queries involving both stored data and data obtained from the crowd. We then describe the Deco query processor which uses a novel push-pull hybrid execution model to respect the Deco semantics while coping with the unique combination of latency, monetary cost, and uncertainty introduced in the crowdsourcing environment. Finally, we experimentally explore the query processing alternatives provided by Deco using our current prototype.
Predicting the effectiveness of keyword queries on databases 1213-1222
  Shiwen Cheng; Arash Termehchy; Vagelis Hristidis
Keyword query interfaces (KQIs) for databases provide easy access to data, but often suffer from low ranking quality, i.e. low precision and/or recall, as shown in recent benchmarks. It would be useful to be able to identify queries that are likely to have low ranking quality to improve the user satisfaction. For instance, the system may suggest to the user alternative queries for such hard queries. In this paper, we analyze the characteristics of hard queries and propose a novel framework to measure the degree of difficulty for a keyword query over a database, considering both the structure and the content of the database and the query results. We evaluate our query difficulty prediction model against two relevance judgment benchmarks for keyword search on databases, INEX and SemSearch. Our study shows that our model predicts the hard queries with high accuracy. Further, our prediction algorithms incur minimal time overhead.
You can stop early with COLA: online processing of aggregate queries in the cloud 1223-1232
  Yingjie Shi; Xiaofeng Meng; Fusheng Wang; Yantao Gan
Cloud-based data management systems are emerging as scalable, fault-tolerant, and efficient solutions for managing large volumes of data on cost-effective infrastructure, and more and more data analysis applications are being migrated to the cloud. As an attractive way to provide a quick sketch of massive data before the long wait for the final accurate query result, online processing of aggregate queries in the cloud is of paramount importance. This problem is challenging to solve because of the large block-based data organization and distributed processing mode in the cloud. In this paper, we present COLA, a system for Cloud Online Aggregation that provides progressive approximate answers for both single tables and joined multiple tables. We develop an online query processing algorithm for MapReduce to support incremental and continuous computing of aggregations on joins, which minimizes the waiting time before an acceptable estimate is achieved. We formulate a statistical foundation that supports block-level sampling for single-table online aggregations and effective estimation of approximate results and confidence intervals of statistical significance. We also develop a two-phase stratified sampling method to support multi-table aggregations, to improve the approximate query answers and speed up the convergence of confidence intervals. We implement COLA in Hadoop, and our experiments demonstrate that COLA can deliver reasonably precise online estimates within a time period two orders of magnitude shorter than that used to produce exact answers.
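The idea of progressive approximate answers with confidence intervals can be illustrated with a minimal single-machine sketch. It assumes uniform row-level sampling and a normal approximation for the interval, not the block-level or two-phase stratified sampling that the abstract describes, and it does not involve MapReduce or Hadoop. The function name, parameters, and the default `z` value for a roughly 95% interval are assumptions made for illustration.

```python
import math
import random


def online_mean(values, batch_size, z=1.96, seed=0):
    """Yield (rows_seen, running_mean, half_width) after each batch.

    Rows are visited in a random order so that every prefix is a uniform
    sample; the half-width uses a normal approximation and ignores the
    finite-population correction.
    """
    order = list(values)
    random.Random(seed).shuffle(order)
    n, mean, m2 = 0, 0.0, 0.0
    for x in order:
        # Welford's update for the running mean and sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % batch_size == 0:
            sample_variance = m2 / (n - 1)
            yield n, mean, z * math.sqrt(sample_variance / n)
```

A caller can iterate over the generator and stop as soon as the half-width falls below an acceptable error, which mirrors the "stop early" behaviour in the title:

```python
for rows_seen, estimate, half_width in online_mean(range(100000), 1000):
    if half_width < 0.01 * abs(estimate):
        break
```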

KM track: classification and semantic methods

A novel local patch framework for fixing supervised learning models 1233-1242
  Yilei Wang; Bingzheng Wei; Jun Yan; Yang Hu; Zhi-Hong Deng; Zheng Chen
In the past decades, machine learning models, especially supervised learning algorithms, have been widely used in various real-world applications. However, no matter how strong a learning model is, it will suffer from prediction errors when applied to real-world problems. Due to the black-box nature of supervised learning models, it is challenging to fix such a model by further learning from the failure cases it generates. In this paper, we propose a novel Local Patch Framework (LPF) to locally fix supervised learning models by learning from their predicted failure cases. Since learning models are generally globally optimized during the training process, LPF assumes that most of the learning errors are caused by local errors in the model. Thus we aim to open the black boxes of learning models by identifying and fixing the local errors of various models automatically. The proposed LPF has two key steps: local error region subspace learning and local patch model learning. In this way, we aim to fix the errors of learning models locally and automatically, with some generalization ability on unseen test data. Experiments on both classification and ranking problems show that the proposed LPF is effective and outperforms the original algorithms and the incremental learning model.
Automated feature weighting in naive Bayes for high-dimensional data classification 1243-1252
  Lifei Chen; Shengrui Wang
Naive Bayes (NB for short) is one of the popular methods for supervised classification in a knowledge management system. Currently, in many real-world applications, high-dimensional data pose a major challenge to conventional NB classifiers, due to noisy or redundant features and the local relevance of these features to classes. In this paper, an automated feature weighting solution is proposed to yield an NB method that is effective in dealing with high-dimensional data. We first propose a locally weighted probability model, for Bayesian modeling in high-dimensional spaces, to implement a soft feature selection scheme. Then we propose an optimization algorithm to find the weights in linear time complexity, based on the Logitnormal prior distribution and the Maximum a Posteriori principle. Experimental studies show the effectiveness and suitability of the proposed model for high-dimensional data classification.
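A minimal sketch of how per-feature weights can act as soft feature selection in naive Bayes follows. It assumes categorical features, class-specific weights in [0, 1] that are given rather than learned, and a score of the form log P(c) + sum over d of w[c][d] * log P(x_d | c). The weight-learning step described in the abstract is not reproduced, and all names and numbers are hypothetical.

```python
import math


def predict(x, priors, cond, weights):
    """Return the class with the highest weighted naive Bayes score.

    priors[c]      : P(c)
    cond[c][d][v]  : P(feature d takes value v | c)
    weights[c][d]  : weight of feature d for class c; a weight of 0
                     removes the feature, 1 gives ordinary naive Bayes.
    """
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        for d, value in enumerate(x):
            score += weights[label][d] * math.log(cond[label][d][value])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-class, two-feature model in which feature 1 is noisy.
priors = {"spam": 0.5, "ham": 0.5}
cond = {
    "spam": [{1: 0.6, 0: 0.4}, {1: 0.1, 0: 0.9}],
    "ham": [{1: 0.4, 0: 0.6}, {1: 0.9, 0: 0.1}],
}
uniform = {"spam": [1.0, 1.0], "ham": [1.0, 1.0]}
downweighted = {"spam": [1.0, 0.0], "ham": [1.0, 0.0]}
```

With the uniform weights the noisy second feature decides the outcome for the input `(1, 1)`, whereas setting its weight to zero lets the first feature decide.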
Learning to discover complex mappings from web forms to ontologies 1253-1262
  Yuan An; Xiaohua Hu; Il-Yeol Song
In order to realize the Semantic Web, various structures on the Web, including Web forms, need to be annotated with and mapped to domain ontologies. We present a machine learning-based automatic approach for discovering complex mappings from Web forms to ontologies. A complex mapping associates a set of semantically related elements on a form with a set of semantically related elements in an ontology. Existing schema mapping solutions mainly rely on integrity constraints to infer complex schema mappings. However, it is difficult to extract rich integrity constraints from forms. We show how machine learning techniques can be used to automatically discover complex mappings between Web forms and ontologies. The challenge is how to capture and learn the complicated knowledge encoded in existing complex mappings. We develop an initial solution that takes a naive Bayesian approach. We evaluated the performance of the solution on various domains. Our experimental results show that, for more than 80% of the test cases, the solution returns the expected mapping as the top-1 result, usually among several hundred candidate mappings. Furthermore, the expected mappings are always returned among the top-k results with k<4. The experiments demonstrate that the approach is effective and has the potential to save significant human effort.
Modeling semantic relations between visual attributes and object categories via dirichlet forest prior 1263-1272
  Xin Chen; Xiaohua Hu; Zhongna Zhou; Yuan An; Tingting He; E. K. Park
In this paper, we deal with two research issues: the automation of visual attribute identification and semantic relation learning between visual attributes and object categories. The contribution is two-fold. First, we provide a uniform framework to reliably extract both categorical attributes and depictive attributes. Second, we incorporate the obtained semantic associations between visual attributes and object categories into a text-based topic model and extract descriptive latent topics from external textual knowledge sources. Specifically, we show that in mining natural language descriptions from external knowledge sources, the relation between semantic visual attributes and object categories can be encoded as Must-Links and Cannot-Links, which can be represented by a Dirichlet-Forest prior. To alleviate the workload of manual supervision and labeling in the image categorization process, we introduce a semi-supervised training framework using a soft-margin semi-supervised SVM classifier. We also show that large-scale image categorization results can be significantly improved by combining automatically acquired visual attributes. Experimental results show that the proposed model is better able to describe object-related attributes and makes the inferred latent topics more descriptive.
CoNet: feature generation for multi-view semi-supervised learning with partially observed views 1273-1282
  Brian Quanz; Jun Huan
Multi-view semi-supervised learning methods try to exploit the combination of multiple views along with large amounts of unlabeled data in order to learn better predictive functions when limited labeled data is available. However, lack of complete view data limits the applicability of multi-view semi-supervised learning to real-world data. Commonly, one data view is readily and cheaply available, but additional views may be costly or only available in some cases. This work aims to make multi-view semi-supervised learning approaches more applicable to real-world data, specifically by addressing the issue of missing views.
   We introduce CoNet, a feature generation method that learns a mapping from one view to another that is specifically designed to produce features that are useful for multi-view semi-supervised learning algorithms. The mapping is then used to fill in views as pre-processing.
   Our comprehensive experimental study demonstrates the utility of our method as compared to the state-of-the-art multi-view semi-supervised learning methods for this scenario of partially observed views.

IR track: multimedia and user feedback

Generating facets for phone-based navigation of structured data 1283-1292
  Krishna Kummamuru; Ajith Jujjuru; Mayuri Duggirala
Designing interactive voice systems that impose an optimum cognitive load on callers has been an active research topic for quite some time. There have been many studies comparing user preferences for navigation trees with greater depth versus greater breadth. In this paper, we consider the navigation of structured data containing various types of attributes using phone-based interactions. This problem is particularly relevant to emerging economies, in which innovative voice-based applications are being built to serve semi-literate populations. We address the problem of identifying the right sequence of facets to be presented to the user for phone-based navigation of the data in two stages. First, we perform extensive user studies in the target population to understand the relation between the nature of the facets (attributes) of the data and the cognitive load. Second, we propose an algorithm to design optimum navigation trees based on the inferences made in the first stage. We compare the proposed algorithm with traditional facet generation algorithms with respect to various factors and discuss the optimality of the proposed algorithm.
The effect of aggregated search coherence on search behavior 1293-1302
  Jaime Arguello; Robert Capra
Aggregated search is the task of blending results from different specialized search services, or verticals, into the web search results. Aggregated search coherence refers to the degree to which results from different systems focus on similar senses of the query. While cross-component coherence has been cited as an important criterion for whole-page evaluation, its effect on search behavior has not been deeply investigated in prior research. In this work, we focus on the coherence between two aggregated search components: images and web results. In particular, we investigate whether the query-senses associated with the blended image results can influence user interaction with the web results. For example, if a user wants web results about "jaguar" the animal, are they more likely to examine the web results if the image results contain pictures of the animal instead of pictures of the car? Based on two large user studies, our results show that the image results can systematically affect user interaction with the web results. If the web results are largely consistent with the search task, then the effect of the image results is small. However, if the web results are only marginally consistent with the search task, such as when they are highly diversified across query-senses, the image results have a significant effect on user interaction with the web results. Our findings have implications on current research in whole-page evaluation, aggregated search, and diversity ranking.
Improving bag-of-visual-words model with spatial-temporal correlation for video retrieval 1303-1312
  Lei Wang; Dawei Song; Eyad Elyan
Most of the state-of-the-art approaches to Query-by-Example (QBE) video retrieval are based on the Bag-of-visual-Words (BovW) representation of visual content. This representation, however, ignores spatial-temporal information, which is important for similarity measurement between videos. Direct incorporation of such information into the video data representation for a large-scale data set is computationally expensive in terms of storage and similarity measurement. It is also static, regardless of how the discriminative power of visual words changes across queries. To tackle these limitations, we propose to discover Spatial-Temporal Correlations (STC) imposed by the query example to improve the BovW model for video retrieval. The STC, in terms of spatial proximity and relative motion coherence between different visual words, is crucial for identifying the discriminative power of the visual words. We develop a novel technique to emphasize the most discriminative visual words for similarity measurement, and incorporate this STC-based approach into the standard inverted index architecture. Our approach is evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical QBE video retrieval tasks respectively. The experimental results demonstrate that it substantially improves the BovW model as well as a state-of-the-art method that also utilizes spatial-temporal information for QBE video retrieval.
Exploring and predicting search task difficulty 1313-1322
  Jingjing Liu; Chang Liu; Michael Cole; Nicholas J. Belkin; Xiangmin Zhang
We report on an investigation of behavioral differences between users in difficult and easy search tasks. Behavioral factors that can be used in real time to predict task difficulty are identified. User data was collected in a controlled lab experiment (n=38) where each participant completed four search tasks in the genomics domain. We looked at user behaviors that can be obtained by systems at three levels, distinguished by the point in time at which the measurements can be made. They are: 1) first-round level at the beginning of the search, 2) accumulated level during the search, and 3) whole-session level by the end of the search. Results show that a number of user behaviors at all three levels differed between easy and difficult tasks. Models predicting task difficulty at all three levels were developed and evaluated. A real-time model incorporating first-round and accumulated levels of behaviors (FA) had fairly good prediction performance (accuracy 83%; precision 88%), which is comparable with the model using the whole-session level behaviors, which are not available in real time (accuracy 75%; precision 92%). We also found that, for efficiency purposes, using only a limited number of significant variables (FC_FA) can obtain a prediction accuracy of 75%, with a precision of 88%. Our findings can help search systems predict task difficulty and adapt search results to users.
Iterative relevance feedback with adaptive exploration/exploitation trade-off 1323-1331
  Nicolae Suditu; François Fleuret
Content-based image retrieval systems have to cope with two different regimes: understanding broadly the categories of interest to the user, and refining the search in this or these categories to converge to specific images among them. Here, in contrast with other types of retrieval systems, these two regimes are of great importance since the search initialization is hardly optimal (i.e. the page-zero problem) and the relevance feedback must tolerate the semantic gap of the image's visual features.
   We present a new approach that encompasses these two regimes, and infers from the user actions a seamless transition between them. Starting from a query-free approach meant to solve the page-zero problem, we propose an adaptive exploration/exploitation trade-off that transforms the original framework into a versatile retrieval framework with full searching capabilities. Our approach is compared to the state-of-the-art it extends by conducting user evaluations on a collection of 60,000 images from the ImageNet database.

DB track: emerging and advanced topics

A practical concurrent index for solid-state drives 1332-1341
  Risi Thonangi; Shivnath Babu; Jun Yang
Solid-state drives are becoming a viable alternative to magnetic disks in database systems, but their performance characteristics, particularly those caused by their erase-before-write behavior, make conventional database indexes a poor fit. There have been various proposals of indexes specialized for these devices, but to make such indexes practical, we must address the issue of concurrency control. Good concurrency control is especially critical to indexes on solid-state drives, because they typically rely on batch updates, which may take long and block concurrent index accesses. We design, implement, and evaluate an index structure called FD+tree and an associated concurrency control scheme called FD+FC. Our evaluation confirms significant performance advantages of our approach over less sophisticated ones, and brings out insights on data structure design and OLTP performance tuning on solid-state drives.
Robust distributed indexing for locality-skewed workloads 1342-1351
  Mu-Woong Lee; Seung-won Hwang
Multidimensional indexing is crucial for enabling a fast search over large-scale data. Owing to the unprecedented scale of data, extending such indexing technology has recently gained attention in distributed environments. The goal of existing efforts in distributed indexing has been the localization of queries to data residing at a small number of nodes (i.e., locality-preserving indexing) to minimize communication cost. However, considering that workloads often correlate with data locality, such indexing often generates hotspots. Location-based queries are typically skewed to disaster areas during certain periods of time, e.g., during Hurricane Irene, search traffic increased by more than 2000%. To alleviate such hotspots, we propose workload-balancing as an optimization goal. A cost model analytically supporting the need for load balancing is first developed, then a distributed index that evenly distributes the workload is presented. Our empirical study suggests that hotspots degrading search performance can be effectively alleviated. Specifically, when deployed to Amazon EC2, our proposed scheme showed maximum speed-up of 127.7%. Even in hostile settings where workload is not at all correlated with the search criteria, the proposed scheme's performance is comparable to existing approaches optimized for such settings.
Efficient provenance storage for relational queries BIBAFull-Text 1352-1361
  Zhifeng Bao; Henning Köhler; Liwei Wang; Xiaofang Zhou; Shazia Sadiq
Provenance information is vital in many application areas as it helps explain data lineage and derivation. However, storing fine-grained provenance information can be expensive. In this paper, we present a framework for storing provenance information relating to data derived via database queries. In particular, we first propose a provenance tree data structure which matches the query structure and thereby presents a possibility to avoid redundant storage of information regarding the derivation process. Then we investigate two approaches for reducing storage costs. The first approach utilizes two ingenious rules to achieve reduction on provenance trees. The second one is a dynamic programming solution, which provides a way of optimizing the selection of query tree nodes where provenance information should be stored. The optimization algorithm runs in polynomial time in the query size and is linear in the size of the provenance information, thus enabling provenance tracking and optimization without incurring large overheads. Experiments show that our approaches guarantee significantly lower storage costs than existing approaches.
Generically extending anonymization algorithms to deal with successive queries BIBAFull-Text 1362-1371
  Manuel Barbosa; Alexandre Pinto; Bruno Gomes
This paper addresses the scenario of multi-release anonymization of datasets. We consider dynamic datasets where data can be inserted and deleted, and view this scenario as a case where each release is a small subset of the dataset corresponding, for example, to the results of a query. Compared to multiple releases of the full database, this has the obvious advantage of faster anonymization. We present an algorithm for post-processing anonymized queries that prevents anonymity attacks using multiple released queries. This algorithm can be used with several distinct protection principles and anonymization algorithms, which makes it generic and flexible. We give an experimental evaluation of the algorithm and compare it to $m$-invariance both in terms of efficiency and data quality. To this end, we propose two data quality metrics based on Shannon's entropy, and show that they can be seen as a refinement of existing metrics.
Authentication of moving range queries BIBAFull-Text 1372-1381
  Duncan Yung; Eric Lo; Man Lung Yiu
A moving range query continuously reports the query results (e.g., restaurants) that are within radius $r$ of a moving query point (e.g., a moving tourist). To minimize the communication cost with the mobile clients, a service provider that evaluates moving range queries also returns a safe region that bounds the validity of query results. However, an untrustworthy service provider may report incorrect safe regions to mobile clients. In this paper, we present efficient techniques for authenticating the safe regions of moving range queries. We prove theoretically that our methods for authenticating moving range queries can minimize the data sent between the service provider and the mobile clients. Extensive experiments are carried out using both real and synthetic datasets, and the results show that our methods incur small communication costs and overhead.

KM track: novel applications

Model the complex dependence structures of financial variables by using canonical vine BIBAFull-Text 1382-1391
  Wei Wei; Xuhui Fan; Jinyan Li; Longbing Cao
Financial variables such as asset returns in the massive market contain various hierarchical and horizontal relationships forming complicated dependence structures. Modeling and mining of these structures is challenging due to their own high structural complexities as well as the stylized facts of the market data. This paper introduces a new canonical vine dependence model to identify the asymmetric and non-linear dependence structures of asset returns without any prior independence assumptions. To simplify the model while maintaining its merit, a partial correlation based method is proposed to optimize the canonical vine. Compared with the original canonical vine, the new model still maintains the most important dependence, but many unimportant nodes are removed to simplify the canonical vine structure. Our model is applied to construct and analyze dependence structures of European stocks as case studies. Its performance is evaluated by measuring the portfolio's Value at Risk, a widely used risk management measure. In comparison to a very recent canonical vine model and the 'full' model, our experimental results demonstrate that our model has a much better quality of Value at Risk, providing insightful knowledge for investors to control and reduce the aggregation risk of the portfolio.
A unified learning framework for auto face annotation by mining web facial images BIBAFull-Text 1392-1401
  Dayong Wang; Steven Chu Hong Hoi; Ying He
Auto face annotation plays an important role in many real-world multimedia information and knowledge management systems. Recently there has been a surge of research interest in mining weakly labeled facial images on the internet to tackle this long-standing research challenge in computer vision and image understanding. In this paper, we present a novel unified learning framework for face annotation by mining weakly labeled web facial images through interdisciplinary efforts of combining sparse feature representation, content-based image retrieval, transductive learning and inductive learning techniques. In particular, we first introduce a new search-based face annotation paradigm using transductive learning, and then propose an effective inductive learning scheme for training classification-based annotators from weakly labeled facial images, and finally unify both transductive and inductive learning approaches to maximize the learning efficacy. We conduct extensive experiments on a real-world web facial image database, in which encouraging results show that the proposed unified learning scheme outperforms the state-of-the-art approaches.
Efficient jaccard-based diversity analysis of large document collections BIBAFull-Text 1402-1411
  Fan Deng; Stefan Siersdorfer; Sergej Zerr
We propose two efficient algorithms for exploring topic diversity in large document corpora such as user generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity is useful for obtaining insights into knowledge evolution, trends, periodicities, and topic heterogeneity of such collections. Calculating diversity statistics requires averaging over the similarity of all object pairs, which, for large corpora, is prohibitive from a computational point of view. Our proposed algorithms overcome the quadratic complexity of the average pair-wise similarity computation, and allow for constant time (depending on dataset properties) or linear time approximation with probabilistic guarantees. We show examples of diversity-based studies on large samples from corpora such as the social photo sharing site Flickr, the DBLP bibliography, and US Census data.
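To illustrate the quantity the abstract above refers to, here is a minimal Python sketch. It is not the authors' algorithm: it assumes each document is a set of terms, and the function names are hypothetical. It contrasts the exact quadratic-time average pairwise Jaccard similarity with a naive estimate from uniformly sampled pairs, which carries only the usual sampling error rather than the guarantees claimed in the paper.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_avg_similarity(docs):
    """Average Jaccard similarity over all unordered pairs (quadratic time)."""
    pairs = list(combinations(docs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def sampled_avg_similarity(docs, n_samples, seed=0):
    """Estimate the same average from uniformly sampled distinct pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a, b = rng.sample(docs, 2)
        total += jaccard(a, b)
    return total / n_samples


# Toy corpus: each document is a set of terms (illustrative data only).
docs = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
exact = exact_avg_similarity(docs)
estimate = sampled_avg_similarity(docs, n_samples=2000)
```

A lower average similarity corresponds to a more diverse collection; the sampled estimate trades accuracy for a cost that does not grow with the number of pairs.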
Knowing where and how criminal organizations operate using web content BIBAFull-Text 1412-1421
  Michele Coscia; Viridiana Rios
We develop a framework that uses Web content to obtain quantitative information about a phenomenon that would otherwise require the operation of large scale, expensive intelligence exercises. Exploiting indexed reliable sources such as online newspapers and blogs, we use unambiguous query terms to characterize a complex evolving phenomenon and solve a security policy problem: identifying the areas of operation and modus operandi of criminal organizations, in particular, Mexican drug trafficking organizations over the last two decades. We validate our methodology by comparing information that is known with certainty with the information we extracted using our framework. We show that our framework is able to use information available on the web to efficiently extract implicit knowledge about criminal organizations. In the scenario of Mexican drug trafficking, our findings provide evidence that criminal organizations are more strategic and operate in more differentiated ways than the current academic literature suggests.

IR track: social networks

Social recommendation across multiple relational domains BIBAFull-Text 1422-1431
  Meng Jiang; Peng Cui; Fei Wang; Qiang Yang; Wenwu Zhu; Shiqiang Yang
Social networks enable users to create different types of personal items. In dealing with serious information overload, the major problems of social recommendation are sparsity and cold start. In existing approaches, relational and heterogeneous domains cannot be effectively utilized for social recommendation, which makes it challenging to model users and multiple types of items together on social networks. In this paper, we consider how to represent social networks with multiple relational domains and alleviate the major problems in an individual domain by transferring knowledge from other domains. We propose a novel Hybrid Random Walk (HRW), which can integrate multiple heterogeneous domains including directed/undirected links, signed/unsigned links and within-domain/cross-domain links into a star-structured hybrid graph with the user graph at the center. We perform a random walk until convergence and use the steady state distribution for recommendation. We conduct experiments on a real social network dataset and show that our method can significantly outperform existing social recommendation approaches.
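The idea of walking until convergence and ranking by the steady-state distribution can be sketched as follows. This is a plain random walk with restart on a toy graph, not the Hybrid Random Walk from the abstract above: the restart probability, the graph, and all names are assumptions made for illustration.

```python
def random_walk_with_restart(adj, start, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate the walk distribution until it stops changing.

    adj maps each node to a list of out-neighbors; every node is assumed
    to have at least one out-neighbor. With probability `restart` the
    walker jumps back to `start`, otherwise it follows a random out-link.
    """
    nodes = sorted(adj)
    p = {v: 1.0 if v == start else 0.0 for v in nodes}
    for _ in range(max_iter):
        nxt = {v: (restart if v == start else 0.0) for v in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy graph: two users and one item linked to each of them.
graph = {
    "u1": ["u2", "item_a"],
    "u2": ["u1", "item_b"],
    "item_a": ["u1"],
    "item_b": ["u2"],
}
scores = random_walk_with_restart(graph, start="u1")
# Rank every other node by its steady-state probability.
ranked = sorted((v for v in graph if v != "u1"), key=scores.get, reverse=True)
```

Items reached only through a friend (here `item_b`) still receive a nonzero score, which is how a walk of this kind can suggest items to a user with sparse history.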
Mining competitive relationships by learning across heterogeneous networks BIBAFull-Text 1432-1441
  Yang Yang; Jie Tang; Jacklyne Keomany; Yanting Zhao; Juanzi Li; Ying Ding; Tian Li; Liangwei Wang
Detecting and monitoring competitors is fundamental for a company to stay ahead in the global market. Existing studies mainly focus on mining competitive relationships within a single data source, while competing information is usually distributed in multiple networks. How to discover the underlying patterns and utilize the heterogeneous knowledge to avoid biased aspects in this issue is a challenging problem. In this paper, we study the problem of mining competitive relationships by learning across heterogeneous networks. We use Twitter and patent records as our data sources and statistically study the patterns behind the competitive relationships. We find that the two networks exhibit different but complementary patterns of competition. Our proposed model, Topical Factor Graph Model (TFGM), defines a latent topic layer to bridge the two networks and learns a semi-supervised learning model to classify the relationships between entities (e.g., companies or products). We test the proposed model on two real data sets and the experimental results validate the effectiveness of our model, with an average of +46% improvement over alternative methods.
Evaluating geo-social influence in location-based social networks BIBAFull-Text 1442-1451
  Chao Zhang; Lidan Shou; Ke Chen; Gang Chen; Yijun Bei
The emerging location-based social network (LBSN) services not only allow people to maintain cyber links with their friends, but also enable them to share the events happening to them at different locations. The geo-social correlations among event participants make it possible to quantify mutual user influence for various events. Such a quantification of influence could benefit a wide spectrum of real-life applications such as targeted advertising and viral marketing.
   In this paper, we perform an in-depth analysis of the geo-social correlations among LBSN users at event level, based on which we address two problems: user influence evaluation and influential events discovery. To capture the geo-social closeness between LBSN users, we propose a unified influence metric. This metric combines a novel social proximity measure named penalized hitting time, with a geographical weight function modeled by power law distribution. We propose two approximate algorithms, namely global iteration (GI) and dynamic neighborhood expansion (DNE), to efficiently evaluate user influence with tight theoretical error bounds. We then adopt the sampling technique and the threshold algorithm to support efficient retrieval of top-K influential events. Extensive experiments on both real-life and synthetic LBSN data sets confirm that the proposed algorithms are effective, efficient, and scalable.
The walls have ears: optimize sharing for visibility and privacy in online social networks BIBAFull-Text 1452-1461
  Thang N. Dinh; Yilin Shen; My T. Thai
With a rapid expansion of online social networks (OSNs), millions of users are tweeting and sharing their personal status daily without being aware of where that information eventually travels to. Likewise, with a huge magnitude of data available on OSNs, it poses a substantial challenge to track how a piece of information leaks to specific targets. In this paper, we study the problem of smartly sharing information to control the propagation of sensitive information in OSNs.
   In particular, we formulate and investigate the Maximum Circle of Trust problem, in which we seek to construct a circle of trust on the fly so that OSN users can safely share their information knowing that it will not be propagated to their unwanted targets (with whom they are not willing to share). Since most messages in OSNs are propagated within 2 to 5 hops, we first investigate this problem under 2-hop information propagation by showing the hardness of obtaining an optimal solution, along with an algorithm with a proven performance guarantee. In the general case where information can be propagated more than two hops, the problem is #P-hard, i.e., it is unlikely to be solvable in polynomial time. Thus we propose a novel greedy algorithm, hybridizing the handy but costly sampling method with a novel cut-based estimation. The quality of the hybrid algorithm is comparable to that of the sampling method while taking only a tiny fraction of the time. We have validated the effectiveness of our solutions on many real-world traces. These extensive experiments also highlight several important observations on information leakage, which help to sharpen the security of OSNs in the future.

Knowledge management short paper session

Influence and similarity on heterogeneous networks BIBAFull-Text 1462-1466
  Guan Wang; Qingbo Hu; Philip S. Yu
In social network research, the studies on social influence maximization and entity similarity are two important and orthogonal tasks. On homogeneous networks, social influence maximization research tries to identify an initial influential set that maximizes the spread of the information, while similarity studies focus on designing meaningful ways to quantify entities' similarities. As heterogeneous networks become ubiquitous and entities of different types are related to each other, we observe the possibility of merging the two directions to improve the performance of both. In fact, we found that influence values among one type of nodes and similarity scores among the other type of nodes reinforce each other towards better and more meaningful results.
   Therefore, we introduce a framework that computes social influence for one type of nodes and simultaneously measures similarity of the other type of nodes in a heterogeneous network. First, we decouple the target heterogeneous network (which we call the Influence Similarity (IS) network) into three different parts: the Influence network, the Similarity network and the information tunnels (IT) between them. Through IT, we exchange the influence scores and the similarity scores to calculate more precise similarity and influence scores in order to improve the quality of both. The experimental results on real-world data show that our framework enables the influence maximization framework to identify more influential seeds in the Influence network and, simultaneously, enables similarity measures to produce more meaningful similarity scores in the Similarity network.
GRAFT: an approximate graphlet counting algorithm for large graph analysis BIBAFull-Text 1467-1471
  Mahmudur Rahman; Mansurul Bhuiyan; Mohammad Al Hasan
Graphlet frequency distribution (GFD) is an analysis tool for understanding the variance of local structure in a graph. Many recent works use GFD for comparing and characterizing real-life networks. However, the main bottleneck for graph analysis using GFD is the excessive computation cost for obtaining the frequency of each of the graphlets in a large network. To overcome this, we propose a simple, yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have up to 5 vertices. Compared to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5%. For example, exact graphlet counting for ca-AstroPh takes approximately 3 days, but GRAFT runs for 45 minutes to perform the same task with a counting accuracy of 95.6%.
Hierarchical co-clustering based on entropy splitting BIBAFull-Text 1472-1476
  Wei Cheng; Xiang Zhang; Feng Pan; Wei Wang
Two-dimensional contingency tables or co-occurrence matrices arise frequently in various important applications such as text analysis and web-log mining. As a fundamental research topic, co-clustering aims to generate a meaningful partition of the contingency table to reveal hidden relationships between rows and columns. Traditional co-clustering algorithms usually produce a predefined number of flat partitions of both rows and columns, which do not reveal relationships among clusters. To address this limitation, hierarchical co-clustering algorithms have attracted a lot of research interest recently. Although successful in various applications, the existing hierarchical co-clustering algorithms are usually based on certain heuristics and do not have a solid theoretical background.
   In this paper, we present a new co-clustering algorithm with solid information theoretic background. It simultaneously constructs a hierarchical structure of both row and column clusters which retains sufficient mutual information between rows and columns of the contingency table. An efficient and effective greedy algorithm is developed which grows a co-cluster hierarchy by successively performing row-wise or column-wise splits that lead to the maximal mutual information gain. Extensive experiments on real datasets demonstrate that our algorithm can reveal essential relationships of row (and column) clusters and has better clustering precision than existing algorithms.
Mining long-lasting exploratory user interests from search history BIBAFull-Text 1477-1481
  Bin Tan; Yuanhua Lv; ChengXiang Zhai
A user's web search history contains many valuable search patterns. In this paper, we study search patterns that represent a user's long-lasting and exploratory search interests. By focusing on long-lastingness and exploratoriness, we are able to discover search patterns that are most useful for recommending new and relevant information to the user. Our approach is based on language modeling and clustering, and is specifically designed to handle web search logs. We run our algorithm on a real web search log collection, and evaluate its performance using a novel simulated study on the same search log dataset. Experimental results support our hypothesis that long-lastingness and exploratoriness are necessary for generating successful recommendations. Our algorithm is shown to effectively discover such search interest patterns, and is thus directly useful for making recommendations based on personal search history.
Feature selection based on term frequency and T-test for text categorization BIBAFull-Text 1482-1486
  Deqing Wang; Hui Zhang; Rui Liu; Weifeng Lv
Much work has been done on feature selection. Existing methods, such as the Chi-Square statistic and Information Gain, are based on document frequency. However, these methods have two shortcomings: one is that they are not reliable for low-frequency terms, and the other is that they only count whether a term occurs in a document and ignore the term frequency. In fact, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct the feature selection function based on term frequency, and proposes a new approach based on the t-test, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., chi², and IG) in terms of macro-F1 and micro-F1.
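A term-frequency-based t-test score of the kind described above might look like the following sketch. The abstract does not give the exact statistic, so this uses a Welch-style t statistic as an assumption; the function name and the toy documents are hypothetical, and treating the category and the corpus as independent samples is a simplification because the category is part of the corpus.

```python
from math import sqrt
from statistics import mean, variance


def t_score(term, category_docs, corpus_docs):
    """Welch-style t statistic comparing a term's mean per-document frequency
    in one category against its mean frequency in the whole corpus.

    Each document is a list of tokens; both groups need at least two documents.
    """
    cat = [doc.count(term) for doc in category_docs]
    everything = [doc.count(term) for doc in corpus_docs]
    se = sqrt(variance(cat) / len(cat) + variance(everything) / len(everything))
    if se == 0.0:
        return 0.0
    return (mean(cat) - mean(everything)) / se


# Illustrative tokenized documents, not data from the paper.
sports = [["goal", "goal", "team"], ["goal", "match"], ["team", "goal", "goal", "goal"]]
politics = [["vote", "party"], ["party", "match", "vote"], ["vote"]]
corpus = sports + politics

t_goal = t_score("goal", sports, corpus)    # frequent in sports only
t_match = t_score("match", sports, corpus)  # equally frequent everywhere
```

Terms whose score is far from zero for some category would be kept as features; a term such as "match", which is equally frequent inside and outside the category, scores zero.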
Adapting vector space model to ranking-based collaborative filtering BIBAFull-Text 1487-1491
  Shuaiqiang Wang; Jiankai Sun; Byron J. Gao; Jun Ma
Collaborative filtering (CF) is an effective technique addressing the information overload problem. Recently ranking-based CF methods have shown advantages in recommendation accuracy, being able to capture the preference similarity between users even if their rating scores differ significantly. In this study, we seek accuracy improvement of ranking-based CF through adaptation of the vector space model, where we consider each user as a document and her pairwise relative preferences as terms. We then use a novel degree-specialty weighting scheme resembling TF-IDF to weight the terms. Then we use cosine similarity to select a neighborhood of users for the target user to make recommendations. Experiments on benchmarks in comparison with the state-of-the-art methods demonstrate the promise of our approach.
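The user-as-document view can be sketched as follows. The weighting here is a plain smoothed IDF used as a stand-in, not the degree-specialty scheme from the abstract above, and all names and ratings are hypothetical.

```python
from itertools import combinations
from math import log, sqrt


def preference_terms(ratings):
    """Turn a user's ratings into 'terms' such as ('a', 'b'), meaning item a
    is preferred over item b. Ties produce no term."""
    terms = set()
    for i, j in combinations(sorted(ratings), 2):
        if ratings[i] > ratings[j]:
            terms.add((i, j))
        elif ratings[j] > ratings[i]:
            terms.add((j, i))
    return terms


def weighted_vectors(all_ratings):
    """Weight each preference term by a smoothed IDF computed over users."""
    docs = {u: preference_terms(r) for u, r in all_ratings.items()}
    n = len(docs)
    df = {}
    for terms in docs.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1
    return {u: {t: log((1 + n) / (1 + df[t])) + 1.0 for t in terms}
            for u, terms in docs.items()}


def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    nx = sqrt(sum(w * w for w in x.values()))
    ny = sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


ratings = {
    "alice": {"a": 5, "b": 3, "c": 1},
    "bob": {"a": 3, "b": 2, "c": 1},    # lower scores, same ordering as alice
    "carol": {"a": 1, "b": 3, "c": 5},  # reversed ordering
}
vecs = weighted_vectors(ratings)
sim_alice_bob = cosine(vecs["alice"], vecs["bob"])
sim_alice_carol = cosine(vecs["alice"], vecs["carol"])
```

Alice and Bob come out as identical neighbors even though their raw scores differ, which is the property the abstract attributes to ranking-based methods.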
Joint relevance and answer quality learning for question routing in community QA BIBAFull-Text 1492-1496
  Guangyou Zhou; Kang Liu; Jun Zhao
Community question answering (cQA) has become a popular service for users to ask and answer questions. In recent years, the efficiency of cQA services has been hindered by a sharp increase in questions in the community. This paper is concerned with the problem of question routing. Question routing in cQA aims to route new questions to the eligible answerers who can give high quality answers. However, the traditional methods suffer from the following two problems: (1) word mismatch between the new questions and the users' answering history; (2) high variance in perceived answer quality. To solve the above two problems, this paper proposes a novel joint learning method by taking both word mismatch and answer quality into a unified framework for question routing. We conduct experiments on a large-scale real-world data set from Yahoo! Answers. Experimental results show that our proposed method significantly outperforms the traditional query likelihood language model (QLLM) as well as the state-of-the-art cluster-based language model (CBLM) and category-sensitive query likelihood language model (TCSLM).
Fast approximation of steiner trees in large graphs BIBAFull-Text 1497-1501
  Andrey Gubichev; Thomas Neumann
Finding the minimum connected subtree of a graph that contains a given set of nodes (i.e., the Steiner tree problem) is a fundamental operation in keyword search in graphs, yet it is known to be NP-hard. Existing approximation techniques either make use of heavy indexing of the graph, or rely entirely on online heuristics.
   In this paper we bridge the gap between these two extremes and present a scalable landmark-based index structure that, combined with a few lightweight online heuristics, yields a fast and accurate approximation of the Steiner tree.
   Our solution handles real-world graphs with millions of nodes and provides an approximation error of less than 5% on average.
Automatically embedding newsworthy links to articles BIBAFull-Text 1502-1506
  Hakan Ceylan; Ioannis Arapakis; Pinar Donmez; Mounia Lalmas
It is of great interest to news providers such as Yahoo! News to attain higher visitor rates by promoting greater engagement with their content. One aspect of engagement deals with keeping users on the site longer by allowing them to navigate through content with enhanced, click-through experiences. News portals have invested in ways to provide embedded links within news stories. So far these links have been manually curated by professional editors, and due to the manual effort involved, the use of such links has been limited. In this paper we propose an automated approach to detecting and linking newsworthy events to associated articles. Our analysis, conducted on Amazon's Mechanical Turk, reveals that our system's performance is comparable to that of professional editors, and that users find the automatically generated highlights interesting and the associated articles worthy of reading.
Learning spectral embedding via iterative eigenvalue thresholding BIBAFull-Text 1507-1511
  Fanhua Shang; L. C. Jiao; Yuanyuan Liu; Fei Wang
Learning data representation is a fundamental problem in data mining and machine learning. Spectral embedding is one popular method for learning effective data representations. In this paper we propose a novel framework to learn enhanced spectral embedding, which not only considers the geometrical structure of the data space, but also takes advantage of the given pairwise constraints. The proposed formulation can be solved by an iterative eigenvalue thresholding (IET) algorithm. Specifically, we convert the problem of learning spectral embedding with pairwise constraints into that of completing an "ideal" kernel matrix. We introduce the spectral embedding of the graph Laplacian as the auxiliary information and cast the problem as a small-scale positive semidefinite (PSD) matrix optimization problem with nuclear norm regularization. Then, we develop an IET algorithm to solve it efficiently. Moreover, we also present an effective semi-supervised clustering (SSC) approach with learned spectral embedding (LSE). Finally, we validate the proposed IET algorithm and LSE approach by extensive experiments on real-world data sets.
Measuring robustness of complex networks under MVC attack BIBAFull-Text 1512-1516
  Rong-Hua Li; Jeffrey Xu Yu; Xin Huang; Hong Cheng; Zechao Shang
Measuring robustness of complex networks is a fundamental task for analyzing the structure and function of complex networks. In this paper, we study the network robustness under the maximal vertex coverage (MVC) attack, where the attacker aims to delete as many edges of the network as possible by attacking a small fraction of nodes. First, we present two robustness metrics of complex networks based on MVC attack. We then propose an efficient randomized greedy algorithm with near-optimal performance guarantee for computing the proposed metrics. Finally, we conduct extensive experiments on 20 real datasets. The results show that P2P and co-authorship networks are extremely robust under the MVC attack while both the online social networks and the Email communication networks exhibit vulnerability under the MVC attack. In addition, the results demonstrate the efficiency and effectiveness of our proposed algorithms for computing the corresponding robustness metrics.
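The attack model above, deleting as many edges as possible by removing a few nodes, can be illustrated with a plain greedy heuristic. This is not the randomized algorithm or the robustness metrics from the paper: the function name is hypothetical, and the covered-edge fraction is used only as an illustrative measure.

```python
def greedy_edge_coverage(edges, k):
    """Pick up to k nodes greedily, each time taking the node incident to the
    most not-yet-covered edges. Returns (chosen nodes, covered edge fraction)."""
    remaining = {frozenset(e) for e in edges}
    total = len(remaining)
    if total == 0:
        return [], 0.0
    chosen = []
    for _ in range(k):
        degree = {}
        for e in remaining:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        if not degree:
            break  # every edge is already covered
        # sorted() makes tie-breaking deterministic (smallest node wins)
        best = max(sorted(degree), key=degree.get)
        chosen.append(best)
        remaining = {e for e in remaining if best not in e}
    return chosen, (total - len(remaining)) / total


# A hub-dominated graph loses most edges to a single attacked node ...
star = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
star_nodes, star_fraction = greedy_edge_coverage(star, k=1)

# ... while a ring loses only the two edges of whichever node is attacked.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
ring_nodes, ring_fraction = greedy_edge_coverage(ring, k=1)
```

Under this toy measure, a smaller covered fraction for the same budget `k` indicates a network that is harder to damage by attacking few nodes.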
A simple approach to the design of site-level extractors using domain-centric principles BIBAFull-Text 1517-1521
  Chong Long; Xiubo Geng; Chang Xu; Sathiya Keerthi
We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to solve this problem are based on page level inference. We propose a distinct new approach that directly chooses attribute extractors for a site using a scoring mechanism that is designed at the domain level via simple classification methods using a training set from a small number of sites. To keep the number of candidate extractors in each site manageably small we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.
Extraction of topic evolutions from references in scientific articles and its GPU acceleration BIBAFull-Text 1522-1526
  Tomonari Masada; Atsuhiro Takasu
This paper provides a topic model for extracting topic evolutions as a corpus-wide transition matrix among latent topics. Recent trends in text mining point to a high demand for exploiting metadata. In particular, exploitation of reference relationships among documents induced by hyperlinking Web pages, citing scientific articles, tumblring blog posts, retweeting tweets, etc., is put in the foreground of the effort for effective mining. We focus on scholarly activities and propose a topic model for obtaining a corpus-wide view on how research topics evolve along citation relationships. Our model, called TERESA, extends latent Dirichlet allocation (LDA) by introducing a corpus-wide topic transition probability matrix, which models reference relationships as transitions among topics. Our approximated variational inference updates LDA posteriors and topic transition posteriors alternately. The main issue is execution time amounting to $O(MK^2)$, where $K$ is the number of topics and $M$ is the number of links in the citation network. Therefore, we accelerate the inference with Nvidia CUDA compatible GPUs. We compare the effectiveness of TERESA with that of LDA by introducing a new measure called diversity plus focusedness (D+F). We also present examples of the topic evolutions our method gives.
Graph-based workflow recommendation: on improving business process modeling BIBAFull-Text 1527-1531
  Bin Cao; Jianwei Yin; Shuiguang Deng; Dongjing Wang; Zhaohui Wu
Improving modeling efficiency and accuracy has become a pressing problem. The popularization of recommendation techniques in E-Commerce provides new directions for addressing the problem. In this paper, we propose a graph-based workflow recommendation for improving business process modeling. The starting point is a so-called "workflow repository" that includes a set of already developed process models. A graph mining method is used to extract the process patterns from the repository. Based on graph edit distance (GED) [2], we calculate the distance between the patterns and the partial business process under modeling, which is viewed as the reference model, and select the candidate nodes with smaller distances for recommendation. The performance study shows its feasibility for practical use.
Reconciling ontologies and the web of data BIBAFull-Text 1532-1536
  Ziawasch Abedjan; Johannes Lorey; Felix Naumann
To integrate Linked Open Data, which originates from various and heterogeneous sources, the use of well-defined ontologies is essential. However, oftentimes the utilization of these ontologies by data publishers differs from the intended application envisioned by ontology engineers. This may lead to unspecified properties being used ad-hoc as predicates in RDF triples or it may result in infrequent usage of specified properties. These mismatches impede the goals and propagation of the Web of Data as data consumers face difficulties when trying to discover and integrate domain-specific information. In this work, we identify and classify common misusage patterns by employing frequency analysis and rule mining. Based on this analysis, we introduce an algorithm to propose suggestions for a data-driven ontology re-engineering workflow, which we evaluate on two large-scale RDF datasets.
Efficient extraction of ontologies from domain specific text corpora BIBAFull-Text 1537-1541
  Tianyu Li; Pirooz Chubak; Laks V. S. Lakshmanan; Rachel Pottinger
Extracting ontological relationships (e.g., ISA and HASA) from free-text repositories (e.g., engineering documents and instruction manuals) can improve users' queries, as well as benefit applications built for these domains.
   Current methods to extract ontologies from text usually miss many meaningful relationships because they either concentrate on single-word terms and short phrases or neglect syntactic relationships between concepts in sentences.
   We propose a novel pattern-based algorithm to find ontological relationships between complex concepts by exploiting parsing information to extract multi-word concepts and nested concepts. Our procedure is iterative: we tailor the constrained sequential pattern mining framework to discover new patterns. Our experiments on three real data sets show that our algorithm consistently and significantly outperforms previous representative ontology extraction algorithms.
Effective and efficient?: bilingual sentiment lexicon extraction using collocation alignment BIBAFull-Text 1542-1546
  Zheng Lin; Songbo Tan; Xueqi Cheng; Xueke Xu; Weisong Shi
A bilingual sentiment lexicon is a fundamental resource for cross-language sentiment analysis, but its compilation remains a major bottleneck in computational linguistics. Traditional word alignment algorithms face a large alignment space, which may introduce redundant computations as well as alignment errors. In this paper, we use collocation alignment to extract a bilingual sentiment lexicon, overcoming the drawbacks of word alignment. The idea of collocation alignment is inspired by the strong cohesion between feature words and opinion words in sentiment corpora. Experimental results show that our approach not only decreases the computing time dramatically but also improves the precision of the extracted bilingual word pairs due to the smaller alignment space.
Exploiting latent relevance for relational learning of ubiquitous things BIBAFull-Text 1547-1551
  Lina Yao; Quan Z. Sheng
With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, physical things are becoming an integral part of the emerging ubiquitous Web. While this integration offers many exciting opportunities such as efficient supply chains and improved environmental monitoring, it also presents many significant challenges. One such challenge lies in how to classify, discover, and manage ubiquitous things, which is critical for efficient and effective object search, recommendation, and composition. In this paper, we focus on automatically classifying ubiquitous things into manageable semantic category labels by exploiting the information hidden in interactions between users and ubiquitous things. We develop a novel approach to extract latent relevance by building a relational network of ubiquitous things (RNUbiT) where similar things are linked via virtual edges according to their latent relevance. A discriminative learning algorithm is also developed to automatically determine category labels for ubiquitous things. We conducted experiments using real-world data and the experimental results demonstrate the feasibility and validity of our proposed approach.
Discovering personally semantic places from GPS trajectories BIBAFull-Text 1552-1556
  Mingqi Lv; Ling Chen; Gencai Chen
A place is a locale that is frequently visited by an individual user and carries important semantic meanings (e.g. home, work, etc.). Many location-aware applications will be greatly enhanced by the ability to automatically discover personally semantic places. The discovery of a user's personally semantic places involves obtaining the physical locations and semantic meanings of these places. In this paper, we propose approaches to address both of these problems. For the physical place extraction problem, a hierarchical clustering algorithm is proposed that first extracts visit points from the GPS trajectories and then clusters these visit points to form physical places. For the semantic place recognition problem, Bayesian networks (encoding the temporal patterns in which the places are visited) are used in combination with a customized POI (i.e. place of interest) database (containing the spatial features of the places) to categorize the extracted physical places into pre-defined types. An extensive set of experiments has been conducted to demonstrate the effectiveness of the proposed approaches based on a dataset of real-world GPS trajectories.
Mining coherent anomaly collections on web data BIBAFull-Text 1557-1561
  Hanbo Dai; Feida Zhu; Ee-Peng Lim; HweeHwa Pang
The recent boom of weblogs and social media has attached increasing importance to the identification of suspicious users with unusual behavior, such as spammers or fraudulent reviewers. A typical spamming strategy is to employ multiple dummy accounts to collectively promote a target, be it a URL or a product. Consequently, these suspicious accounts exhibit certain coherent anomalous behavior identifiable as a collection. In this paper, we propose the concept of Coherent Anomaly Collection (CAC) to capture this kind of collection, and put forward an efficient algorithm to simultaneously find the top-K disjoint CACs together with their anomalous behavior patterns. Compared with existing approaches, our new algorithm can find disjoint anomaly collections with coherent extreme behavior without having to specify either their number or sizes. Results on real Twitter data show that our approach discovers meaningful and informative hashtag spammer groups of various sizes which are hard to detect by clustering-based methods.
Mining topic-level opinion influence in microblog BIBAFull-Text 1562-1566
  Daifeng Li; Xin Shuai; Guozheng Sun; Jie Tang; Ying Ding; Zhipeng Luo
This paper proposes a Topic-Level Opinion Influence Model (TOIM) that simultaneously incorporates topic factors, user opinions and social influence in a unified probabilistic model with a two-stage learning process. In the first stage, topic factors and user influence are integrated to generate users' influence relationships on different topics; in the second stage, users' historical messages and social interaction records are leveraged by TOIM to construct their historical opinions and their neighbors' opinion influence through a statistical learning process, which can be further utilized to predict users' future opinions on specific topics. We evaluate TOIM on a large-scale dataset from Tencent Weibo, one of the largest microblogging websites in China. The experimental results show that TOIM predicts users' opinions better than other baseline methods.
Meta path-based collective classification in heterogeneous information networks BIBAFull-Text 1567-1571
  Xiangnan Kong; Philip S. Yu; Ying Ding; David J. Wild
Collective classification approaches exploit the dependencies of a group of linked objects whose class labels are correlated and need to be predicted simultaneously. In this paper, we focus on studying the collective classification problem in heterogeneous networks, which involve multiple types of data objects interconnected by multiple types of links. Intuitively, two objects are correlated if they are linked by many paths in the network. By considering different linkage paths in the network, one can capture the subtlety of different types of dependencies among objects. We introduce the concept of meta-path based dependencies among objects, where a meta path is a path consisting of a certain sequence of link types. We show that the quality of collective classification results strongly depends upon the meta paths used. To accommodate the large network size, a novel solution, called HCC (meta-path based Heterogenous Collective Classification), is developed to effectively assign labels to a group of instances that are interconnected through different meta paths. The proposed HCC model can capture different types of dependencies among objects with respect to different meta paths. Empirical studies on real-world networks demonstrate the effectiveness of the proposed meta path-based collective classification approach.
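   As an illustration of the meta-path idea described above (not the authors' HCC implementation), the following minimal sketch counts how many path instances connect a start object to each reachable object along a given sequence of link types. The function name, the edge layout, and the toy paper/author data are all hypothetical.

```python
def count_meta_path_instances(edges, start, meta_path):
    """Count path instances from `start` that follow `meta_path`.

    edges: dict mapping link type -> dict mapping node -> list of neighbors
           (a hypothetical layout chosen for this sketch).
    meta_path: sequence of link types to follow in order.
    Returns a dict mapping end node -> number of matching path instances.
    """
    counts = {start: 1}
    for link_type in meta_path:
        nxt = {}
        for node, c in counts.items():
            for nb in edges.get(link_type, {}).get(node, []):
                nxt[nb] = nxt.get(nb, 0) + c
        counts = nxt
    return counts


# Toy heterogeneous network: papers p1..p3 and authors a1, a2 (made-up data).
edges = {
    "written_by": {"p1": ["a1", "a2"], "p2": ["a1", "a2"], "p3": ["a2"]},
    "writes": {"a1": ["p1", "p2"], "a2": ["p1", "p2", "p3"]},
}

# Meta path paper -> author -> paper: papers sharing authors with p1.
co_author_counts = count_meta_path_instances(edges, "p1", ["written_by", "writes"])
```

   Under this sketch, objects linked to the start object by more path instances of a given meta path would be treated as more strongly correlated with it.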
Discretionary social network data revelation with a user-centric utility guarantee BIBAFull-Text 1572-1576
  Yi Song; Panagiotis Karras; Sadegh Nobari; Giorgos Cheliotis; Mingqiang Xue; Stéphane Bressan
The proliferation of online social networks has created intense interest in studying their nature and revealing information of interest to the end user. At the same time, such revelation raises privacy concerns. Existing research addresses this problem following an approach popular in the database community: a model of data privacy is defined, and the data is rendered in a form that satisfies the constraints of that model while aiming to maximize some utility measure. Still, there is no consensus on a clear and quantifiable utility measure over graph data. In this paper, we take a different approach: we define a utility guarantee, in terms of certain graph properties being preserved, that should be respected when releasing data, while otherwise distorting the graph to the extent desired for the sake of confidentiality. We propose a form of data release which builds on current practice in social network platforms: a user may want to see a subgraph of the network graph, in which that user as well as connections and affiliates participate. Such a snapshot should not allow malicious users to gain private information, yet provide useful information for benevolent users. We propose a mechanism to prepare data for user view under this setting. In an experimental study with real data, we demonstrate that our method preserves several properties of interest more successfully than methods that randomly distort the graph to an equal extent, while withstanding structural attacks proposed in the literature.
Empirical validation of the Buckley-Osthus model for the web host graph: degree and edge distributions BIBAFull-Text 1577-1581
  Maxim Zhukovskiy; Dmitry Vinogradov; Yuri Pritykin; Liudmila Ostroumova; Evgeniy Grechnikov; Gleb Gusev; Pavel Serdyukov; Andrei Raigorodskii
We consider the Buckley-Osthus implementation of preferential attachment and its ability to model the web host graph in two aspects. One is the degree distribution, which we observe to follow a power law, as is often the case for real-world graphs. The other is the two-dimensional edge distribution, that is, the number of edges between vertices of given degrees. We fit a single "initial attractiveness" parameter a of the model, first with respect to the degree distribution of the web host graph, and then, completely independently, with respect to the edge distribution. Surprisingly, the values of a we obtain turn out to be nearly the same. Therefore the same model with the same value of the parameter a fits very well the two independent and basic aspects of the web host graph. In addition, we demonstrate that other models completely fail to reproduce the asymptotic behavior of the edge distribution of the web host graph, even when accurately capturing the degree distribution.
   To the best of our knowledge, this is the first study confirming the ability of preferential attachment models to reflect the distribution of edges between vertices with respect to their degrees in a real graph of the Internet.
gSCorr: modeling geo-social correlations for new check-ins on location-based social networks BIBAFull-Text 1582-1586
  Huiji Gao; Jiliang Tang; Huan Liu
Location-based social networks (LBSNs) have attracted an increasing number of users in recent years. The availability of geographical and social information of online LBSNs provides an unprecedented opportunity to study the human movement from their socio-spatial behavior, enabling a variety of location-based services. Previous work on LBSNs reported limited improvements from using the social network information for location prediction; as users can check-in at new places, traditional work on location prediction that relies on mining a user's historical trajectories is not designed for this "cold start" problem of predicting new check-ins. In this paper, we propose to utilize the social network information for solving the "cold start" location prediction problem, with a geo-social correlation model to capture social correlations on LBSNs considering social networks and geographical distance. The experimental results on a real-world LBSN demonstrate that our approach properly models the social correlations of a user's new check-ins by considering various correlation strengths and correlation measures.
Swimming against the Streamz: search and analytics over the enterprise activity stream BIBAFull-Text 1587-1591
  Ido Guy; Tal Steier; Maya Barnea; Inbal Ronen; Tal Daniel
Activity streams have become prevalent on the web and are starting to emerge in enterprises. In this work, we present Streamz, a novel application that uses a faceted search approach to provide employees with advanced capabilities of search, navigation, attention management, and other types of analytics on top of an enterprise activity stream. We provide a detailed description of the Streamz tool as well as usage analysis based on user interface logs and interviews of active users.
What is happening right now ... that interests me?: online topic discovery and recommendation in Twitter BIBAFull-Text 1592-1596
  Ernesto Diaz-Aviles; Lucas Drumond; Zeno Gantner; Lars Schmidt-Thieme; Wolfgang Nejdl
Users engaged in the Social Web increasingly rely upon continuous streams of Twitter messages (tweets) for real-time access to information and fresh knowledge about current affairs. However, given the deluge of tweets, it is a challenge for individuals to find relevant and appropriately ranked information. We propose to address this knowledge management problem by going beyond the general perspective of information finding in Twitter, that asks: "What is happening right now?", towards an individual user perspective, and ask: "What is interesting to me right now?" In this paper, we consider collaborative filtering as an online ranking problem and present RMFO, a method that creates, in real-time, user-specific rankings for a set of tweets based on individual preferences that are inferred from the user's past system interactions. Experiments on the 476 million Twitter tweets dataset show that our online approach largely outperforms recommendations based on Twitter's global trend and Weighted Regularized Matrix Factorization (WRMF), a highly competitive state-of-the-art Collaborative Filtering technique, demonstrating the efficacy of our approach.
Frequent grams based embedding for privacy preserving record linkage BIBAFull-Text 1597-1601
  Luca Bonomi; Li Xiong; Rui Chen; Benjamin C. M. Fung
In this paper, we study the problem of privacy preserving record linkage which aims to perform record linkage without revealing anything about the non-linked records. We propose a new secure embedding strategy based on frequent variable length grams which allows record linkage on the embedded space. The frequent grams used for constructing the embedding base are mined from the original database under the framework of differential privacy. Compared with the state-of-the-art secure matching schema [15], our approach provides formal, provable privacy guarantees and achieves better scalability while providing comparable utility.
If you are happy and you know it... tweet BIBAFull-Text 1602-1606
  A Amir Asiaee T.; Mariano Tepper; Arindam Banerjee; Guillermo Sapiro
Extracting sentiment from Twitter data is one of the fundamental problems in social media analytics. Twitter's length constraint renders determining the positive/negative sentiment of a tweet difficult, even for a human judge. In this work we present a general framework for per-tweet (in contrast with batches of tweets) sentiment analysis which consists of: (1) extracting tweets about a desired target subject, (2) separating tweets with sentiment, and (3) setting apart positive from negative tweets. For each step, we study the performance of a number of classical and new machine learning algorithms. We also show that the intrinsic sparsity of tweets allows performing classification in a low dimensional space, via random projections, without losing accuracy. In addition, we present weighted variants of all employed algorithms, exploiting the available labeling uncertainty, which further improve classification accuracy. Finally, we show that spatially aggregating our per-tweet classification results produces a very satisfactory outcome, making our approach a good candidate for batch tweet sentiment analysis.
PRemiSE: personalized news recommendation via implicit social experts BIBAFull-Text 1607-1611
  Chen Lin; Runquan Xie; Lei Li; Zhenhua Huang; Tao Li
A variety of news recommender systems based on different strategies have been proposed to provide news personalization services for online news readers. However, little research has been reported on utilizing the implicit "social" factors (i.e., the potential influential experts in the news reading community) among news readers to facilitate news personalization. In this paper, we investigate the feasibility of integrating content-based methods, collaborative filtering and information diffusion models by employing probabilistic matrix factorization techniques. We propose PRemiSE, a novel Personalized news Recommendation framework via implicit Social Experts, in which the opinions of potential influencers on virtual social networks extracted from implicit feedback are treated as auxiliary resources for recommendation. Empirical results demonstrate the efficacy and effectiveness of our method, particularly in handling the so-called cold-start problem.
Hierarchical topic integration through semi-supervised hierarchical topic modeling BIBAFull-Text 1612-1616
  Xian-Ling Mao; Jing He; Hongfei Yan; Xiaoming Li
Many document collections are well organized in a hierarchical structure, and such structure can help users browse and understand these collections. Meanwhile, there are a large number of loosely organized plain document collections, and it is difficult for users to understand them effectively. In this paper we study how to automatically integrate latent topics in a plain collection with the topics in a hierarchically structured collection. We propose to use semi-supervised topic modeling to solve the problem in a principled way. The experiments show that the proposed method can both generate meaningful latent topics and expand high-quality hierarchical topic structures.
Exploiting enriched contextual information for mobile app classification BIBAFull-Text 1617-1621
  Hengshu Zhu; Huanhuan Cao; Enhong Chen; Hui Xiong; Jilei Tian
A key step in mobile app usage analysis is to classify apps into some predefined categories. However, it is a nontrivial task to effectively classify mobile apps due to the limited contextual information available for the analysis. To this end, in this paper, we propose an approach that first enriches the contextual information of mobile apps by exploiting additional Web knowledge from a Web search engine. Then, inspired by the observation that different types of mobile apps may be relevant to different real-world contexts, we also extract some contextual features for mobile apps from the context-rich device logs of mobile users. Finally, we combine all the enriched contextual information into a Maximum Entropy model for training a mobile app classifier. The experimental results based on 443 mobile users' device logs clearly show that our approach outperforms two state-of-the-art benchmark methods by a significant margin.
Incorporating word correlation into tag-topic model for semantic knowledge acquisition BIBAFull-Text 1622-1626
  Fang Li; Tingting He; Xinhui Tu; Xiaohua Hu
This paper presents a tag-topic model with Dirichlet Forest prior (TTM-DF) for semantic knowledge acquisition from blogs. The TTM-DF model extends the tag-topic model (TTM) by replacing the Dirichlet prior with the Dirichlet Forest prior over the topic-word multinomial. The correlations between words are calculated to generate a set of Must-Links and Cannot-Links, and the structures of the Dirichlet trees are then obtained through encoding the constraints of the Must-Links and Cannot-Links. Words under the same subtrees are expected to be more correlated than words under different subtrees. We conduct experiments on a synthetic dataset and a blog dataset. Both sets of experimental results show that the TTM-DF model performs much better than the TTM model. It can improve the coherence of the underlying topics and the tag-topic distributions, and capture semantic knowledge effectively.
PriSM: discovering and prioritizing severe technical issues from product discussion forums BIBAFull-Text 1627-1631
  Rashmi Gangadharaiah; Rose Catherine
Online forums provide a channel for users to report and discuss problems related to products and troubleshooting, for faster resolution. These could garner negative publicity if left unattended by the companies. Manually monitoring these massive amounts of discussions is laborious. This paper makes the first attempt at collecting issues that require immediate action by the product supplier by analyzing the immense information on forums. Features that are specific to forum discussions, in conjunction with linguistic cues, help in capturing and better prioritizing issues. Any attempt to collect training data for learning a classifier for this task will require enormous labeling effort. Hence, this paper adopts a co-training approach, which uses minimal manual labeling, coupled with linguistic features extracted using a set-expansion algorithm to discover severe problems. Further, the most distinct and recent issues are obtained by incorporating measures of 'centrality' and 'diversity' and the temporal aspect of the forum threads. We show that this helps in better prioritizing longstanding issues and identifying issues that need to be addressed immediately.
Preprocessing of informal mathematical discourse in context of controlled natural language BIBAFull-Text 1632-1636
  Raúl Ernesto Gutiérrez de Piñerez Reyes; Juan Francisco Díaz Frías
Informal Mathematical Discourse (IMD) is characterized by a mixture of natural language and symbolic expressions in the context of textbooks, mathematical publications and mathematical proofs. We focus on IMD processing at the low level of discourse. In this paper, we propose a preprocessing phase that precedes IMD structure analysis within the context of Controlled Natural Language (CNL). Our contribution is defined in the context of IMD processing and the use of machine learning: first, we present a CNL, a pure corpus and a Mathematical Treebank for processing IMD; second, we present a preprocessing phase for IMD analysis with connective disambiguation and verb treatment; finally, we obtain a satisfactory result on input text parsing using a statistical parsing model. We will carry these results over to the classification of argumentative informal practices via low-level discourse in IMD processing.
PathRank: a novel node ranking measure on a heterogeneous graph for recommender systems BIBAFull-Text 1637-1641
  Sangkeun Lee; Sungchan Park; Minsuk Kahng; Sang-goo Lee
In this paper, we present a novel random-walk based node ranking measure, PathRank, which is defined on a heterogeneous graph by extending the Personalized PageRank algorithm. Not only can our proposed measure exploit the semantics behind the different types of nodes and edges in a heterogeneous graph, but it can also emulate various recommendation semantics such as collaborative filtering, content-based filtering, and their combinations. The experimental results show that PathRank can produce more diverse and effective recommendation results compared to existing approaches.
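   For readers unfamiliar with the Personalized PageRank algorithm that the abstract says PathRank extends, here is a minimal power-iteration sketch of plain Personalized PageRank on a homogeneous graph. It is not PathRank itself; the function name, the damping value, the handling of dangling nodes, and the toy user/item graph are assumptions made for illustration.

```python
def personalized_pagerank(adj, seed, damping=0.85, iters=100):
    """Plain Personalized PageRank by power iteration.

    adj: dict mapping node -> list of out-neighbors (every neighbor must
         also be a key of adj).
    seed: the node the random walk restarts at.
    Returns a dict mapping node -> score; scores sum to 1.
    """
    nodes = list(adj)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            out = adj[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Assumption: a dangling node returns its mass to the seed.
                nxt[seed] += damping * rank[n]
        # Restart: with probability (1 - damping) the walk jumps to the seed.
        nxt[seed] += 1.0 - damping
        rank = nxt
    return rank


# Toy user/item graph (made-up data): users u1, u2 and items i1..i3.
graph = {
    "u1": ["i1", "i2"],
    "u2": ["i2", "i3"],
    "i1": ["u1"],
    "i2": ["u1", "u2"],
    "i3": ["u2"],
}
scores = personalized_pagerank(graph, "u1")
```

   Ranking the item nodes by their score relative to a user seed gives a simple recommendation order; a heterogeneous extension would additionally distinguish node and edge types during the walk.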
Unsupervised discovery of opposing opinion networks from forum discussions BIBAFull-Text 1642-1646
  Yue Lu; Hongning Wang; ChengXiang Zhai; Dan Roth
With more and more people freely expressing opinions as well as actively interacting with each other in discussion threads, online forums are becoming a gold mine with rich information about people's opinions and social behaviors. In this paper, we study an interesting new problem of automatically discovering opposing opinion networks of users from forum discussions, which are subsets of users who are strongly against each other on some topic. Toward this goal, we propose to use signals from both textual content (e.g., who says what) and social interactions (e.g., who talks to whom), which are both abundant in online forums. We also design an optimization formulation to combine all the signals in an unsupervised way. We created a data set by manually annotating forum data on five controversial topics, and our experimental results show that the proposed optimization method outperforms several baselines and existing approaches, demonstrating the power of combining text analysis and social network analysis in analyzing and generating the opposing opinion networks.
Exploring the existing category hierarchy to automatically label the newly-arising topics in cQA BIBAFull-Text 1647-1651
  Guangyou Zhou; Li Cai; Kang Liu; Jun Zhao
This work investigates selecting concise labels for newly-arising topics in community question answering. Previous methods of generating labels do not take the information of the existing category hierarchy into consideration. The main motivation of our paper is to utilize this information in the label generation process. We propose a general framework to address this problem. Firstly, we map the questions into Wikipedia concept sets, which are more meaningful than terms. Secondly, important concepts are identified to represent the main focus of the newly-arising topics. Thirdly, candidate labels are extracted from the Wikipedia category graph. Finally, candidate labels are filtered and reranked by combining structure information from the existing category hierarchy and the Wikipedia category graph. The experiments show that in our test collections, about 80% of the "correct" labels appear in the top ten labels recommended by our system.
Query-focused multi-document summarization based on query-sensitive feature space BIBAFull-Text 1652-1656
  Wenpeng Yin; Yulong Pei; Fan Zhang; Lian'en Huang
Query-oriented relevance, information richness and novelty are important requirements in query-focused summarization, which, to a considerable extent, determine the summary quality. Previous work either rarely took all of the above demands into account simultaneously or dealt with only part of them in the dynamic process of choosing sentences to generate a summary. In this paper, we propose a novel approach that integrates all these requirements by treating them as sentence features, so that the generated summary fully reflects the combined effect of these properties. Experimental results on the DUC2005 and DUC2006 datasets demonstrate the effectiveness of our approach.
Time-aware topic recommendation based on micro-blogs BIBAFull-Text 1657-1661
  Huizhi Liang; Yue Xu; Dian Tjondronegoro; Peter Christen
Topic recommendation can help users deal with the information overload issue in micro-blogging communities. This paper proposes to use the implicit information network formed by the multiple relationships among users, topics and micro-blogs, together with the temporal information of micro-blogs, to find semantically and temporally relevant topics for each topic, and to profile users' time-drifting topic interests. Content-based, Nearest Neighborhood-based and Matrix Factorization models are used to make personalized recommendations. The effectiveness of the proposed approaches is demonstrated in experiments conducted on a real-world dataset collected from Twitter.com.
Topic-sensitive probabilistic model for expert finding in question answer communities BIBAFull-Text 1662-1666
  Guangyou Zhou; Siwei Lai; Kang Liu; Jun Zhao
In this paper, we address the problem of expert finding in community question answering (CQA). Most of the existing approaches attempt to find experts in CQA by means of link analysis techniques. However, these traditional techniques only consider the link structure while ignoring the topical similarity among users (askers and answerers) as well as user expertise and user reputation. In this study, we propose a topic-sensitive probabilistic model, which is an extension of the PageRank algorithm, to find experts in CQA. Compared to the traditional link analysis techniques, our proposed method is more effective because it finds the experts by taking into account both the link structure and the topical similarity among users. We conduct experiments on a real-world data set from Yahoo! Answers. Experimental results show that our proposed method significantly outperforms the traditional link analysis techniques and achieves state-of-the-art performance for expert finding in CQA.
iSampling: framework for developing sampling methods considering user's interest BIBAFull-Text 1667-1671
  Jinoh Oh; Hwanjo Yu
Sampling is one of the fundamental techniques for data preprocessing and mining. It helps to reduce computational costs and improve the mining quality. A sampling method is typically developed independently for a specific problem and for a specific user's interest, because it is hard to develop a method that generalizes across various users' interests. The absence of a general framework for sampling makes it inefficient to develop or revise a sampling method as the user's interest changes. This paper proposes a general framework, iSampling, which helps a user develop sampling methods and easily modify the sampling interest expressed in the method. In the framework, a user explicitly describes her sampling interest in a graph model called the interest model. Then, iSampling automatically selects a sample set according to the model, which satisfies the user's interest. In order to demonstrate the effectiveness of our framework, we develop new trajectory sampling methods using it; trajectory sampling has been a challenging problem due to the high complexity of the data and the variety of users' interests. We demonstrate the flexibility of our framework by showing how easily trajectory samples for different interests can be generated within it.
WiSeNet: building a wikipedia-based semantic network with ontologized relations BIBAFull-Text 1672-1676
  Andrea Moro; Roberto Navigli
In this paper we present an approach for building a Wikipedia-based semantic network by integrating Open Information Extraction with Knowledge Acquisition techniques. Our algorithm extracts relation instances from Wikipedia page bodies and ontologizes them by, first, creating sets of synonymous relational phrases, called relation synsets, second, assigning semantic classes to the arguments of these relation synsets and, third, disambiguating the initial relation instances with relation synsets. As a result we obtain WiSeNet, a Wikipedia-based Semantic Network with Wikipedia pages as concepts and labeled, ontologized relations between them.
Shaping communities out of triangles BIBAFull-Text 1677-1681
  Arnau Prat-Pérez; David Dominguez-Sal; Josep M. Brunat; Josep-Lluis Larriba-Pey
Community detection has arisen as one of the most relevant topics in the field of graph data mining due to its importance in many fields such as biology, social networks or network traffic analysis. The metrics proposed to shape communities are too lax and do not consider the internal layout of the edges in the community, which leads to undesirable results. We define a new community metric called WCC. The proposed metric meets a minimum set of basic properties that guarantees communities with structure and cohesion. We experimentally show that WCC correctly quantifies the quality of communities and community partitions using real and synthetic datasets, and compare some of the most widely used state-of-the-art community detection algorithms.
The early-adopter graph and its application to web-page recommendation BIBAFull-Text 1682-1686
  Ida Mele; Francesco Bonchi; Aristides Gionis
In this paper we present a novel graph-based data abstraction for modeling the browsing behavior of web users. The objective is to identify users who discover interesting pages before others. We call these users early adopters. By tracking the browsing activity of early adopters we can identify new interesting pages early, and recommend these pages to similar users. We focus on news and blog pages, which are more dynamic in nature and more appropriate for recommendation.
   Our proposed model is called the early-adopter graph. In this graph, nodes represent users and a directed arc between users u and v expresses the fact that u and v visit similar pages and, in particular, that user u tends to visit those pages before user v. The weight of the edge is the degree to which the temporal rule "u visits a page before v" holds.
   Based on the early-adopter graph, we build a recommendation system for news and blog pages, which outperforms other off-the-shelf recommendation systems based on collaborative filtering.
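The graph construction described above can be illustrated with a minimal sketch. The abstract does not give the exact weight definition, so this example assumes a simple one: the weight of arc (u, v) is the fraction of pages visited by both users that u visited first. The function name `early_adopter_graph` and the `(user, page, timestamp)` input format are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def early_adopter_graph(visits):
    """Build a weighted directed graph from (user, page, timestamp) tuples.

    Assumed weight (not taken from the paper): for users u and v, the
    fraction of their shared pages that u visited strictly before v.
    """
    # page -> {user: earliest timestamp at which that user visited the page}
    first_visit = defaultdict(dict)
    for user, page, ts in visits:
        seen = first_visit[page]
        if user not in seen or ts < seen[user]:
            seen[user] = ts

    shared = defaultdict(int)   # (u, v) -> number of pages both visited
    earlier = defaultdict(int)  # (u, v) -> number of those where u came first
    for seen in first_visit.values():
        for u, v in permutations(seen, 2):
            shared[(u, v)] += 1
            if seen[u] < seen[v]:
                earlier[(u, v)] += 1

    # Keep only arcs where u preceded v at least once.
    return {pair: earlier[pair] / shared[pair] for pair in earlier}
```

Under this assumed weighting, a user with many high-weight outgoing arcs would be a candidate early adopter for the users those arcs point to.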
Relational co-clustering via manifold ensemble learning BIBAFull-Text 1687-1691
  Ping Li; Jiajun Bu; Chun Chen; Zhanying He
Co-clustering aims to group the samples and features simultaneously. It takes advantage of the duality between the samples and features. In many real-world applications, the data points or features usually reside on a submanifold of the ambient Euclidean space, but it is nontrivial to estimate the intrinsic manifolds in a principled way. In this study, we focus on improving the co-clustering performance via manifold ensemble learning, which aims to maximally approximate the intrinsic manifolds of both the sample and feature spaces. To achieve this, we develop a novel co-clustering algorithm called Relational Multi-manifold Co-clustering (RMC) based on symmetric nonnegative matrix tri-factorization, which decomposes the relational data matrix into three matrices. This method considers the inter-type relationship revealed by the relational data matrix and the intra-type information reflected by the affinity matrices. Specifically, we assume the intrinsic manifold of the sample or feature space lies in a convex hull of a group of pre-defined candidate manifolds. We hope to learn an appropriate convex combination of them to approach the desired intrinsic manifold. To optimize the objective, multiplicative rules are utilized to update the factorized matrices and the entropic mirror descent algorithm is exploited to automatically learn the manifold coefficients. Experimental results demonstrate the superiority of the proposed algorithm.
SemaFor: semantic document indexing using semantic forests BIBAFull-Text 1692-1696
  George Tsatsaronis; Iraklis Varlamis; Kjetil Nørvåg
Traditional document indexing techniques store documents using easily accessible representations, such as inverted indices, which can efficiently scale for large document sets. These structures offer scalable and efficient solutions in text document management tasks, though they omit the cornerstone of the documents' purpose: meaning. They also neglect semantic relations that bind terms into coherent fragments of text that convey messages. When semantic representations are employed, the documents are mapped to the space of concepts and the similarity measures are adapted appropriately to better fit the retrieval tasks. However, these methods can be slow both at indexing and retrieval time. In this paper we propose SemaFor, an indexing algorithm for text documents, which uses semantic spanning forests constructed from lexical resources, like Wikipedia and WordNet, and spectral graph theory in order to represent documents for further processing.
Measuring website similarity using an entity-aware click graph BIBAFull-Text 1697-1701
  Pablo N. Mendes; Peter Mika; Hugo Zaragoza; Roi Blanco
Query logs record the actual usage of search systems and their analysis has proven critical to improving search engine functionality. Yet, despite the deluge of information, query log analysis often suffers from the sparsity of the query space. Based on the observation that most queries pivot around a single entity that represents the main focus of the user's need, we propose a new model for query log data called the entity-aware click graph. In this representation, we decompose queries into entities and modifiers, and measure their association with clicked pages. We demonstrate the benefits of this approach on the crucial task of understanding which websites fulfill similar user needs, showing that using this representation we can achieve a higher precision than other query log-based approaches.
Community-based classification of noun phrases in Twitter BIBAFull-Text 1702-1706
  Freddy Chong Tat Chua; William W. Cohen; Justin Betteridge; Ee-Peng Lim
Many event monitoring systems rely on counting known keywords in streaming text data to detect sudden spikes in frequency. But the dynamic and conversational nature of Twitter makes it hard to select known keywords for monitoring. Here we consider a method of automatically finding noun phrases (NPs) as keywords for event monitoring in Twitter. Finding NPs has two aspects, identifying the boundaries for the subsequence of words which represent the NP, and classifying the NP to a specific broad category such as politics, sports, etc. To classify an NP, we define the feature vector for the NP using not just the words but also the author's behavior and social activities. Our results show that we can classify many NPs by using a sample of training data from a knowledge-base.
Real-time bid optimization for group-buying ads BIBAFull-Text 1707-1711
  Raju Balakrishnan; Rushi P. Bhatt
Group-buying ads seeking a minimum number of customers before the deal expiry are increasingly used by daily-deal providers. Unlike traditional web ads, the advertiser's profits for group-buying ads depend on the time to expiry and the additional customers needed to satisfy the minimum group size. Since both these quantities are time-dependent, optimal bid amounts to maximize profits change with every impression. Consequently, traditional static bidding strategies are far from optimal. Instead, bid values need to be optimized in real-time to maximize expected bidder profits. This online optimization of deal profits is made possible by the advent of ad exchanges offering real-time (spot) bidding. To this end, we propose a real-time bidding strategy for group-buying deals based on the online optimization of the bid values. We derive the expected bidder profit of deals as a function of the bid amounts, and dynamically vary bids to maximize profits. Further, to satisfy time constraints of the online bidding, we present methods of minimizing computation timings. We evaluate the proposed bidding on a multi-million click stream of 935 ads. The method shows significant profit improvement over the existing strategies.
Degree relations of triangles in real-world networks and graph models BIBAFull-Text 1712-1716
  Nurcan Durak; Ali Pinar; Tamara G. Kolda; C. Seshadhri
Triangles are an important building block and distinguishing feature of real-world networks, but their structure is still poorly understood. Despite numerous reports on the abundance of triangles, there is very little information on what these triangles look like. We initiate the study of degree-labeled triangles, specifically, degree homogeneity versus heterogeneity in triangles. This yields new insight into the structure of real-world graphs. We observe that networks coming from social and collaborative situations are dominated by homogeneous triangles, i.e., degrees of vertices in a triangle are quite similar to each other. On the other hand, information networks (e.g., web graphs) are dominated by heterogeneous triangles, i.e., the degrees in triangles are quite disparate. Surprisingly, nodes within the top 1% of degrees participate in the vast majority of triangles in heterogeneous graphs. We investigate whether current graph models reproduce the types of triangles that are observed in real data and observe that most models fail to accurately capture these salient features.
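As a rough illustration of what a degree-labeled triangle is, the sketch below enumerates each triangle of a small undirected graph once and reports the ratio of the largest to the smallest vertex degree in it. Using this max/min ratio as a homogeneity measure is an assumption made for the example, not necessarily the statistic used in the paper; the function name `triangle_degree_ratios` is hypothetical.

```python
from itertools import combinations

def triangle_degree_ratios(edges):
    """Return max-degree / min-degree for every triangle in an undirected graph.

    Values near 1 indicate a homogeneous triangle; large values indicate
    a heterogeneous one (assumed measure, for illustration only).
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    ratios = []
    for u in adj:
        # Only look at neighbors larger than u so each triangle is seen once,
        # from its smallest vertex.
        higher = sorted(n for n in adj[u] if n > u)
        for v, w in combinations(higher, 2):
            if w in adj[v]:
                degrees = [len(adj[u]), len(adj[v]), len(adj[w])]
                ratios.append(max(degrees) / min(degrees))
    return ratios
```

The vertex labels only need to be mutually comparable (for example, all integers), since the ordering is used solely to avoid counting a triangle three times.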
A probabilistic approach to mining geospatial knowledge from social annotations BIBAFull-Text 1717-1721
  Suradej Intagorn; Kristina Lerman
User-generated content, such as photos and videos, is often annotated by users with free-text labels, called tags. Increasingly, such content is also georeferenced, i.e., it is associated with geographic coordinates. The implicit relationships between tags and their locations can tell us much about how people conceptualize places and relations between them. However, extracting such knowledge from social annotations presents many challenges, since annotations are often ambiguous, noisy, uncertain and spatially inhomogeneous. We introduce a probabilistic framework for modeling georeferenced annotations and a method for learning model parameters from data. The framework is flexible and general, and can be used in a variety of applications that mine geospatial knowledge from user-generated content. Specifically, we study three problems: extracting place semantics, predicting locations of photos and learning part-of relations between places. We show our method performs well compared to state-of-the-art approaches developed for the first two problems, and offers a novel solution to the problem of learning relations between places.
Providing grades and feedback for student summaries by ontology-based information extraction BIBAFull-Text 1722-1726
  Fernando Gutierrez; Dejing Dou; Stephen Fickas; Gina Griffiths
Automatic grading systems for summaries and essays have been studied for years. Most commercial and research implementations are based on statistical methods, such as Latent Semantic Analysis (LSA), which can provide high accuracy on similarity between the essay and the graded or standard essays, but they can offer very limited feedback. In the present work, we propose a novel method to provide both grades and meaningful feedback for student summaries by Ontology-based Information Extraction (OBIE). We use ontological concepts and relationships to create extraction rules to identify correct statements. Based on ontology constraints (e.g., disjointness between concepts), we define patterns that are logically inconsistent with the ontology to create rules to extract incorrect statements. Experiments show that the grades given to 18 student summaries on Ecosystems by OBIE are correlated to human gradings. OBIE also provides meaningful feedback on the errors those students made in their summaries.
Joint bilingual name tagging for parallel corpora BIBAFull-Text 1727-1731
  Qi Li; Haibo Li; Heng Ji; Wen Wang; Jing Zheng; Fei Huang
Traditional isolated monolingual name taggers tend to yield inconsistent results across two languages. In this paper, we propose two novel approaches to jointly and consistently extract names from parallel corpora. The first approach uses standard linear-chain Conditional Random Fields (CRFs) as the learning framework, incorporating cross-lingual features propagated between two languages. The second approach is based on a joint CRFs model to jointly decode sentence pairs, incorporating bilingual factors based on word alignment. Experiments on Chinese-English parallel corpora demonstrated that the proposed methods significantly outperformed monolingual name taggers, were robust to automatic alignment noise and achieved state-of-the-art performance. With only 20% of the training data, our proposed methods can already achieve better performance compared to the baseline learned from the whole training set.
Using program synthesis for social recommendations BIBAFull-Text 1732-1736
  Alvin Cheung; Armando Solar-Lezama; Samuel Madden
This paper presents a new approach to select events of interest to users in a social media setting where events are generated from mobile devices. We argue that the problem is best solved by inductive learning, where the goal is to first generalize from the users' expressed "likes" and "dislikes" of specific events, then to produce a program that can be used to collect only data of interest.
   The key contribution of this paper is a new algorithm that combines machine learning techniques with program synthesis technology to learn users' preferences. We show that when compared with the more standard approaches, our new algorithm provides up to order-of-magnitude reductions in model training time, and significantly higher prediction accuracies for our target application.
Web-scale multi-task feature selection for behavioral targeting BIBAFull-Text 1737-1741
  Amr Ahmed; Mohamed Aly; Abhimanyu Das; Alexander J. Smola; Tasos Anastasakos
A typical behavioral targeting system optimizing purchase activities, called conversions, faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions. In this paper, we try to address these challenges through feature selection. We formulate a multi-task (or group) feature-selection problem among a set of related tasks (sharing a common set of features), namely advertising campaigns. We apply a group-sparse penalty consisting of a combination of an l1 and l2 penalty and an associated fast optimization algorithm for distributed parameter estimation. Our algorithm relies on a variant of the well known Fast Iterative Thresholding Algorithm (FISTA), a closed-form solution for mixed norm programming and a distributed subgradient oracle. To efficiently handle web-scale user histories, we present a distributed inference algorithm for the problem that scales to billions of instances and millions of attributes. We show the superiority of our algorithm in terms of both sparsity and ROC performance over baseline feature selection methods (both single-task-regularization and multi-task mutual-information gain).
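The closed-form step that a mixed l1/l2 (group-sparse) penalty admits can be sketched as a group soft-thresholding operator. This is the standard proximal operator for a sum of group l2 norms, shown here as a general illustration and not as the authors' distributed implementation. Treating one group as a single feature's weights across all campaigns is an assumption about the setup; `group_soft_threshold` is a hypothetical name.

```python
import math

def group_soft_threshold(weights, lam):
    """Proximal operator of lam * ||weights||_2 for one group of weights.

    The whole group is set to zero when its l2 norm is at most lam;
    otherwise every weight in the group is shrunk by the same factor.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= lam:
        # The group is too weak overall: drop the feature for every task.
        return [0.0] * len(weights)
    scale = 1.0 - lam / norm
    return [scale * w for w in weights]
```

Because a group is either kept or zeroed as a whole, a feature is selected or discarded jointly for all related tasks, which is what makes the penalty suitable for multi-task feature selection.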
Balanced coverage of aspects for text summarization BIBAFull-Text 1742-1746
  Takuya Makino; Hiroya Takamura; Manabu Okumura
We propose a new model for the guided text summarization task. In this task, it is required that a generated summary covers all the aspects, which are predefined for the topic of the given document cluster; for example, aspects for the topic "Accidents and Natural Disasters" include WHAT, WHEN, WHERE, WHY, WHO AFFECTED, DAMAGES and COUNTERMEASURES. We use as a scorer for an aspect, the maximum entropy classifier that predicts whether each sentence reflects the aspect or not. We formalize the coverage of the aspects as a max-min problem, which enables a summary to cover aspects in a well-balanced manner. In the max-min problem, the minimum of the aspect scores is going to be maximized so that the summary contains all the aspects as much as possible. Furthermore, we integrate the model based on the max-min problem with the maximum coverage summarization model, which generates a summary containing as many conceptual units as possible. Through the experiments on benchmark datasets for the guided summarization, we show that our model outperforms other approaches in terms of ROUGE-2.
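The max-min idea can be made concrete with a small brute-force sketch. It assumes that a summary's score for an aspect is the highest classifier score among its selected sentences, and that the summary is constrained by a total length budget; both are assumptions for illustration, and the paper's actual formulation and solver may differ. The function name `max_min_summary` is hypothetical.

```python
from itertools import combinations

def max_min_summary(scores, lengths, budget):
    """Pick the sentence subset that maximizes the weakest aspect score.

    scores[i][a] is the (assumed) classifier score of sentence i for aspect a.
    lengths[i] is the length of sentence i; budget caps the total length.
    Exhaustive search, so only suitable for tiny inputs.
    """
    n_sentences = len(scores)
    n_aspects = len(scores[0])
    best_subset, best_value = (), -1.0
    for k in range(1, n_sentences + 1):
        for subset in combinations(range(n_sentences), k):
            if sum(lengths[i] for i in subset) > budget:
                continue
            # Summary score per aspect = best sentence for that aspect;
            # objective = the worst-covered aspect.
            value = min(
                max(scores[i][a] for i in subset) for a in range(n_aspects)
            )
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value
```

With two aspects and three sentences scored `[[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]`, each of length 10, a budget of 20 selects sentences 0 and 1: together they cover both aspects well, whereas any subset containing the "average" sentence 2 leaves one aspect weaker.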
Dynamic effects of ad impressions on commercial actions in display advertising BIBAFull-Text 1747-1751
  Joel Barajas; Ram Akella; Marius Holtan; Jaimie Kwon; Aaron Flores; Victor Andrei
In this paper, we develop a time series approach, based on Dynamic Linear Models (DLM), to estimate the impact of ad impressions on the daily number of commercial actions when no user tracking is possible. The proposed method uses aggregate data, and hence it is simple to implement without expensive infrastructure. Specifically, we model the impact of the daily number of ad impressions on the daily number of commercial actions. We incorporate persistence of campaign effects on actions assuming a decay factor. We relax the assumption of a linear impact of ads on actions using the log-transformation. We also account for outliers with long-tailed distributions fitted and estimated automatically without a pre-defined threshold. This is applied to observational data post-campaign and does not require an experimental set-up. We apply the method to data from one commercial ad network on 2,885 campaigns for 1,251 products during six months, to calibrate and perform model selection. We set up a randomized experiment for two campaigns where user tracking is feasible. We find that the output of the proposed method is consistent with the results of A/B testing with similar confidence intervals.
A hybrid approach for efficient provenance storage BIBAFull-Text 1752-1756
  Yulai Xie; Dan Feng; Zhipeng Tan; Lei Chen; Kiran-Kumar Muniswamy-Reddy; Yan Li; Darrell D. E. Long
Efficient provenance storage is an essential step towards the adoption of provenance. In this paper, we analyze the provenance collected from multiple workloads with a view towards efficient storage. Based on our analysis, we characterize the properties of provenance with respect to long term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of web graph compression (adapted for provenance) and dictionary encoding, provides the best tradeoff in terms of compression ratio, compression time and query performance when compared to other compression schemes.

Information retrieval short paper session

Content-based relevance estimation on the web using inter-document similarities BIBAFull-Text 1769-1773
  Fiana Raiber; Oren Kurland; Moshe Tennenholtz
In adversarial and noisy search settings such as the Web, the document-query surface level similarity can be a highly misleading relevance signal. Thus, devising content-based relevance estimation (ranking) approaches becomes highly challenging. We address this challenge using two methods that utilize inter-document similarities in an initially retrieved list. The first removes documents from the list that exhibit high query similarity, but for which there is insufficient additional support for relevance that is based on inter-document similarities. The method is based on a probabilistic model that decouples document-query similarities from relevance estimation. The second method re-ranks the list by "rewarding" documents that exhibit high similarity both to the query and to other documents in the list. Both methods incorporate, in addition, at the model level, query-independent document quality estimates. Extensive empirical evaluation demonstrates the merits of our methods.
Trust prediction via aggregating heterogeneous social networks BIBAFull-Text 1774-1778
  Jin Huang; Feiping Nie; Heng Huang; Yi-Cheng Tu
Along with the increasing popularity of social web sites, users rely more on trustworthiness information for many online activities. However, such social network data often suffers from severe data sparsity and is not able to provide users with enough information. Therefore, trust prediction has emerged as an important topic in social network research. Traditional approaches explore the topology of the trust graph. Previous research in sociology and our life experience suggest that people who are in the same social circle often exhibit similar behavior and tastes. Such ancillary information is often accessible and therefore could potentially help the trust prediction. In this paper, we address the link prediction problem by aggregating heterogeneous social networks and propose a novel joint manifold factorization (JMF) method. Our new joint learning model explores the user group level similarity between correlated graphs and simultaneously learns the individual graph structure, so that the shared structures and patterns from multiple social networks can be utilized to enhance the prediction tasks. As a result, we not only improve the trust prediction in the target graph, but also facilitate other information retrieval tasks in the auxiliary graphs. To optimize the objective function, we break it down into several manageable sub-problems, then further establish the theoretical convergence with the aid of an auxiliary function. Extensive experiments were conducted on real world data sets and all empirical results demonstrated the effectiveness of our method.
Estimating interleaved comparison outcomes from historical click data BIBAFull-Text 1779-1783
  Katja Hofmann; Shimon Whiteson; Maarten de Rijke
Interleaved comparison methods, which compare rankers using click data, are a promising alternative to traditional information retrieval evaluation methods that require expensive explicit judgments. A major limitation of these methods is that they assume access to live data, meaning that new data must be collected for every pair of rankers compared. We investigate the use of previously collected click data (i.e., historical data) for interleaved comparisons. We start by analyzing to what degree existing interleaved comparison methods can be applied and find that a recent probabilistic method allows such data reuse, even though it is biased when applied to historical data. We then propose an interleaved comparison method that is based on the probabilistic approach but uses importance sampling to compensate for bias. We experimentally confirm that probabilistic methods make the use of historical data for interleaved comparisons possible and effective.
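The bias correction mentioned above relies on importance sampling, which can be sketched generically: outcomes observed under one distribution (the historical data) are reweighted to estimate the expected outcome under another (the target comparison). The code below is the textbook estimator, not the authors' interleaving method; the names `importance_sampled_mean`, `target_prob` and `source_prob` are hypothetical.

```python
def importance_sampled_mean(samples, target_prob, source_prob):
    """Estimate the expected outcome under a target distribution.

    samples: list of (event, outcome) pairs drawn under the source
             distribution (e.g. historical data).
    target_prob, source_prob: functions mapping an event to its probability
             under the target and source distributions. source_prob must be
             non-zero for every sampled event.
    """
    total = 0.0
    for event, outcome in samples:
        # Reweight each observed outcome by how much more (or less) likely
        # the event is under the target than under the source.
        total += outcome * target_prob(event) / source_prob(event)
    return total / len(samples)
```

For example, if event "x" (outcome 1.0) and event "y" (outcome 0.0) were each observed half the time historically, but the target distribution assigns them probabilities 0.8 and 0.2, the reweighted estimate is 0.8 rather than the naive historical average of 0.5.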
Automatic image annotation using tag-related random search over visual neighbors BIBAFull-Text 1784-1788
  Zijia Lin; Guiguang Ding; Mingqing Hu; Jianmin Wang; Jiaguang Sun
In this paper, we propose a novel image auto-annotation model using tag-related random search over range-constrained visual neighbors of the to-be-annotated image. The proposed model, termed TagSearcher, builds on two observations: the annotation performance of many previous visual-neighbor-based models is generally sensitive to the number of visual neighbors used, and the probability of selecting a visual neighbor is better made tag-dependent, meaning that each candidate tag can have its own trustworthy part of visual neighbors for score prediction. TagSearcher therefore uses a constrained range rather than an identical and fixed number of visual neighbors for auto-annotation. By performing a novel tag-related random search process over the graphical model made up of range-constrained visual neighbors, TagSearcher can find the trustworthy part for each candidate tag, and further utilize both visual similarities and tag correlations for score prediction. With the range constraint for visual neighbors and the tag-related random search process, TagSearcher can not only achieve satisfactory annotation performance, but also reduce the performance sensitivity. Experiments conducted on the Corel5k benchmark demonstrate its rationality and effectiveness.
Diversionary comments under political blog posts BIBAFull-Text 1789-1793
  Jing Wang; Clement T. Yu; Philip S. Yu; Bing Liu; Weiyi Meng
An important issue that has been neglected so far is the identification of diversionary comments. Diversionary comments under political blog posts are defined as comments that deliberately twist the bloggers' intention and divert the topic to another one. The purpose is to distract readers from the original topic and draw attention to a new topic. Given that political blogs have significant impact on society, we believe it is imperative to identify such comments. We then categorize diversionary comments into 5 types, and propose an effective technique to rank comments in descending order of being diversionary. To the best of our knowledge, the problem of detecting diversionary comments has not been studied so far. Our evaluation on 2,109 comments under 20 different blog posts from Digg.com shows that the proposed method achieves a high mean average precision (MAP) of 92.6%. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
Discover breaking events with popular hashtags in Twitter BIBAFull-Text 1794-1798
  Anqi Cui; Min Zhang; Yiqun Liu; Shaoping Ma; Kuo Zhang
In this paper, we utilize tags in Twitter (the hashtags) as an indicator of events. We first study the properties of hashtags for event detection. Based on several observations, we propose three attributes of hashtags, including (1) instability for temporal analysis, (2) Twitter meme possibility to distinguish social events from virtual topics or memes, and (3) authorship entropy for mining the most contributing authors. Based on these attributes, breaking events are discovered with hashtags, which cover a wide range of social events among different languages in the real world.
Query likelihood with negative query generation BIBAFull-Text 1799-1803
  Yuanhua Lv; ChengXiang Zhai
The query likelihood retrieval function has proven to be empirically effective for many retrieval tasks. From a theoretical perspective, however, the justification of the standard query likelihood retrieval function requires an unrealistic assumption that ignores the generation of a "negative query" from a document. This suggests that it is a potentially non-optimal retrieval function.
   In this paper, we attempt to improve the query likelihood function by bringing back the negative query generation. We propose an effective approach to estimate the probabilities of negative query generation based on the principle of maximum entropy, and derive a more complete query likelihood retrieval function that also contains the negative query generation component. The proposed approach not only bridges the theoretical gap in the existing query likelihood retrieval function, but also improves retrieval effectiveness significantly with no additional computational cost.
On the connections between explicit semantic analysis and latent semantic analysis BIBAFull-Text 1804-1808
  Chao Liu; Yi-Min Wang
Semantic analysis tries to solve problems arising from polysemy and synonymy that are abundant in natural languages. Recently, Gabrilovich and Markovitch proposed the Explicit Semantic Analysis (ESA) technique, which complements the well-known Latent Semantic Analysis (LSA) technique. In this paper, we show that the two techniques are not as distinct as their names suggest; instead, we find that ESA is equivalent to an LSA variant, and this equivalence generalizes to all kernel methods using kernels arising from the canonical dot product. Effectively, this result guarantees that ESA would not outperform the peak efficacy of LSA for any applications using the above kernel methods. In short, this paper for the first time establishes the connections between ESA and LSA, quantifies their relative efficacy, and generalizes the result to a big category of kernel methods.
Variance maximization via noise injection for active sampling in learning to rank BIBAFull-Text 1809-1813
  Wenbin Cai; Ya Zhang
Active learning for ranking, which is to selectively label the most informative examples, has been widely studied in recent years. In this paper, we propose a general active learning for ranking strategy called Variance Maximization (VM). The algorithm relies on noise injection to perturb the original unlabeled examples and generate the rank distribution of each example. Using a DCG-like gain function to measure each ranked list sampled from the rank distribution, Variance Maximization selects the unlabeled example with the largest variance in the gain. The VM strategy is applied at both the query level and the document level, and a two-stage active learning algorithm is further derived. Experimental results on both the LETOR 4.0 dataset and a real-world Web search ranking dataset have demonstrated the effectiveness of the proposed active learning approach.
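A simplified sketch of the selection rule described above: perturb the ranker's scores with noise, record the rank each unlabeled document receives across many perturbations, score each rank with a DCG-like gain, and pick the document whose gain varies most. Injecting Gaussian noise directly into scores (rather than into the examples' features), the gain `1 / log2(rank + 1)`, and all parameter defaults are assumptions for illustration; `select_by_variance` is a hypothetical name.

```python
import math
import random

def select_by_variance(scores, noise_sd=0.1, n_samples=200, seed=0):
    """Return (index of document with largest gain variance, all variances).

    scores: current ranker scores for the unlabeled documents of one query.
    """
    rng = random.Random(seed)
    gains = [[] for _ in scores]
    for _ in range(n_samples):
        # Noise injection: perturb every score, then re-rank.
        noisy = [s + rng.gauss(0.0, noise_sd) for s in scores]
        order = sorted(range(len(scores)), key=lambda i: -noisy[i])
        for rank, i in enumerate(order, start=1):
            # DCG-like positional gain for the rank this document received.
            gains[i].append(1.0 / math.log2(rank + 1))

    variances = []
    for g in gains:
        mean = sum(g) / len(g)
        variances.append(sum((x - mean) ** 2 for x in g) / len(g))

    best = max(range(len(scores)), key=lambda i: variances[i])
    return best, variances
```

A document whose rank never changes under perturbation has zero variance and is uninformative to label, while documents whose scores are close to each other swap positions often and are selected first.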
More than relevance: high utility query recommendation by mining users' search behaviors BIBAFull-Text 1814-1818
  Xiaofei Zhu; Jiafeng Guo; Xueqi Cheng; Yanyan Lan
Query recommendation plays a critical role in helping users search. Most existing approaches to query recommendation aim to recommend relevant queries. However, the ultimate goal of query recommendation is to assist users in reformulating queries so that they can accomplish their search task successfully and quickly. Considering only relevance in query recommendation does not directly serve this goal. In this paper, we argue that it is more important to directly recommend queries with high utility, i.e., queries that can better satisfy users' information needs. For this purpose, we propose a novel generative model, referred to as Query Utility Model (QUM), to capture query utility by simultaneously modeling users' reformulation and click behaviors. The experimental results on a publicly released query log show that our approach is more effective in helping users find relevant search results and thus satisfying their information needs.
Finding nuggets in IP portfolios: core patent mining through textual temporal analysis BIBAFull-Text 1819-1823
  Po Hu; Minlie Huang; Peng Xu; Weichang Li; Adam K. Usadi; Xiaoyan Zhu
Patents are critical for a company to protect its core technologies. Effective patent mining in massive patent databases can provide companies with valuable insights to develop strategies for IP management and marketing. In this paper, we study a novel patent mining problem of automatically discovering core patents (i.e., patents with high novelty and influence in a domain). We address the unique patent vocabulary usage problem, which is not considered in traditional word-based statistical methods, and propose a topic-based temporal mining approach to quantify a patent's novelty and influence. Comprehensive experimental results on real-world patent portfolios show the effectiveness of our method.
Interest-matching information propagation in multiple online social networks BIBAFull-Text 1824-1828
  Yilin Shen; Thang N. Dinh; Huiyuan Zhang; My T. Thai
Online social networks have become an imperative channel for extremely fast information propagation and influence. Thus, the problem of finding a minimum number of seed users who can eventually influence as many users in the network as possible has recently become one of the central research topics. Unfortunately, most related work has focused only on the network topologies and largely ignored many other important factors such as the users' engagements and the negative or positive impacts between users. More challengingly, the behavior of information propagation across multiple networks simultaneously remains an untrodden area and has become an urgent need. Our work is the first attempt to tackle the above problem in multiple networks, considering these missing factors. In order to capture the users' engagement, we propose targeting the set of interest-matching users whose interests are similar to what we try to propagate. Then, we develop our Iterative Semi-Supervising Learning based approach to identify the minimum seed users. We validate the effectiveness of our solution by using real-world Twitter-Foursquare networks and academic collaboration multiple networks.
Customizing search results for non-native speakers BIBAFull-Text 1829-1833
  Theodoros Lappas; Michail Vlachos
Blog posts, news articles and other webpages are present on the web in multiple languages. Standard search engines evaluate the relevance of the candidate documents to the given query. However, when considering documents with overlapping content, many of them written in a foreign language other than the user's own native tongue, it is beneficial to promote documents that are easy enough for the user to read. Here, we show how to rank a collection of foreign documents based on both: a) relevance to the query, and b) the comprehension difficulty of the document. We design effective ranking operators that evaluate the difficulty of a foreign document with respect to the user's native language. We show that existing search engines can easily augment their scoring function by incorporating the proposed comprehensibility metrics. Finally, we provide extensive experimental evidence that the comprehensibility-aware ranking model significantly improves the standard relevance-based ranking paradigm.
Quality models for microblog retrieval BIBAFull-Text 1834-1838
  Jaeho Choi; W. Bruce Croft; Jin Young Kim
Microblog services typically contain very short documents (e.g., tweets) containing comments about the latest news and events. Many of these documents are not informative or have very little content due to their personal and ephemeral nature. Providing effective retrieval in a microblog service will require addressing the challenge of distinguishing the high-quality, informative documents from the others. Recent work has focused on finding features that indicate the quality of microblog documents, but the impact of these quality features on retrieval is not clear. In this paper, we suggest a low-cost quality model using surrogate judgments based on user behavior (i.e., retweets) that can be collected automatically. We analyze the relationship between document informativeness and relevance judgments for microblog retrieval. Then we demonstrate that our behavior-based quality metric has a high correlation with manual judgments. We also perform experiments to study the impact of the quality model on microblog retrieval. The results based on the TREC Microblog track show that the proposed quality model, combined with a variety of retrieval models, can improve retrieval performance and is competitive with a model trained using manual relevance judgments.
Do ads compete or collaborate?: designing click models with full relationship incorporated BIBAFull-Text 1839-1843
  Xin Xin; Irwin King; Ritesh Agrawal; Michael R. Lyu; Heyan Huang
Traditionally, click models predict the click-through rate (CTR) of an advertisement (ad) independently of other ads. Recent research, however, indicates that the CTR of an ad depends not only on the quality of the ad itself but also on that of the neighboring ads. Using historical click-through data of a commercially available ad server, we identify two types of influence (competing and collaborating) among sponsored ads and further propose a novel click model, the Full Relation Model (FRM), which explicitly models dependencies between ads. On test data, FRM shows significant improvement in CTR prediction compared to earlier click models.
Exploiting concept hierarchy for result diversification BIBAFull-Text 1844-1848
  Wei Zheng; Hui Fang; Conglei Yao
The goal of result diversification is to maximize the coverage of query subtopics while minimizing the redundancy in the search results. Intuitively, it is more desirable for a diversification system to cover independent subtopics since it would retrieve sets of non-overlapped relevant documents, which leads to less redundancy in the search results. Unfortunately, existing diversification methods assume that query subtopics are independent and ignore their relations in the diversification process. To overcome this limitation, we propose to exploit concept hierarchies to extract query subtopics and infer their relations. We then apply axiomatic approaches to derive a structural diversification method that can leverage the subtopic relations in result diversification. Experimental results over an enterprise collection show that the relations among query subtopics are useful to improve the diversification performance.
Ranking news events by influence decay and information fusion for media and users BIBAFull-Text 1849-1853
  Liang Kong; Shan Jiang; Rui Yan; Shize Xu; Yan Zhang
In many cases, people would like to read news of great importance on the Internet. However, what users can grasp covers only a very small part of the huge and ever-increasing amount of news. In this paper, we try to find what users are most likely to be interested in. We notice that media focus plays an essential role in distinguishing news topics and that user attention is also an important factor. Therefore, we first propose five strategies which exploit only media focus to decide news influence impact. Then we provide three strategies to combine user attention with media focus. We also take four types of interaction between user attention and media focus into consideration. To the best of our knowledge, this is the first work to establish different models for computing influence decay of news topics. Experiments show that better influence scores are achieved by a decay algorithm based on the Ebbinghaus forgetting curve and by information fusion that considers interactions between user attention and media focus.
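As a rough illustration of forgetting-curve decay, the sketch below applies the commonly used exponential retention form R = exp(-t / S) to an initial influence score. The abstract does not give the paper's actual decay algorithm; the function name, the hourly time unit, and the `stability` parameter are all assumptions made for this example.

```python
import math


def decayed_influence(initial_score, hours_elapsed, stability=24.0):
    """Scale an initial influence score by an exponential retention factor.

    Uses R = exp(-t / S), a common simplified form of the Ebbinghaus
    forgetting curve. `stability` (S, in hours) is a hypothetical tuning
    parameter, not a value taken from the paper.
    """
    if hours_elapsed < 0:
        raise ValueError("hours_elapsed must be non-negative")
    if stability <= 0:
        raise ValueError("stability must be positive")
    return initial_score * math.exp(-hours_elapsed / stability)
```

With this form, a topic's score falls to 1/e of its initial value after `stability` hours, so a larger `stability` models news that stays influential for longer.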
Leveraging tagging for neighborhood-aware probabilistic matrix factorization BIBAFull-Text 1854-1858
  Le Wu; Enhong Chen; Qi Liu; Linli Xu; Tengfei Bao; Lei Zhang
Collaborative Filtering (CF) is a popular way to build recommender systems and has been successfully employed in many applications. Generally, two kinds of approaches to CF, the local neighborhood methods and the global matrix factorization models, have been widely studied. Though some previous research aims to combine the complementary advantages of both approaches, the performance is still limited due to the extreme sparsity of the rating data. Therefore, it is necessary to consider more information to better reflect user preference and item content. To that end, in this paper, by leveraging the extra tagging data, we propose a novel unified two-stage recommendation framework, named Neighborhood-aware Probabilistic Matrix Factorization (NHPMF). Specifically, we first use the tagging data to select neighbors of each user and each item, then add unique Gaussian distributions on each user's (item's) latent feature vector in the matrix factorization to ensure similar users (items) will have similar latent features. Since the proposed method can effectively explore the external data source (i.e., tagging data) in a unified probabilistic model, it leads to more accurate recommendations. Extensive experimental results on two real world datasets demonstrate that our NHPMF model outperforms the state-of-the-art methods.
Semantic context learning with large-scale weakly-labeled image set BIBAFull-Text 1859-1863
  Yao Lu; Wei Zhang; Ke Zhang; Xiangyang Xue
There are a large number of images available on the web; meanwhile, only a subset of web images can be labeled by professionals because manual annotation is time-consuming and labor-intensive. Although we can now use the collaborative image tagging system, e.g., Flickr, to get a lot of tagged images provided by Internet users, these labels may be incorrect or incomplete. Furthermore, semantics richness requires more than one label to describe one image in real applications, and multiple labels usually interact with each other in semantic space. It is of significance to learn semantic context with large-scale weakly-labeled image set in the task of multi-label annotation. In this paper, we develop a novel method to learn semantic context and predict the labels of web images in a semi-supervised framework. To address the scalability issue, a small number of exemplar images are first obtained to cover the whole data cloud; then the label vector of each image is estimated as a local combination of the exemplar label vectors. Visual context, semantic context, and neighborhood consistency in both visual and semantic spaces are sufficiently leveraged in the proposed framework. Finally, the semantic context and the label confidence vectors for exemplar images are both learned in an iterative way. Experimental results on the real-world image dataset demonstrate the effectiveness of our method.
Sketch-based indexing of n-words BIBAFull-Text 1864-1868
  Samuel Huston; J. Shane Culpepper; W. Bruce Croft
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing these types of statistics using standard inverted indexes requires unreasonable processing time or incurs a substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance.
   In this paper, we present and analyze a new index structure designed to improve query efficiency in term dependency retrieval models, with bounded space requirements. By adapting a class of (ε,δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate various statistics important in term dependency models with low, probabilistically bounded error rates. The space requirements of the sketch index structure are largely independent of the size and the number of phrase term dependencies.
   Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of all n-grams consisting of between 1 and 5 words extracted from the Clueweb-Part-B collection to less than 0.2% of the requirements of an equivalent full index. We show that n-gram queries of 5 words can be processed more efficiently than in current alternatives, such as next-word indexes. We show retrieval using the sketch index to be up to 400 times faster than with positional indexes, and 15 times faster than next-word indexes.
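To make the idea of an (ε,δ)-approximate frequency sketch concrete, here is a minimal Count-Min sketch, one well-known member of that class. The abstract does not say which sketch the paper adapts or how it is laid out in the index, so the class name, the hashing scheme, and the default parameters below are assumptions for illustration only.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate counter: estimates never undercount, and overcount by at
    most epsilon * (total count added) with probability at least 1 - delta."""

    def __init__(self, epsilon=0.001, delta=0.01):
        # Standard Count-Min sizing: width = ceil(e / epsilon),
        # depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, item, row):
        # One salted hash per row; the row number serves as the salt.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the estimate least inflated by collisions.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
for phrase in ["new york", "new york", "york city", "new york"]:
    sketch.add(phrase)
```

The table size depends only on `epsilon` and `delta`, not on how many distinct n-grams are inserted, which is the property that makes this kind of structure attractive for a vocabulary of term dependencies.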
Interactive and context-aware tag spell check and correction BIBAFull-Text 1869-1873
  Francesco Bonchi; Ophir Frieder; Franco Maria Nardini; Fabrizio Silvestri; Hossein Vahabi
Collaborative content creation and annotation creates vast repositories of all sorts of media, and user-defined tags play a central role as they are a simple yet powerful tool for organizing, searching and exploring the available resources. We observe that when a user annotates a resource with a set of tags, those tags are introduced one at a time. Therefore, when the fourth tag is introduced, the knowledge represented by the previous three tags, i.e., the context in which the fourth tag is produced, is available and exploitable for generating potential corrections of the current tag. This context, together with the "wisdom of the crowd" represented by the co-occurrences of tags in all the resources of the repository, can be exploited to provide interactive tag spell check and correction. We develop this idea in a framework based on a weighted tag co-occurrence graph and on node relatedness measures defined on weighted neighborhoods. We test our proposal on a dataset coming from YouTube. The results show that our framework is effective as it outperforms two important baselines. We also show that it is efficient, thus enabling its use in modern tagging services.
Federated search in the wild: the combined power of over a hundred search engines BIBAFull-Text 1874-1878
  Dong Nguyen; Thomas Demeester; Dolf Trieschnigg; Djoerd Hiemstra
Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.
From sBoW to dCoT marginalized encoders for text representation BIBAFull-Text 1879-1884
  Zhixiang (Eddie) Xu; Minmin Chen; Kilian Q. Weinberger; Fei Sha
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random sub-sets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed-form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve the classification accuracy across several document classification tasks.
Task tours: helping users tackle complex search tasks BIBAFull-Text 1885-1889
  Ahmed Hassan; Ryen W. White
Complex search tasks such as planning a vacation often comprise multiple queries and may span a number of search sessions. When engaged in such tasks, users may require holistic support in determining the required task activities. Unfortunately, current search engines do not offer such support to their users. In this paper, we propose methods to automatically generate task tours comprising a starting task and a set of relevant related tasks, some or all of which may be necessary to satisfy a user's information needs. Applications of the tours include helping users understand the required steps to complete a task, finding URLs related to the active task, and alerting users to activities they may have missed. We demonstrate through experimentation with human judges and large-scale search logs that our tours are of good quality and can benefit a significant fraction of search engine users.
Structured query reformulations in commerce search BIBAFull-Text 1890-1894
  Sreenivas Gollapudi; Samuel Ieong; Anitha Kannan
Recent work in commerce search has shown that understanding the semantics in user queries enables more effective query analysis and retrieval of relevant products. However, due to lack of sufficient domain knowledge, user queries often include terms that cannot be mapped directly to any product attribute. For example, a user looking for designer handbags might start with such a query because she is not familiar with the manufacturers, the price ranges, and/or the material that gives a handbag designer appeal. Current commerce search engines treat terms such as designer as keywords and attempt to match them to contents such as product reviews and product descriptions, often resulting in poor user experience.
   In this study, we propose to address this problem by reformulating queries involving terms such as designer, which we call modifiers, to queries that specify precise product attributes. We learn to rewrite the modifiers to attribute values by analyzing user behavior and leveraging structured data sources such as the product catalog that serves the queries. We first produce a probabilistic mapping between the modifiers and attribute values based on user behavioral data. These initial associations are then used to retrieve products from the catalog, over which we infer sets of attribute values that best describe the semantics of the modifiers. We evaluate the effectiveness of our approach based on a comprehensive Mechanical Turk study. We find that users agree with the attribute values selected by our approach in about 95% of the cases and that they prefer the results surfaced for our reformulated queries to those for the original queries 87% of the time.
Towards jointly extracting aspects and aspect-specific sentiment knowledge BIBAFull-Text 1895-1899
  Xueke Xu; Songbo Tan; Yue Liu; Xueqi Cheng; Zheng Lin
In this paper, we aim to jointly extract aspects and aspect-specific sentiment knowledge from online reviews, where the sentiment knowledge refers to the aspect-specific opinion words along with their aspect-aware sentiment polarities. To this end, we propose a Joint Aspect/Sentiment model (JAS). JAS detects aspect-specific opinion words by integrating opinion word lexicon knowledge to explicitly separate opinion words from factual words. More importantly, JAS exploits sentiment prior and aspect-contextual sentence-level co-occurrences of opinion words in reviews to further identify aspect-aware sentiment polarities for the opinion words. We apply the learned aspect-specific sentiment knowledge to practical aspect-level sentiment analysis tasks. Experimental results show the effectiveness of JAS in learning aspect-specific sentiment knowledge and the practical value of this knowledge when applied to aspect-level sentiment classification.
Collaborative ranking: improving the relevance for tail queries BIBAFull-Text 1900-1904
  Ke Zhou; Xin Li; Hongyuan Zha
It is well known that tail queries account for a substantial fraction of the distinct queries submitted to search engines and have thus become a major battlefield for search engines. Unfortunately, compared with popular queries, it is much more difficult to obtain good search results for tail queries due to the lack of important relevance signals, such as user clicks, phrase matches and so on. In this paper, we propose to utilize the similarities between different queries to overcome the data sparsity problem for tail queries. Specifically, we propose to jointly learn query similarities and the ranking function from data so that the relevance signals of different but related queries can be collaboratively pooled to enhance the ranking of tail queries. We emphasize that the joint optimization is critical so that the learned query similarity function can adapt to the problem of learning ranking functions. Our proposed method is evaluated on two data sets and the results show that our method improves the relevance of tail queries over several baseline alternatives.
BiasTrust: teaching biased users about controversial topics BIBAFull-Text 1905-1909
  V. G. Vinod Vydiswaran; ChengXiang Zhai; Dan Roth; Peter Pirolli
Deciding whether a claim is true or false often requires understanding the evidence supporting and contradicting the claim. However, when learning about a controversial claim, human biases and viewpoints may affect which evidence documents are considered "trustworthy" or credible. It is important to overcome this bias and know both viewpoints to get a balanced perspective. In this paper, we study various factors that affect learning about the truthfulness of controversial claims. We designed a user study to understand the impact of these factors. Specifically, we studied the impact of presenting evidence with contrasting viewpoints and source expertise rating on how users accessed the evidence documents. This would help us optimize how to teach users about controversial topics in the most effective way, and to design better claim verification systems. We find that users do not seek contrasting viewpoints by themselves, but explicitly presenting contrasting evidence helps them get a well-rounded understanding of the topic. Furthermore, explicit knowledge of the source credibility and the context not only affects what users read, but also how credible they perceive the document to be.
Recommending citations: translating papers into references BIBAFull-Text 1910-1914
  Wenyi Huang; Saurabh Kataria; Cornelia Caragea; Prasenjit Mitra; C. Lee Giles; Lior Rokach
When we write or prepare to write a research paper, we always have appropriate references in mind. However, there are most likely references that we have missed and should have read and cited. As such, a good citation recommendation system would improve not only our paper but also the overall efficiency and quality of literature search.
   Usually, a citation's context contains explicit words explaining the citation. Using this, we propose a method that "translates" research papers into references. By considering the citations and their contexts from existing papers as parallel data written in two different "languages", we adopt the translation model to create a relationship between these two "vocabularies".
   Experiments on both the CiteSeer and CiteULike datasets show that our approach outperforms other baseline methods and increases the precision, recall and f-measure by at least 5% to 10%, respectively. In addition, our approach runs much faster in both the training and recommending stages, which demonstrates the effectiveness and the scalability of our work.
Query-biased learning to rank for real-time Twitter search BIBAFull-Text 1915-1919
  Xin Zhang; Ben He; Tiejian Luo; Baobin Li
By incorporating diverse sources of evidence of relevance, learning to rank has been widely applied to real-time Twitter search, where users are interested in fresh relevant messages. Such approaches usually rely on a set of training queries to learn a general ranking model. We believe that the benefits brought by learning to rank may not have been fully exploited, as the characteristics and aspects unique to the given target queries are ignored. In this paper, we propose to further improve the retrieval performance of learning to rank for real-time Twitter search by taking the difference between queries into consideration. In particular, we learn a query-biased ranking model with a semi-supervised transductive learning algorithm so that the query-specific features, e.g. the unique expansion terms, are utilized to capture the characteristics of the target query. This query-biased ranking model is combined with the general ranking model to produce the final ranked list of tweets in response to the given target query. Extensive experiments on the standard TREC Tweets11 collection show that our proposed query-biased learning to rank approach outperforms a strong baseline, namely the conventional application of the state-of-the-art learning to rank algorithms.
Discovering logical knowledge for deep question answering BIBAFull-Text 1920-1924
  Zhao Liu; Xipeng Qiu; Ling Cao; Xuanjing Huang
Most open-domain question answering systems achieve better performance with large corpora, such as the Web, by taking advantage of information redundancy. However, explicit answers are not always mentioned in the corpus; many answers are implicitly contained and can only be deduced by inference. In this paper, we propose an approach to discover logical knowledge for deep question answering, which automatically extracts knowledge in an unsupervised, domain-independent manner from background texts and reasons out implicit answers for the questions. First, we use semantic role labeling to transform natural language expressions to predicates in first-order logic. Then we use association analysis to uncover the implicit relations among these predicates and build propositions for inference. Since our knowledge is drawn from different sources, we use Markov logic to merge multiple knowledge bases without resolving their inconsistencies. Our experiments show that these propositions can improve the performance of question answering significantly.
Mining noisy tagging from multi-label space BIBAFull-Text 1925-1929
  Zhongang Qi; Ming Yang; Zhongfei (Mark) Zhang; Zhengyou Zhang
In this paper we study the problem of mining noisy tagging. Most of the existing discriminative classification methods for this problem consider only one tag at a time as the classification target and completely ignore the rest of the given tags. In this paper we argue that all the given tags can be utilized simultaneously as an additional feature and that the information contained in the multi-label space can be exploited to improve classification performance. We first propose a novel distance measure to compute the distance between instances in the multi-label space. Then we propose several novel methods to incorporate the information of the multi-label space into the discriminative classification methods in one view learning or in two views learning to solve a general multi-label classification problem and to mitigate the influence of the noise in the classification. We apply the proposed solutions to the problem in a more specific context -- noisy image annotation, and evaluate the proposed methods on a standard dataset from the related literature. Experiments show that they are superior to the peer methods in the existing literature on solving the problem of mining noisy tagging.
Learning from mistakes: towards a correctable learning algorithm BIBAFull-Text 1930-1934
  Karthik Raman; Krysta M. Svore; Ran Gilad-Bachrach; Chris J. C. Burges
Many learning algorithms generate complex models that are difficult for a human to interpret, debug, and extend. In this paper, we address this challenge by proposing a new learning paradigm called correctable learning, where the learning algorithm receives external feedback about which data examples are incorrectly learned. We define a set of metrics which measure the correctability of a learning algorithm. We then propose a simple and efficient correctable learning algorithm which learns local models for different regions of the data space. Given an incorrect example, our method samples data in the neighborhood of that example and learns a new, more correct local model over that region. Experiments over multiple classification and ranking datasets show that our correctable learning algorithm offers significant improvements over the state-of-the-art techniques.
CONSENTO: a new framework for opinion based entity search and summarization BIBAFull-Text 1935-1939
  Jaehoon Choi; Donghyeon Kim; Seongsoon Kim; Junkyu Lee; Sangrak Lim; Sunwon Lee; Jaewoo Kang
Search engines have become an important decision making tool today. Decision making queries are often subjective, such as "a good birthday present for my girlfriend", "best action movies in 2010", to name a few. Unfortunately, such queries may not be answered properly by conventional search systems. In order to address this problem, we introduce Consento, a consensus search engine designed to answer subjective queries. Consento performs segment indexing, as opposed to document indexing, to capture semantics from user opinions more precisely. In particular, we define a new indexing unit, Maximal Coherent Semantic Unit (MCSU).
   An MCSU represents a segment of a document, which captures a single coherent semantic. We also introduce a new ranking method, called ConsensusRank that counts online comments referring to an entity as a weighted vote. In order to validate the efficacy of the proposed framework, we compare Consento with standard retrieval models and their recent extensions for opinion based entity ranking. Experiments using movie and hotel data show the effectiveness of our framework.
Search result presentation based on faceted clustering BIBAFull-Text 1940-1944
  Benno Stein; Tim Gollub; Dennis Hoppe
We propose a competence partitioning strategy for Web search result presentation: the unmodified head of a ranked result list is combined with a clustering of documents from the result list tail. We identify two principles to which such a clustering must adhere to improve the user's search experience: (1) Avoid the unwanted effect of query aspect repetition, which is called shadowing here. (2) Avoid extreme clusterings, i.e., neither the number of cluster labels nor the number of documents per cluster should exceed the size of the result list head. We present measures to quantify the shadowing effect, and with Faceted Clustering we introduce an algorithm that optimizes the identified principles. The key idea of Faceted Clustering is a dynamic, user-controlled reorganization of a clustering, similar to a faceted navigation system. We report on evaluations using the AMBIENT corpus and demonstrate the potential of our approach by a comparison with two well-known clustering search engines.
PolariCQ: polarity classification of political quotations BIBAFull-Text 1945-1949
  Rawia Awadallah; Maya Ramanath; Gerhard Weikum
We consider the problem of automatically classifying quotations about political debates into both topic and polarity. These quotations typically appear in news media and online forums. Our approach maps quotations onto one or more topics in a category system of political debates, containing more than a thousand fine-grained topics. To overcome the difficulty that pro/con classification faces due to the brevity of quotations and sparseness of features, we have devised a model of quotation expansion that harnesses antonyms from thesauri like WordNet. We developed a suite of statistical language models, judiciously customized to our settings, and use these to define similarity measures for unsupervised or supervised classifications. Experiments show the effectiveness of our method.
A comprehensive analysis of parameter settings for novelty-biased cumulative gain BIBAFull-Text 1950-1954
  Teerapong Leelanupab; Guido Zuccon; Joemon M. Jose
In the TREC Web Diversity track, novelty-biased cumulative gain (α-NDCG) is one of the official measures to assess retrieval performance of IR systems. The measure is characterised by a parameter, α, the effect of which has not been thoroughly investigated. We find that common settings of α, i.e. α=0.5, may prevent the measure from behaving as desired when evaluating result diversification. This is because it excessively penalises systems that cover many intents while it rewards those that redundantly cover only a few intents. This issue is crucial since it highly influences systems at top ranks. We revisit our previously proposed threshold, suggesting α be set on a per-query basis. The intuitiveness of the measure is then studied by examining actual rankings from TREC 09-10 Web track submissions. By varying α according to our query-based threshold, the discriminative power of α-NDCG is not harmed and, in fact, our approach improves α-NDCG's robustness. Experimental results show that the threshold for α can make the measure more intuitive than its common settings do.
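The role of α is easiest to see in the unnormalized form of the measure. The sketch below computes α-DCG under its usual definition: each intent a document covers contributes (1 - α)^r, where r is the number of higher-ranked documents already covering that intent, and the gain is discounted by log2(rank + 1). It omits the division by an ideal ranking's score that α-NDCG requires, and it does not implement the paper's query-based threshold; the function name and data layout are assumptions.

```python
import math


def alpha_dcg(ranking, judgments, alpha=0.5, depth=10):
    """Unnormalized alpha-DCG for one query.

    ranking   -- list of document ids, best first
    judgments -- dict mapping a document id to the set of intents it covers
    alpha     -- redundancy penalty in [0, 1]; 0 means no penalty
    """
    times_covered = {}  # intent -> number of higher-ranked docs covering it
    score = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        intents = judgments.get(doc, ())
        # Each repeat coverage of an intent is worth (1 - alpha) times less.
        gain = sum((1.0 - alpha) ** times_covered.get(i, 0) for i in intents)
        for i in intents:
            times_covered[i] = times_covered.get(i, 0) + 1
        score += gain / math.log2(rank + 1)
    return score


judgments = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}}
redundant = alpha_dcg(["d1", "d2", "d3"], judgments, alpha=0.5)
diverse = alpha_dcg(["d1", "d3", "d2"], judgments, alpha=0.5)
```

Here the `diverse` ranking scores higher than the `redundant` one because the second document covering intent "a" earns only half the gain; with `alpha=0` the two rankings score the same.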
Entity centric query expansion for enterprise search BIBAFull-Text 1955-1959
  Xitong Liu; Hui Fang; Fei Chen; Min Wang
Enterprise search is important, and the search quality has a direct impact on the productivity of an enterprise. Many information needs of enterprise search center around entities. Intuitively, information related to the entities mentioned in the query, such as related entities, would be useful to reformulate the query and improve the retrieval performance. However, most existing studies on query expansion are term-centric. In this paper, we propose a novel entity-centric query expansion framework for enterprise search. Specifically, given a query containing entities, we first utilize both unstructured and structured information to find entities that are related to the ones in the query. We then discuss how to adapt existing feedback methods to use the related entities to improve search quality. Experimental results show that the proposed entity-centric query expansion strategy is more effective at improving search performance than the state-of-the-art pseudo feedback methods on longer, natural language-like queries with entities.
Location-sensitive resources recommendation in social tagging systems BIBAFull-Text 1960-1964
  Chang Wan; Ben Kao; David W. Cheung
In social tagging systems, resources such as images and videos are annotated with descriptive words called tags. It has been shown that tag-based resource searching and retrieval is much more effective than content-based retrieval. With the advances in mobile technology, many resources are also geo-tagged with location information. We observe that a traditional tag (word) can carry different semantics at different locations. We study how location information can be used to help distinguish the different semantics of a resource's tags and thus to improve retrieval accuracy. Given a search query, we propose a location-partitioning method that partitions all locations into regions such that the user query carries distinguishing semantics in each region. Based on the identified regions, we utilize location information in estimating the ranking scores of resources for the given query. These ranking scores are learned using the Bayesian Personalized Ranking (BPR) framework. Two algorithms, namely, LTD and LPITF, which apply Tucker Decomposition and Pairwise Interaction Tensor Factorization, respectively for modeling the ranking score tensor are proposed. Through experiments on real datasets, we show that LTD and LPITF outperform other tag-based resource retrieval methods.
Differences in effectiveness across sub-collections 1965-1969
  Mark Sanderson; Andrew Turpin; Ying Zhang; Falk Scholer
The relative performance of retrieval systems when evaluated on one part of a test collection may bear little or no similarity to the relative performance measured on a different part of the collection. In this paper we report the results of a detailed study of the impact that different sub-collections have on retrieval effectiveness, analyzing the effect over many collections, and with different approaches to sub-dividing the collections. The effect is shown to be substantial, impacting on comparisons between retrieval runs that are statistically significant. Some possible causes for the effect are investigated, and the implications of this work are examined for test collection design and for the strength of conclusions one can draw from experimental results.
Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries 1970-1974
  Mihai Georgescu; Dang Duc Pham; Claudiu S. Firan; Wolfgang Nejdl; Julien Gaugaz
Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed -- with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on feedback from both our online users and Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and differs from the human assessors by only 4% in accuracy.
Full-text citation analysis: enhancing bibliometric and scientific publication ranking 1975-1979
  Xiaozhong Liu; Jinsong Zhang; Chun Guo
The goal of this paper is to use innovative text and graph mining algorithms along with full-text citation analysis and topic modeling to enhance classical bibliometric analysis and publication ranking. By utilizing citation contexts extracted from a large number of full-text publications, each citation or publication is represented by a probability distribution over a set of predefined topics, where each topic is labeled by an author-contributed keyword. We then use the publication/citation topic distributions to generate a citation graph with vertex prior and edge transition probability distributions. The publication importance score for each given topic is calculated by PageRank with edge and vertex prior distributions. Based on 104 topics (labeled with keywords) and their review papers, the cited publications of each review paper are assumed to be "important publications" for ranking evaluation. The results show that full-text citation and publication content prior topic distributions along with the PageRank algorithm can significantly enhance bibliometric analysis and scientific publication ranking performance for academic IR systems.
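The PageRank variant described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the damping factor of 0.85, the handling of papers without outgoing citations, and the three-paper toy graph are all assumptions made for illustration. The sketch assumes that the vertex prior sums to 1 and that each paper's outgoing edge probabilities sum to 1.

```python
def pagerank_with_priors(transitions, prior, damping=0.85, iterations=100):
    """Power iteration for PageRank with a vertex prior and weighted edges.

    transitions: {citing_paper: {cited_paper: probability}} (hypothetical input
                 format); each inner dict is assumed to sum to 1.
    prior:       {paper: prior probability}, assumed to sum to 1.
    """
    nodes = list(prior)
    scores = dict(prior)  # start from the prior distribution
    for _ in range(iterations):
        # Teleport step: jump to a paper according to the vertex prior.
        new_scores = {n: (1.0 - damping) * prior[n] for n in nodes}
        for src in nodes:
            out_edges = transitions.get(src, {})
            if out_edges:
                # Follow citations according to the edge transition probabilities.
                for dst, prob in out_edges.items():
                    new_scores[dst] += damping * scores[src] * prob
            else:
                # Assumption: a paper with no outgoing citations spreads its
                # score according to the prior.
                for n in nodes:
                    new_scores[n] += damping * scores[src] * prior[n]
        scores = new_scores
    return scores


# Toy citation graph for one topic: p1 and p2 both cite p3.
toy_transitions = {"p1": {"p3": 1.0}, "p2": {"p3": 0.7, "p1": 0.3}}
toy_prior = {"p1": 1 / 3, "p2": 1 / 3, "p3": 1 / 3}
toy_scores = pagerank_with_priors(toy_transitions, toy_prior)
```

In a topic-specific setting, one such run would be performed per topic, with the prior and edge probabilities derived from that topic's distributions.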
Detecting offensive tweets via topical feature discovery over a large scale Twitter corpus 1980-1984
  Guang Xiang; Bin Fan; Ling Wang; Jason Hong; Carolyn Rose
In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, our approach achieves a true positive rate (TP) of 75.1% over 4029 testing tweets using Logistic Regression, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline at about 3.77%. Our approach provides an alternative to large scale hand annotation efforts required by fully supervised learning approaches.
Automatic query expansion based on tag recommendation 1985-1989
  Vitor Oliveira; Guilherme Gomes; Fabiano Belém; Wladmir Brandão; Jussara Almeida; Nivio Ziviani; Marcos Gonçalves
We here propose a new method for expanding entity related queries that automatically filters, weights and ranks candidate expansion terms extracted from Wikipedia articles related to the original query. Our method is based on state-of-the-art tag recommendation methods that exploit heuristic metrics to estimate the descriptive capacity of a given term. Originally proposed for the context of tags, we here apply these recommendation methods to weight and rank terms extracted from multiple fields of Wikipedia articles according to their relevance for the article. We evaluate our method comparing it against three state-of-the-art baselines in three collections. Our results indicate that our method outperforms all baselines in all collections, with relative gains in MAP of up to 14% against the best ones.
The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web 1990-1994
  Karl Gyllstrom; Carsten Eickhoff; Arjen P. de Vries; Marie-Francine Moens
The continued development and maturation of advanced HTML features such as Cascading Style Sheets (CSS), JavaScript, and AJAX, as well as their widespread adoption by browsers, has enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page's representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but which are nonetheless still contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web, and that the IR community should work toward more sophisticated web page processing in indexing technology.
You should read this! let me explain you why: explaining news recommendations to users 1995-1999
  Roi Blanco; Diego Ceccarelli; Claudio Lucchese; Raffaele Perego; Fabrizio Silvestri
Recommender systems have become ubiquitous in content-based web applications, from news to shopping sites. Nonetheless, an aspect that has been largely overlooked so far in the recommender system literature is that of automatically building explanations for a particular recommendation. This paper focuses on the news domain, and proposes to enhance the effectiveness of news recommender systems by adding, to each recommendation, an explanatory statement to help the user better understand if, and why, the item may be of interest to her. We consider the news recommender system as a black box, and generate different types of explanations employing pieces of information associated with the news. In particular, we engineer text-based, entity-based, and usage-based explanations, and make use of Markov Logic Networks to rank the explanations on the basis of their effectiveness. The assessment of the model is conducted via a user study on a dataset of news read consecutively by actual users. Experiments show that news recommender systems can greatly benefit from our explanation module, as it allows users to discriminate between interesting and uninteresting news in the majority of cases.
Characterizing web search queries that match very few or no results 2000-2004
  Ismail Sengor Altingovde; Roi Blanco; Berkant Barla Cambazoglu; Rifat Ozcan; Erdem Sarigil; Özgür Ulusoy
Despite the continuous efforts to improve the web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading web search engines. In this work, we provide a detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the major search engines in handling them.
A unified optimization framework for auction and guaranteed delivery in online advertising 2005-2009
  Konstantin Salomatin; Tie-Yan Liu; Yiming Yang
This paper proposes a new unified optimization framework combining pay-per-click auctions and guaranteed delivery in sponsored search. Advertisers usually have different (and sometimes mixed) marketing goals: brand awareness and direct response. Different mechanisms are good at addressing different goals, e.g., guaranteed delivery has often been used to build brand awareness and pay-per-click auctions have been widely used for direct marketing. Our new method accommodates both in a unified framework, with the search engine revenue as an optimization objective. In this way, we can target a guaranteed number of ad clicks (or impressions) per campaign for advertisers willing to pay a premium and enable keyword auctions for all others. Specifically, we formulate this joint optimization problem using linear programming and a column generation strategy for efficiency. To select the best column (a ranked list of ads) given a query, we propose a novel dynamic programming algorithm that takes the special structure of the ad allocation and pricing mechanisms into account. We have tested the proposed framework and the algorithms on real ad data obtained from a commercial search engine. The results demonstrate that our proposed approach can outperform several baselines in guaranteeing the number of clicks for the given advertisers, and in increasing the total revenue for the search engine.
Query recommendation for children 2010-2014
  Sergio Duarte Torres; Djoerd Hiemstra; Ingmar Weber; Pavel Serdyukov
One of the biggest problems that children experience while searching the web occurs during the query formulation process. Children have been found to struggle with formulating keyword-based queries, given their limited vocabulary and their difficulty in choosing the right keywords.
   In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach to bias a random walk defined on a bipartite graph of web resources and tags through keywords that are more commonly used to describe resources for children.
   We evaluate our method using a large query log sample of queries aimed at retrieving information for children. We show that our method outperforms query suggestions of state-of-the-art search engines and state-of-the-art query suggestions based on random walks.
Modeling browsing behavior for click analysis in sponsored search 2015-2019
  Azin Ashkan; Charles L. A. Clarke
Clickthrough rate provides a fundamental measure of advertising quality, which is widely used in ad selection strategies. However, ads placed in contexts where they are rarely viewed, or where users are unlikely to be interested in commercial results, may receive few clicks regardless of their quality. In this paper, we gain insight into user browsing and click behavior for the purpose of click analysis in the sponsored search domain. The list of ads displayed on a page, the user's initial motivation to browse this list, and the persistence of the user are among the contextual factors considered in this paper. We propose a probabilistic model of users' browsing and click behavior using these contextual factors. To evaluate the performance of the model, we compare it with state-of-the-art methods. The experimental results confirm that these contextual factors can better reflect user browsing and click behavior in sponsored search.
Sentiment-focused web crawling 2020-2024
  A. Gural Vural; B. Barla Cambazoglu; Pinar Senkul
The sentiments and opinions that are expressed in web pages towards objects, entities, and products constitute an important portion of the textual content available in the Web. Despite the vast interest in sentiment analysis and opinion mining, somewhat surprisingly, the discovery of the sentimental or opinionated web content is mostly ignored. This work aims to fill this gap and address the problem of quickly discovering and fetching the sentimental content present in the Web. To this end, we design a sentiment-focused web crawling framework for faster discovery and retrieval of such content. In particular, we propose different sentiment-focused web crawling strategies that prioritize discovered URLs based on their predicted sentiment scores. Through simulations, these strategies are shown to achieve considerable performance improvement over general-purpose web crawling strategies in discovering sentimental content.
User guided entity similarity search using meta-path selection in heterogeneous information networks 2025-2029
  Xiao Yu; Yizhou Sun; Brandon Norick; Tiancheng Mao; Jiawei Han
With the emergence of web-based social and information applications, entity similarity search in information networks, aiming to find entities with high similarity to a given query entity, has gained wide attention. However, due to the diverse semantic meanings in heterogeneous information networks, which contain multi-typed entities and relationships, similarity measurement can be ambiguous without context. In this paper, we investigate entity similarity search and the resulting ambiguity problems in heterogeneous information networks. We propose to use a meta-path-based ranking model ensemble to represent semantic meanings for similarity queries, and exploit the possibility of using user guidance to understand users' queries. Experiments on real-world datasets show that our framework significantly outperforms competitor methods.
User activity profiling with multi-layer analysis 2030-2034
  Hongxia Jin
In this paper, we are interested in discovering semantically meaningful communities from a single user's perspective. We define a multi-layer analysis problem to derive a user's activity profile. Such an activity profile would include what activity areas a user is involved with, how important each activity is to the user, and who else is involved with the user on each activity as well as each participant's participation level. We believe a semantically meaningful community (corresponding to an activity area) must also consider the topics of the social messages rather than only the social links. While it is possible to use a hybrid approach based on traditional topic modeling, in this paper we propose a unified user modeling approach based on direct clustering over the social messages, taking into consideration both social connections and topics of social messages. Our clustering algorithm can be performed in a unified way in an unsupervised fashion as well as in a semi-supervised fashion when the user wants to give our algorithm some seeding inputs reflecting his viewpoints. Moreover, when new data arrive, our algorithm can perform incremental updates on the new data without re-clustering the old data. Our experiments on social media datasets available from both within an enterprise and a public social network demonstrate the effectiveness of our approach.
GTE: a distributional second-order co-occurrence approach to improve the identification of top relevant dates in web snippets 2035-2039
  Ricardo Campos; Gaël Dias; Alípio Jorge; Célia Nunes
In this paper, we present an approach to identify top relevant dates in Web snippets with respect to a given implicit temporal query. Our approach is two-fold. First, we propose a generic temporal similarity measure called GTE, which evaluates the temporal similarity between a query and a date. Second, we propose a classification model to accurately relate relevant dates to their corresponding query terms and withdraw irrelevant ones. We suggest two different solutions: a threshold-based classification strategy and a supervised classifier based on a combination of multiple similarity measures. We evaluate both strategies over a set of real-world text queries and compare the performance of our Web snippet approach with a query log approach over the same set of queries. Experiments show that determining the most relevant dates of any given implicit temporal query can be improved with GTE combined with the second order similarity measure InfoSimba, the Dice coefficient and the threshold-based strategy compared to (1) first-order similarity measures and (2) the query log based approach.
Stochastic simulation of time-biased gain 2040-2044
  Mark D. Smucker; Charles L. A. Clarke
Time-biased gain provides a unifying framework for information retrieval evaluation, generalizing many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures. By using time as a basis for calibration against actual user data, time-biased gain can reflect aspects of the search process that directly impact user experience, including document length, near-duplicate documents, and summaries. Unlike traditional measures, which must be arbitrarily normalized for averaging purposes, time-biased gain is reported in meaningful units, such as the total number of relevant documents seen by the user. In prior work, we proposed and validated a closed-form equation for estimating time-biased gain, explored its properties, and compared it to standard approaches. In this paper, we use stochastic simulation to numerically approximate time-biased gain. Stochastic simulation provides greater flexibility that will allow us, in future work, to easily accommodate different types of user behavior and increase the realism of the effectiveness measure.
SonetRank: leveraging social networks to personalize search 2045-2049
  Abhijith Kashyap; Reza Amini; Vagelis Hristidis
Earlier works on personalized Web search focused on the click-through graphs, while recent works leverage social annotations, which are often unavailable. On the other hand, many users are members of the social networks and subscribe to social groups. Intuitively, users in the same group may have similar relevance judgments for queries related to these groups. SonetRank utilizes this observation to personalize the Web search results based on the aggregate relevance feedback of the users in similar groups. SonetRank builds and maintains a rich graph-based model, termed Social Aware Search Graph, consisting of groups, users, queries and results click-through information. SonetRank's personalization scheme learns in a principled way to leverage the following three signals, of decreasing strength: the personal document preferences of the user, of the users of her social groups relevant to the query, and of the other users in the network. SonetRank also uses a novel approach to measure the amount of personalization with respect to a user and a query, based on the query-specific richness of the user's social profile. We evaluate SonetRank with users on Amazon Mechanical Turk and show a significant improvement in ranking compared to state-of-the-art techniques.
Predicting web search success with fine-grained interaction data 2050-2054
  Qi Guo; Dmitry Lagun; Eugene Agichtein
Detecting and predicting searcher success is essential for automatically evaluating and improving Web search engine performance. In the past, Web searcher behavior data, such as result clickthrough, dwell time, and query reformulation sequences, have been successfully used for a variety of tasks, including prediction of success in a search session. However, the effectiveness of the previous approaches has been limited, as they tend to ignore how searchers actually view and interact with the visited pages. We show that fine-grained interactions, such as mouse cursor movements and scrolling, provide additional clues for better predicting success of a search session as a whole. To this end, we identify patterns of examination and interaction behavior that correspond to search success, and design a new Fine-grained Session Behavior (FSB) model to capture these patterns. Our experimental results show that FSB is significantly more effective than the state-of-the-art approaches that do not use these additional interaction data.
Multi-session re-search: in pursuit of repetition and diversification 2055-2059
  Sarah K. Tyler; Yi Zhang
Search engine users regularly re-issue queries that are the same as or similar to ones they have previously issued. In this paper we study this act of query re-issuing, called re-search, focusing on multi-session re-searching from an information seeking perspective. By focusing on the series of repeat or similar queries where the user shows a continued interest, new patterns of behavior not previously seen arise. We find that the well-studied re-finding behavior is only a piece of the re-search puzzle, and that even amidst repeated re-findings users exhibit diversification and novelty seeking behaviors for many re-search queries. This suggests diversity and re-finding behaviors should be jointly modeled and captured in evaluation measures, instead of being studied as two separate problems as is seen in many previous approaches.
Mining sentiment terminology through time 2060-2064
  Hadi Amiri; Tat-Seng Chua
The correspondence between sentiment terminology and the active language used for expressing opinions is a crucial prerequisite for effective sentiment analysis. Mining sentiment terminology includes the detection of new opinion words as well as inferring their polarities. In this paper, we first propose a novel approach based on the interchangeability characteristic of words to detect new opinion words through time. We then show that the current non-time-based polarity inference approaches may assign opposite polarity to the same opinion word at different times. To tackle this issue, we consider the polarity scores computed at different times as polarity evidences (with the possibility of flawed evidences) and combine them to compute a globally correct polarity score for each opinion word. The experiments show that our approach is effective both in terms of the quality of the discovered new opinion words as well as its ability in inferring their polarities through time. Furthermore, we show the application of mining sentiment terminology through time in the sentiment classification (SC) task. The experiments show that mining more recent new opinion words leads to greater improvement in the performance of SC. To the best of our knowledge, this is the first work that investigates "time" as an important factor in mining sentiment terminology.
Theme chronicle model: chronicle consists of timestamp and topical words over each theme 2065-2069
  Noriaki Kawamae
This paper presents a topic model that discovers the correlation patterns in a given time-stamped document collection and how these patterns evolve over time. Our proposal, the theme chronicle model (TCM) divides traditional topics into temporal and stable topics to detect the change of each theme over time; previous topic models ignore these differences and characterize trends as merely bursts of topics.
   TCM introduces a theme topic (stable topic), a trend topic (temporal topic), timestamps, and a latent switch variable in each token to realize these differences. Its topic layers allow TCM to capture not only word co-occurrence patterns in each theme, but also word co-occurrence patterns at any given time in each theme as trends. Experiments on various data sets show that the proposed model is useful as a generative model to discover fine-grained tightly coherent topics, takes advantage of previous models, and then assigns values for new documents.
Fast top-k similarity queries via matrix compression 2070-2074
  Yucheng Low; Alice X. Zheng
In this paper, we propose a novel method to efficiently compute the top-K most similar items given a query item, where similarity is defined by the set of items that have the highest vector inner products with the query. The task is related to the classical k-Nearest-Neighbor problem, and is widely applicable in a number of domains such as information retrieval, online advertising and collaborative filtering. Our method assumes an in-memory representation of the dataset and is designed to scale to query lengths of 100,000s of terms. Our algorithm uses a generalized Hölder's inequality to upper bound the inner product with the norms of the constituent vectors. We also propose a novel compression scheme that computes bounds for groups of candidate items, thereby speeding up computation and minimizing memory requirements per query. We conduct extensive experiments on the publicly available Wikipedia dataset, and demonstrate that, with a memory overhead of 21%, our method can provide 1-3 orders of magnitude improvement in query run-time compared to naive methods and state-of-the-art competing methods. Our median top-10 word query time is 25 µs on 7.5 million words and 2.3 million documents.
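The pruning idea behind norm-based upper bounds can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses the ordinary Hölder inequality, |q·x| <= ||q||_p ||x||_r with 1/p + 1/r = 1, rather than the generalized form, and it omits the compression scheme entirely. The function names and the toy vectors are assumptions.

```python
import heapq
import math


def norm(vector, p):
    """p-norm of a vector; p may be math.inf for the max norm."""
    if math.isinf(p):
        return max(abs(x) for x in vector)
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def top_k_inner_product(query, items, k, p=2.0):
    """Return ([(score, index), ...], number of exact inner products computed)."""
    # Conjugate exponent r satisfying 1/p + 1/r = 1.
    if p == 1.0:
        r = math.inf
    elif math.isinf(p):
        r = 1.0
    else:
        r = p / (p - 1.0)
    q_norm = norm(query, p)
    # Upper bound on each inner product; visit candidates from largest bound down.
    bounds = sorted(((q_norm * norm(x, r), i) for i, x in enumerate(items)),
                    reverse=True)
    heap = []  # min-heap holding the best k (score, index) pairs so far
    evaluated = 0
    for bound, i in bounds:
        # No remaining candidate can beat the current k-th best score.
        if len(heap) == k and bound <= heap[0][0]:
            break
        score = sum(a * b for a, b in zip(query, items[i]))
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, i))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, i))
    return sorted(heap, reverse=True), evaluated


toy_query = [1.0, 2.0, 0.0]
toy_items = [[1, 0, 0], [0, 3, 0], [2, 1, 1], [0, 0, 5], [0.1, 0.1, 0.1]]
toy_top, toy_evaluated = top_k_inner_product(toy_query, toy_items, k=2)
```

Sorting by bound lets the loop stop early, so exact inner products are computed only for candidates whose bound still exceeds the current k-th best score.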

Databases short paper session

Top-k retrieval using conditional preference networks 2075-2079
  Hongbing Wang; Xuan Zhou; Wujin Chen; Peisheng Ma
This paper considers top-k retrieval using Conditional Preference Networks (CP-Nets). As a model for expressing user preferences on multiple mutually correlated attributes, the CP-Net is of great interest for decision support systems. However, little work has addressed how to conduct efficient data retrieval using CP-Nets. This paper presents an approach to efficiently retrieve the most preferred data items based on a user's CP-Net. The proposed approach consists of a top-k algorithm and an indexing scheme. We conducted extensive experiments to compare our approach against a baseline top-k method -- sequential scan. The results show that our approach outperforms sequential scan in several circumstances.
Sort-based query-adaptive loading of R-trees 2080-2084
  Daniar Achakeev; Bernhard Seeger; Peter Widmayer
Bulk-loading of R-trees has been an important problem in academia and industry for more than twenty years. Current algorithms create R-trees without any information about the expected query profile. However, query profiles are extremely useful for the design of efficient indexes. In this paper, we address this deficiency and present query-adaptive algorithms for building R-trees optimally designed for a given query profile. Since optimal R-tree loading is NP-hard (even without tuning the structure to a query profile), we provide efficient, easy to implement heuristics. Our sort-based algorithms for query-adaptive loading consist of two steps: First, sorting orders are identified resulting in better R-trees than those obtained from standard space-filling curves. Second, for a given sorting order, we propose a dynamic programming algorithm for generating R-trees in linear runtime. Our experimental results confirm that our algorithms generally create significantly better R-trees than the ones obtained from standard sort-based loading algorithms, even when the query profile is unknown.
Efficient logging for enterprise workloads on column-oriented in-memory databases 2085-2089
  Johannes Wust; Joos-Hendrick Boese; Frank Renkes; Sebastian Blessing; Jens Krueger; Hasso Plattner
The introduction of a 64-bit address space in commodity operating systems and the constant drop in hardware prices have made large main memory capacities, in the order of terabytes, technically feasible and economically viable. Column-oriented in-memory databases in particular are a promising platform to improve data management for enterprise applications. As in-memory databases hold the primary persistence in volatile memory, some form of recovery mechanism is required to prevent potential data loss in case of failures. Two desirable characteristics of any recovery mechanism are (1) that it has a minimal impact on the running system, and (2) that the system recovers quickly and without any data loss after a failure. This paper introduces an efficient logging mechanism for dictionary-compressed column structures that addresses these two characteristics by (1) reducing the overall log size by writing dictionary-compressed values and (2) allowing for parallel writing and reading of log files. We demonstrate the efficiency of our logging approach by comparing the resulting log-file size with traditional logical logging on a workload produced by a productive enterprise system.
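The idea of logging dictionary-compressed values can be illustrated with a small sketch. The record layout, the class and function names, and the use of JSON lines are all assumptions for illustration; the abstract does not specify the paper's log format, and parallel writing and reading are not modeled here.

```python
import json


class DictionaryLog:
    """Hypothetical log for one dictionary-compressed column."""

    def __init__(self):
        self.dictionary = {}  # value -> value id
        self.entries = []     # serialized log records, in write order

    def log_insert(self, row_id, value):
        # A value is written in full only the first time it appears.
        if value not in self.dictionary:
            value_id = len(self.dictionary)
            self.dictionary[value] = value_id
            self.entries.append(
                json.dumps({"op": "dict", "id": value_id, "value": value}))
        # Every insert record carries only the compact value id.
        self.entries.append(
            json.dumps({"op": "insert", "row": row_id,
                        "id": self.dictionary[value]}))


def replay(entries):
    """Rebuild the column contents {row_id: value} from the log records."""
    id_to_value = {}
    column = {}
    for line in entries:
        record = json.loads(line)
        if record["op"] == "dict":
            id_to_value[record["id"]] = record["value"]
        else:
            column[record["row"]] = id_to_value[record["id"]]
    return column


toy_log = DictionaryLog()
for row, city in enumerate(["Berlin", "Potsdam", "Berlin", "Berlin"]):
    toy_log.log_insert(row, city)
toy_recovered = replay(toy_log.entries)
```

With many repeated values, most records contain a small integer instead of the full value, which is where the log-size reduction in this sketch comes from.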
Schema-free structured querying of DBpedia data 2090-2093
  Lushan Han; Tim Finin; Anupam Joshi
We need better ways to query large linked data collections such as DBpedia. Using the SPARQL query language requires not only mastering its syntax but also understanding the RDF data model, large ontology vocabularies and URIs for denoting entities. Natural language interface systems address the problem, but are still subjects of research. We describe a compromise in which non-experts specify a graphical query "skeleton" and annotate it with freely chosen words, phrases and entity names. The combination reduces ambiguity and allows the generation of an interpretation that can be translated into SPARQL. Key research contributions are the robust methods that combine statistical association and semantic similarity to map user terms to the most appropriate classes and properties in the underlying ontology.
Discovering conditional inclusion dependencies 2094-2098
  Jana Bauckmann; Ziawasch Abedjan; Ulf Leser; Heiko Müller; Felix Naumann
Data dependencies are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. Conditional dependencies have been introduced to analyze and improve data quality. A conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
Diversifying query results on semi-structured data 2099-2103
  Mahbub Hasan; Abdullah Mueen; Vassilis Tsotras; Eamonn Keogh
Queries on the web can easily result in a large number of results. Result Diversification, a process by which the query provides a set of the k most diverse matches, enables the user to better understand/explore such large results. Computing the diverse subset from a large set of results needs a massive number of pair-wise distance computations as well as finding the subset that maximizes the total pair-wise distance, which is NP-hard and requires an efficient approximation algorithm.
   The problem becomes more difficult when querying semi-structured data, since diversity can occur not only in the document content but also (and more importantly) in the document structure; thus one needs to efficiently measure the structural differences between results. The tree edit distance is the standard choice but is too expensive for large result sets. Moreover, the generalized tree edit distance ignores the context of the query and also the content of the documents, resulting in poor diversification. We present a novel algorithm for meaningful diversification that considers both the structural context of the query and the content of the matched results while computing pair-wise distances. Our algorithm is an order of magnitude faster than the tree edit distance with an elegant worst case guarantee.
   We also present a novel algorithm that finds the top-k diverse subset of matches in time linear on the size of the result-set. We experimentally demonstrate the utility of our algorithms as a plugin for standard query processors without introducing large error and latency to the output.
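   As an illustration of the max-sum diversification objective described above, the following is a minimal sketch of a common greedy heuristic. It is not the linear-time algorithm of the paper; the function name `greedy_diverse_subset` and the use of Euclidean distance between 2-D points are assumptions made only for the example.

```python
import math
from itertools import combinations

def greedy_diverse_subset(results, k, dist):
    """Greedy heuristic for picking k mutually distant results.

    Seeds with the farthest pair, then repeatedly adds the result whose
    summed distance to the already chosen results is largest.
    """
    if k <= 0 or not results:
        return []
    if k == 1 or len(results) == 1:
        return [results[0]]
    # Seed with the pair that is farthest apart (quadratic scan).
    i, j = max(combinations(range(len(results)), 2),
               key=lambda p: dist(results[p[0]], results[p[1]]))
    chosen = [i, j]
    remaining = set(range(len(results))) - {i, j}
    while len(chosen) < k and remaining:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda r: sum(dist(results[r], results[c]) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return [results[c] for c in chosen]

# Hypothetical results embedded as 2-D points.
points = [(0, 0), (10, 0), (0, 10), (1, 1), (9, 1)]
diverse = greedy_diverse_subset(points, 3, math.dist)
```

   The seed step alone needs a quadratic number of distance computations, which is the cost the abstract identifies as the bottleneck for large result sets.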
LINDA: distributed web-of-data-scale entity matching BIBAFull-Text 2104-2108
  Christoph Böhm; Gerard de Melo; Felix Naumann; Gerhard Weikum
Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the cross-linkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating "sameAs" links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments confirm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.
SliceSort: efficient sorting of hierarchical data BIBAFull-Text 2109-2113
  Quoc Trung Tran; Chee-Yong Chan
Sorting is a fundamental operation in data processing. While the problem of sorting flat data records has been extensively studied, there is very little work on sorting hierarchical data such as XML documents. Existing hierarchy-aware sorting approaches for hierarchical data are based on creating sorted subtrees as initial sorted runs and merging sorted subtrees to create the sorted output using either explicit pointers or absolute node key comparisons for merging subtrees. In this paper, we propose SliceSort, a novel, level-wise sorting technique for hierarchical data that avoids the drawbacks of subtree-based sorting techniques. Our experimental performance evaluation shows that SliceSort outperforms the state-of-the-art approach, HErMeS, by up to 27%.
Efficient buffer management for piecewise linear representation of multiple data streams BIBAFull-Text 2114-2118
  Qing Xie; Jia Zhu; Mohamed A. Sharaf; Xiaofang Zhou; Chaoyi Pang
Piecewise Linear Representation (PLR) has been a widely used method for approximating data streams in the form of compact line segments. The buffer-based approach to PLR enables a semi-global approximation which relies on the aggregated processing of batches of streamed data so as to adjust and improve the approximation results. However, one challenge in applying the buffer-based approach is allocating the necessary memory resources for stream buffering. This challenge is further complicated in a multi-stream environment where multiple data streams are competing for the available memory resources, especially in resource-constrained systems such as sensors and mobile devices.
   In this paper, we address precisely those challenges mentioned above and propose efficient buffer management techniques for the PLR of multiple data streams. In particular, we propose a new dynamic approach called Dynamic Buffer Management with Error Monitoring (DBMEM), which leverages the relationship between the buffer demands of each data stream and its exhibited pattern of data values towards estimating its sufficient buffer size. This enables DBMEM to provide a global buffer allocation strategy that maximizes the overall PLR approximation quality for multiple data streams as shown by our experimental results.
On skyline groups BIBAFull-Text 2119-2123
  Chengkai Li; Nan Zhang; Naeemul Hassan; Sundaresan Rajasekaran; Gautam Das
We formulate and investigate the novel problem of finding the skyline k-tuple groups from an n-tuple dataset -- i.e., groups of k tuples which are not dominated by any other group of equal size, based on an aggregate-based group dominance relationship. The major technical challenge is to identify effective anti-monotonic properties for pruning the search space of skyline groups. To this end, we show that the anti-monotonic property in the well-known Apriori algorithm does not hold for skyline group pruning. We then identify an order-specific property which applies to SUM, MIN, and MAX and a weak candidate-generation property which applies to MIN and MAX only. Experimental results on both real and synthetic datasets verify that the proposed algorithms achieve orders of magnitude performance gain over a baseline method.
Finding the optimal path over multi-cost graphs BIBAFull-Text 2124-2128
  Yajun Yang; Jeffrey Xu Yu; Hong Gao; Jianzhong Li
Shortest path query is an important problem in graphs and has been well-studied. However, most approaches for shortest path query are based on single-cost (weight) graphs. In this paper, we introduce the definition of a multi-cost graph and study a novel query: the optimal path query over multi-cost graphs. We propose a best-first branch and bound search algorithm with two optimizing strategies. Furthermore, we propose a novel index named the k-cluster index to make our method more space and time efficient for large graphs. We discuss how to construct and utilize the k-cluster index. We confirm the effectiveness and efficiency of our algorithms using real-life datasets in experiments.
An efficient index for massive IOT data in cloud environment BIBAFull-Text 2129-2133
  Youzhong Ma; Jia Rao; Weisong Hu; Xiaofeng Meng; Xu Han; Yu Zhang; Yunpeng Chai; Chunqiu Liu
The Internet of Things (IOT) has been widely applied in many fields. IOT data are typically large in volume, frequently updated and inherently multi-dimensional, and these characteristics pose big challenges to traditional DBMSs. Although traditional DBMSs have rich functionality and can handle multi-attribute access efficiently, they do not scale well enough to handle large data volumes and cannot support high insert throughput. Cloud-based database systems have good scalability, but they do not support multi-dimensional access natively. In order to deal with the large volume of IOT data, we propose an update- and query-efficient index framework (UQE-Index) based on a key-value store that can support high insert throughput and provide efficient multi-dimensional queries simultaneously. We implemented a prototype based on HBase and conducted comprehensive experiments to test our solution's scalability and efficiency.
Clustering Wikipedia infoboxes to discover their types BIBAFull-Text 2134-2138
  Thanh Hoang Nguyen; Huong Dieu Nguyen; Viviane Moreira; Juliana Freire
Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.
CloST: a hadoop-based storage system for big spatio-temporal data analytics BIBAFull-Text 2139-2143
  Haoyu Tan; Wuman Luo; Lionel M. Ni
During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. In this paper, we present the design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop. The main objective of CloST is to avoid scanning the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model which gives special treatment to three core attributes: an object id, a location and a time. Based on this data model, CloST hierarchically partitions data using all core attributes, which enables efficient parallel processing of spatio-temporal range scans. According to the data characteristics, we devise a compact storage structure which reduces the storage size by an order of magnitude. In addition, we propose scalable bulk loading algorithms capable of incrementally adding new data into the system. We conduct our experiments using a very large GPS log dataset and the results show that CloST has fast data loading speed, desirable scalability in query processing, as well as a high data compression ratio.
Keyword-based k-nearest neighbor search in spatial databases BIBAFull-Text 2144-2148
  Guoliang Li; Jing Xu; Jianhua Feng
With the ever-increasing number of spatio-textual objects, many applications need to find objects close to a given query point in spatial databases. In this paper, we study the problem of keyword-based k-nearest neighbor search in spatial databases, which, given a query point and a set of keywords, finds the k nearest neighbors of the query point that contain all query keywords. To efficiently answer such queries, we propose a new indexing framework by integrating a spatial component and a textual component, which can efficiently prune the search space in terms of both spatial information and textual descriptions. We develop effective index structures and pruning techniques to improve query performance. Experimental results show that our approach significantly outperforms state-of-the-art methods.
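The query semantics stated above can be written down as a linear scan. The sketch below is only a baseline that defines the expected answer; it uses no index and is not the framework of the paper. The record layout (`loc`, `terms`) and the function name `keyword_knn` are assumptions for illustration.

```python
import heapq
import math

def keyword_knn(objects, query_point, keywords, k):
    """Return the k objects nearest to query_point that contain all keywords.

    Linear scan: filter on the textual predicate, then keep the k closest.
    """
    required = set(keywords)
    candidates = (o for o in objects if required <= set(o["terms"]))
    return heapq.nsmallest(
        k, candidates, key=lambda o: math.dist(o["loc"], query_point))

# Hypothetical spatio-textual objects.
places = [
    {"name": "A", "loc": (1.0, 1.0), "terms": {"coffee", "wifi"}},
    {"name": "B", "loc": (0.5, 0.5), "terms": {"coffee"}},
    {"name": "C", "loc": (3.0, 3.0), "terms": {"coffee", "wifi", "books"}},
    {"name": "D", "loc": (9.0, 9.0), "terms": {"wifi", "coffee"}},
]
nearest = keyword_knn(places, (0.0, 0.0), ["coffee", "wifi"], 2)
```

Object B is the closest point overall but is excluded because it lacks one of the query keywords, which is the behavior an index must preserve while pruning.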
Credibility-based product ranking for C2C transactions BIBAFull-Text 2149-2153
  Rong Zhang; Chao Feng Sha; Min Qi Zhou; Ao Ying Zhou
A fundamental issue for C2C transactions is how to rank the products based on the reviews written by the previous customers. In this paper, we present an approach to improve product ranking by tackling the noisy ratings that exist in practical systems. The first problem is the credibility of the customers. We design an iterative algorithm to measure the customer credibility. In the algorithm, we use a feedback strategy to increase or decrease the customer credibility. We increase the credibility for a customer if the customer gives a high (low) score to a good (bad) product and decrease the value if the customer gives a low (high) score to a good (bad) product. The second problem is the inconsistency between the review comments and scores. To deal with it, we train a classifier on a training set that is constructed automatically. The trained classifier is used to predict the scores of the comments. Finally, we calculate the scores of products by considering the customer credibility and the predicted scores. The experimental results show that our proposed approach provides better product ranking than the baseline systems.
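The feedback strategy for customer credibility can be sketched as a simple fixed-step iteration. Everything concrete here is an assumption made for illustration: scores are scaled to [0, 1], a product counts as "good" when its credibility-weighted mean score is at least 0.5, and credibility moves by a constant step. The abstract does not give the actual update rule, and the classifier for review comments is omitted.

```python
def rank_products(ratings, rounds=20, step=0.1):
    """ratings: list of (customer, product, score) with score in [0, 1].

    Returns (products ordered best first, credibility per customer).
    """
    cred = {c: 0.5 for c, _, _ in ratings}  # neutral starting credibility
    quality = {}
    for _ in range(rounds):
        # Product quality: credibility-weighted mean of its scores.
        num, den = {}, {}
        for c, p, s in ratings:
            num[p] = num.get(p, 0.0) + cred[c] * s
            den[p] = den.get(p, 0.0) + cred[c]
        quality = {p: (num[p] / den[p] if den[p] > 0 else 0.5) for p in num}
        # Feedback: raise credibility when a score agrees with the current
        # good/bad verdict for the product, lower it otherwise.
        for c, p, s in ratings:
            agrees = (s >= 0.5) == (quality[p] >= 0.5)
            delta = step if agrees else -step
            cred[c] = min(1.0, max(0.0, cred[c] + delta))
    ranking = sorted(quality, key=quality.get, reverse=True)
    return ranking, cred

# Three consistent customers and one noisy customer ("d").
sample = [
    ("a", "G", 0.9), ("a", "B", 0.2),
    ("b", "G", 0.8), ("b", "B", 0.1),
    ("c", "G", 1.0), ("c", "B", 0.3),
    ("d", "G", 0.1), ("d", "B", 0.9),
]
ranking, credibility = rank_products(sample)
```

In this toy input the noisy customer's credibility falls toward zero, so the final product scores are driven by the three consistent customers.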
Location selection for utility maximization with capacity constraints BIBAFull-Text 2154-2158
  Yu Sun; Jin Huang; Yueguo Chen; Rui Zhang; Xiaoyong Du
Given a set of client locations, a set of facility locations where each facility has a service capacity, and the assumptions that: (i) a client seeks service from its nearest facility; (ii) a facility provides service to clients in the order of their proximity, we study the problem of selecting all possible locations such that setting up a new facility with a given capacity at these locations will maximize the number of served clients. This problem has wide applications in practice, such as setting up new distribution centers for online sales business and building additional base stations for mobile subscribers. We formulate the problem as a location selection query for utility maximization. After applying three pruning rules to a baseline solution, we obtain an efficient algorithm to answer the query. Extensive experiments confirm the efficiency of our proposed algorithm.
Efficient estimation of dynamic density functions with an application to outlier detection BIBAFull-Text 2159-2163
  Abdulhakim Ali Qahtan; Xiangliang Zhang; Suojin Wang
In this paper, we propose a new method to estimate the dynamic density over data streams, named KDE-Track as it is based on a conventional and widely used Kernel Density Estimation (KDE) method. KDE-Track can efficiently estimate the density with linear complexity by using interpolation on a kernel model, which is incrementally updated upon the arrival of streaming data. Both theoretical analysis and experimental validation show that KDE-Track outperforms traditional KDE and a baseline method, Cluster-Kernels, on estimation accuracy of the complex density structures in data streams, computing time and memory usage. KDE-Track is also shown to capture the dynamic density of synthetic and real-world data in a timely manner. In addition, KDE-Track is used to accurately detect outliers in sensor data and compared with two existing methods developed for detecting outliers and cleaning sensor data.
A positional access method for relational databases BIBAFull-Text 2164-2168
  Dongzhe Ma; Jianhua Feng; Guoliang Li
Most commercial database management systems sort tuples of a relation by their primary keys for the purpose of supporting efficient insertions, deletions, and updates. However, primary keys are usually auto-generated integers, which bear little useful information about user data. Secondary indexes have to be created sometimes to help retrieve tuples by columns other than the primary key. Evidently, a better solution is to sort the data by columns that appear frequently in retrieval conditions. Unfortunately, this method does not work, at least not immediately, when the relation is vertically partitioned, which is a popular technique to reduce I/O overhead, since it is difficult to keep tuples of two partitions in exactly the same order unless the sorting columns are replicated, which again wastes storage space and disk bandwidth unnecessarily. In this paper, we introduce a positional access method that allows a partition to be sorted by another one but incurs little storage overhead and provide details about how to improve its performance.
Real-time aggregate monitoring with differential privacy BIBAFull-Text 2169-2173
  Liyue Fan; Li Xiong
Sharing real-time aggregate statistics of private data has given much benefit to the public to perform data mining for understanding important phenomena, such as influenza outbreaks and traffic congestion. However, releasing time-series data with a standard differential privacy mechanism has limited utility due to high correlation between data values. We propose FAST, an adaptive system to release real-time aggregate statistics under differential privacy with improved utility. To minimize overall privacy cost, FAST adaptively samples long time-series according to detected data dynamics. To improve the accuracy of data release per time stamp, filtering is used to predict data values at non-sampling points and to estimate true values from noisy observations at sampling points. Our experiments with three real data sets confirm that FAST improves the accuracy of time-series release and has excellent performance even under very small privacy cost.
Efficient distributed locality sensitive hashing BIBAFull-Text 2174-2178
  Bahman Bahmani; Ashish Goel; Rajendra Shinde
Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring a network call per hash bucket look up, also a big network load. Panigrahy's Entropy LSH scheme significantly reduces the space requirement but does not help with (and in fact worsens) the search network efficiency. In this paper, focusing on the Euclidean space under the ℓ2 norm and building on Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost, while maintaining a good load balance between different machines. Our experiments also verify our theoretical results.
Author-conference topic-connection model for academic network search BIBAFull-Text 2179-2183
  Jianwen Wang; Xiaohua Hu; Xinhui Tu; Tingting He
This paper proposes a novel topic model, the Author-Conference Topic-Connection (ACTC) Model, for academic network search. The ACTC Model extends the author-conference-topic (ACT) model by adding the subject of the conference and the latent mapping information between subjects and topics. It simultaneously models topical aspects of papers, authors and conferences with two latent topic layers: a subject layer corresponding to conference topic, and a topic layer corresponding to the word topic. Each author would be associated with a multinomial distribution over subjects of conference (e.g., KM, DB, IR for CIKM 2012), the conference (CIKM 2012), and the topics are respectively generated from a sampled subject. Then the words are generated from the sampled topics. We conduct experiments on a data set with 8,523 authors, 22,487 papers and 1,243 conferences from the well-known Arnetminer website, and train the model with different numbers of subjects and topics. For a qualitative evaluation, we compare ACTC with three other models, LDA, Author-Topic (AT) and ACT, in academic search services. Experiments show that ACTC can effectively capture the semantic connection between different types of information in the academic network and perform well in expert searching and conference searching.
Impact neighborhood indexing (INI) in diffusion graphs BIBAFull-Text 2184-2188
  Jung Hyun Kim; K. Selçuk Candan; Maria Luisa Sapino
A graph neighborhood consists of a set of nodes that are nearby or otherwise related to each other. While existing definitions consider the structure (or topology) of the graph, we note that they fail to take into account the information propagation and diffusion characteristics, such as decay and reinforcement, common in many networks. In this paper, we first define the propagation efficiency of nodes and edges. We use this to introduce the novel concept of the zero-erasure (or impact) neighborhood (ZEN) of a given node, n, consisting of the set of nodes that receive information from (or are impacted by) n without any decay. Based on this, we present an impact neighborhood indexing (INI) algorithm that creates data structures to help quickly identify the impact neighborhood of any given node. Experimental results confirm the efficiency and effectiveness of the proposed INI algorithms.
Loyalty-based selection: retrieving objects that persistently satisfy criteria BIBAFull-Text 2189-2193
  Zhitao Shen; Muhammad Aamir Cheema; Xuemin Lin
A traditional query returns a set of objects that satisfy user defined criteria at the time the query was issued. The results are based on the values of objects at query time and may be affected by outliers. Intuitively, an object better meets the user's needs if it persistently satisfies the criteria, i.e., it satisfies the criteria for the majority of the time in the past T time units. In this paper, we propose a measure named loyalty that reflects how persistently an object satisfies the criteria. Formally, the loyalty of an object is the total time (in the past T time units) it satisfies the query criteria. We study top-k loyalty queries over sliding windows that continuously report the k objects with the highest loyalties. Each object issues an update when it starts satisfying the criteria or when it stops satisfying the criteria. We show that the lower bound cost of updating the results of a top-k loyalty query is O(log N), for each object update, where N is the number of updates issued in the last T time units. We conduct a detailed complexity analysis and show that our proposed algorithm is optimal. Moreover, effective pruning techniques are proposed to improve the efficiency. We experimentally verify the effectiveness of the proposed approach by comparing it with a classic sweep line algorithm.
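The loyalty measure defined above can be computed directly from the start/stop updates of each object. The sketch below recomputes every loyalty from scratch at query time, so it illustrates the definition only and not the incremental algorithm of the paper. The representation of an object's history as a list of (start, stop) intervals, with stop set to None while the object still satisfies the criteria, is an assumption.

```python
import heapq

def loyalty(intervals, now, T):
    """Total time within [now - T, now] during which the criteria held."""
    window_start = now - T
    total = 0.0
    for start, stop in intervals:
        if stop is None:  # still satisfying the criteria
            stop = now
        overlap = min(stop, now) - max(start, window_start)
        if overlap > 0:
            total += overlap
    return total

def top_k_loyal(history, now, T, k):
    """Return the k (object, loyalty) pairs with the highest loyalty."""
    scores = {obj: loyalty(iv, now, T) for obj, iv in history.items()}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical update history: object -> list of (start, stop) intervals.
history = {
    "a": [(0, 4), (6, None)],
    "b": [(8, None)],
    "c": [(1, 3)],
}
top2 = top_k_loyal(history, now=10, T=8, k=2)
```

With now = 10 and T = 8 the window is [2, 10]: object "a" contributes 2 + 4 time units, "b" contributes 2, and "c" contributes 1.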
Star-Join: spatio-textual similarity join BIBAFull-Text 2194-2198
  Sitong Liu; Guoliang Li; Jianhua Feng
Location-based services have attracted significant attention due to modern mobile phones equipped with GPS devices. These services generate large amounts of spatio-textual data which contain both spatial location and textual descriptions. Since a spatio-textual object may have different representations, possibly because of deviations of GPS or different user descriptions, it calls for efficient methods to integrate spatio-textual data from different sources. In this paper we study a new research problem called spatio-textual similarity join: given two sets of spatio-textual objects, we find the similar object pairs. To the best of our knowledge, we are the first to study this problem. We make the following contributions: (1) We develop a filter-and-refine framework and devise several efficient algorithms. We first generate spatial and textual signatures for the objects and build an inverted index on top of these signatures. Then we generate candidate pairs using the inverted lists of signatures. Finally we refine the candidates and generate the final result. (2) We study how to generate high-quality signatures for spatial information. We develop an MBR-prefix based signature to prune large numbers of dissimilar object pairs. (3) Experimental results on real and synthetic datasets show that our algorithms achieve high performance and scale well.
Adapt: adaptive database schema design for multi-tenant applications BIBAFull-Text 2199-2203
  Jiacai Ni; Guoliang Li; Jun Zhang; Lei Li; Jianhua Feng
Multi-tenant data management is a major application of Software as a Service (SaaS). Many companies outsource their data to a third party which hosts a multi-tenant database system to provide data management service. The system should have high performance, low space and excellent scalability. One big challenge is to devise a high-quality database schema. Independent Tables Shared Instances and Shared Tables Shared Instances are two state-of-the-art methods. However, the former has poor scalability, while the latter achieves good scalability at the expense of poor performance and high space overhead. In this paper, we trade off between the two methods and propose an adaptive database schema design approach to achieve good scalability and high performance with low space. To this end, we identify the important attributes and use them to generate a base table. For other attributes, we construct supplementary tables. We propose a cost-based model to adaptively generate the tables above. Our method has the following advantages. First, our method achieves high scalability. Second, our method can trade off performance and space requirement. Third, our method can be easily applied to existing databases (e.g., MySQL) with minor revisions. Fourth, our method can adapt to any schemas and query workloads. Experimental results show our method achieves high performance and good scalability with low space and outperforms state-of-the-art methods.
Optimizing data migration for cloud-based key-value stores BIBAFull-Text 2204-2208
  Xiulei Qin; Wenbo Zhang; Wei Wang; Jun Wei; Xin Zhao; Tao Huang
As one database offloading strategy, elastic key-value stores are often introduced to speed up the application performance with dynamic scalability. Since the workload varies, efficient data migration with minimal impact on service is critical for elasticity and scalability. However, due to the new virtualization technology and real-time, low-latency requirements, data migration within cloud-based key-value stores has to face new challenges: effects of VM interference, and the need to trade off between the two ingredients of migration cost, namely migration time and performance impact. To address these challenges, in this paper we explore a new approach to optimize the data migration. Explicitly, we build two interference-aware models to predict the migration time and performance impact for each migration action using statistical machine learning, and then create a cost model to strike a balance between the two ingredients. Using the load rebalancing scenario as a case study, we have designed one cost-aware migration algorithm that utilizes the cost model to guide the choice of possible migration actions. Finally, we demonstrate the effectiveness of the approach using the Yahoo! Cloud Serving Benchmark (YCSB).
Applying weighted queries on probabilistic databases BIBAFull-Text 2209-2213
  Sebastian Lehrack
Relational queries applied on probabilistic databases have been established as a powerful tool for accessing huge data sets of uncertain data. Often various parts of such queries have different significances for a specific user. Thus, a query language should allow us to give subqueries different weights to quantify the individual user preferences. In this work we introduce a theoretical foundation for weighted algebra operators on probabilistic databases within a SQL-like query language.
A new tool for multi-level partitioning in teradata BIBAFull-Text 2214-2218
  Young-Kyoon Suh; Ahmad Ghazal; Alain Crolotte; Pekka Kostamaa
This paper introduces a new tool that recommends an optimized partitioning solution called Multi-Level Partitioned Primary Index (MLPPI) for a fact table based on the queries in the workload. The tool implements a new technique using a greedy algorithm for search space enumeration. The space is driven by predicates in the queries. This technique fits the Teradata MLPPI scheme very well, as it is based on a general framework using general expressions, ranges and case expressions for partition definitions. The cost model implemented in the tool is based on the Teradata optimizer, and it is used to prune the search space for reaching a final solution. The tool resides completely on the client, and interfaces with the database through APIs as opposed to previous work that requires optimizer code extension. The APIs are used to simplify the workload queries, and to capture fact table predicates and costs necessary to make the recommendation. The predicate-driven method implemented by the tool is general, and it can be applied to any clustering or partitioning scheme based on simple field expressions or complex SQL predicates. Experimental results for a particular workload show that the recommendation from the tool outperforms a human expert. The experiments also show that the solution is scalable both with the workload complexity and the size of the fact table.
Fast PCA computation in a DBMS with aggregate UDFs and LAPACK BIBAFull-Text 2219-2223
  Carlos Ordonez; Naveen Mohanam; Carlos Garcia-Alvarado; Predrag T. Tosic; Edgar Martinez
Efficient and scalable execution of numerical methods inside a DBMS is difficult as its architecture is not suited for intense numerical computations. We study computing Principal Component Analysis (PCA) on large data sets via Singular Value Decomposition (SVD). Given the difficulty to program and optimize numerical methods on an existing DBMS, we explore an alternative reusability approach: calling the well-known numerical library LAPACK. Thus we study several alternatives to summarize the data set with aggregate User-Defined Functions (UDFs) and how to efficiently call SVD numerical methods available in LAPACK via Stored Procedures (SPs). We propose algorithmic and system optimizations to enhance scalability and to push processing into RAM. We show it is feasible to efficiently solve PCA by first summarizing the data set with arrays incrementally updated with aggregate UDFs and then pushing heavy matrix processing in SVD to RAM calling LAPACK via SPs. We benchmark our solution on a modern DBMS. Our solution requires only one pass on the data set and it exhibits linear scalability.
Scaling multiple-source entity resolution using statistically efficient transfer learning BIBAFull-Text 2224-2228
  Sahand N. Negahban; Benjamin I. P. Rubinstein; Jim Gemmell
We consider a serious, previously-unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores for each pair of sources. While there exists a rich literature describing almost all aspects of pairwise ER, this new challenge is arising now due to the unprecedented ability to acquire and store data from online sources, interest in features driven by ER such as enriched search verticals, and the uniqueness of noisy and missing data characteristics for each source. We show on real-world and synthetic data that for state-of-the-art techniques, the reality of heterogeneous sources means that the number of labeled training data must scale quadratically in the number of sources, just to maintain constant precision/recall. We address this challenge with a brand new transfer learning algorithm which requires far less training data (or equivalently, achieves superior accuracy with the same data) and is trained using fast convex optimization. The intuition behind our approach is to adaptively share structure learned about one scoring problem with all other scoring problems sharing a data source in common. We demonstrate that our theoretically-motivated approach improves upon existing techniques for multi-source ER.
A probabilistic approach to correlation queries in uncertain time series data BIBAFull-Text 2229-2233
  Mahsa Orang; Nematollaah Shiri
Numerous real-life applications, such as wireless sensor networks and location-based services, generate large amounts of uncertain time series, where the exact value at each timestamp is unavailable or unknown. In this paper, we formalize the notion of correlation for uncertain time series data and consider a family of probabilistic, threshold-based correlation queries over such data. The proposed formulation extends the notion of correlation developed for standard, certain time series. We show that uncertain correlation is a random variable approaching a normal distribution. We also formalize the notion of uncertain time series normalization, which is at the core of our correlation query processing approach and also proves to be an important pre-processing technique, in particular for pattern discovery tasks. The results of our numerous experiments indicate that, unlike in the standard time series, there is a trade-off between false alarms and hit ratios, which can be controlled by the probability threshold provided by users. Our results also offer users a guideline for choosing proper threshold values.
On bundle configuration for viral marketing in social networks BIBAFull-Text 2234-2238
  De-Nian Yang; Wang-Chien Lee; Nai-Hui Chia; Mao Ye; Hui-Ju Hung
Prior research on viral marketing mostly focuses on promoting one single product item. In this work, we explore the idea of bundling multiple items for viral marketing and formulate a new research problem, called Bundle Configuration for SpreAd Maximization (BCSAM). Efficiently obtaining an optimal product bundle under the setting of BCSAM is very challenging. Aiming to strike a balance between the quality of solution and the computational overhead, we systematically explore various heuristics to develop a suite of algorithms, including κ-Bundle Configuration and Aggregated Bundle Configuration. Moreover, we integrate all the proposed ideas into one efficient algorithm, called Aggregated Bundle Configuration (ABC). Finally, we conduct an extensive performance evaluation on our proposals. Experimental results show that ABC significantly outperforms its counterpart and two baseline approaches in terms of both computational overhead and bundle quality.

Knowledge management poster session

Learning to rank for hybrid recommendation BIBAFull-Text 2239-2242
  Jiankai Sun; Shuaiqiang Wang; Byron J. Gao; Jun Ma
Most existing recommender systems can be classified into two categories: collaborative filtering and content-based filtering. Hybrid recommender systems combine the advantages of the two for improved recommendation performance. Traditional recommender systems are rating-based. However, predicting ratings is an intermediate step towards their ultimate goal of generating rankings or recommendation lists. Learning to rank is an established means of predicting rankings and has recently demonstrated high promise in improving quality of recommendations. In this paper, we propose LRHR, the first attempt that adapts learning to rank to hybrid recommender systems. LRHR first defines novel representations for both users and items so that they can be content-comparable. Then, LRHR identifies a set of novel meta-level features for learning purposes. Finally, LRHR adopts RankSVM, a pairwise learning to rank algorithm, to generate recommendation lists of items for users. Extensive experiments on benchmarks in comparison with the state-of-the-art algorithms demonstrate the performance gain of our approach.
Importance weighted passive learning BIBAFull-Text 2243-2246
  Shuaiqiang Wang; Xiaoming Xi; Yilong Yin
Importance weighted active learning (IWAL) introduces a weighting scheme to measure the importance of each instance for correcting the sampling bias of the probability distributions between training and test datasets. However, the weighting scheme of IWAL involves the distribution of the test data, which can be straightforwardly estimated in active learning by interactively querying users for labels of selected test instances, but is difficult to estimate in conventional learning where there are no interactions with users, referred to as passive learning. In this paper, we investigate the insufficient sampling bias problem, i.e., bias occurs only because of insufficient samples, but the sampling process is unbiased. In doing this, we present two assumptions on the sampling bias, based on which we propose a practical weighting scheme for the empirical loss function in conventional passive learning, and present IWPL, an importance weighted passive learning framework. Furthermore, we provide IWSVM, an importance weighted SVM for validation. Extensive experiments demonstrate significant advantages of IWSVM on benchmarks and synthetic datasets.
A tag-centric discriminative model for web objects classification BIBAFull-Text 2247-2250
  Lina Yao; Quan Z. Sheng
This paper studies the web object classification problem with the novel exploration of social tags. Web objects are increasingly annotated with human interpretable labels (i.e., tags), which can be considered as an auxiliary attribute to assist the object classification. Automatically classifying web objects into manageable semantic categories has long been a fundamental pre-process for indexing, browsing, searching, and mining heterogeneous web objects. However, such heterogeneous web objects often suffer from a lack of easily extractable and uniform descriptive features. In this paper, we propose a discriminative tag-centric model for web object classification by jointly modeling the objects' category labels and their corresponding social tags and un-coding the relevance among social tags. Our approach is based on recent techniques for learning large-scale discriminative models. We conduct experiments to validate our approach using real-life data. The results show the feasibility and good performance of our approach.
Outlier detection using centrality and center-proximity BIBAFull-Text 2251-2254
  Duck-Ho Bae; Seo Jeong; Sang-Wook Kim; Minsoo Lee
An outlier is an object that is considerably dissimilar to the remainder of the dataset. In this paper, we first propose the notion of centrality and center-proximity as novel outlierness measures which can be considered to represent the characteristics of all of the objects in the dataset. We then propose a graph-based outlier detection method which can solve the problems of local density, micro-cluster, and fringe objects. Finally, through extensive experiments, we show the effectiveness of the proposed method.
An effective category classification method based on a language model for question category recommendation on a cQA service BIBAFull-Text 2255-2258
  Kyoungman Bae; Youngjoong Ko
Classifying a user's question into several topics helps respondents answer the question in a cQA service. The word weighting method must estimate the appropriate weight of a word to improve the category (or topic) classification. In this paper, we propose a novel, effective word weighting method based on a language model for automatic category classification in the cQA service. We first calculate the occurrence probability of a word in each category by using a language model; the final weight of each word is then estimated as the ratio of the occurrence probability of the word in a category to the occurrence probability of the word in the other categories. As a result, the proposed method significantly improves the performance of the category classification.
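The ratio-based weighting described in this abstract can be illustrated with a minimal Python sketch. It assumes a unigram language model with add-one smoothing; the function name `category_word_weights`, the `smoothing` parameter, and the toy data are hypothetical, and the paper's actual language model and estimation details may differ.

```python
from collections import Counter


def category_word_weights(docs_by_category, smoothing=1.0):
    """Weight each word by P(word | category) / P(word | other categories).

    docs_by_category maps a category name to a list of tokenized questions.
    Probabilities come from a smoothed unigram model (an assumption here).
    """
    counts = {
        cat: Counter(tok for doc in docs for tok in doc)
        for cat, docs in docs_by_category.items()
    }
    vocab = set().union(*counts.values())
    vocab_size = len(vocab)

    def prob(word, cats):
        # Smoothed occurrence probability of `word` over the pooled categories.
        num = sum(counts[c][word] for c in cats) + smoothing
        den = sum(sum(counts[c].values()) for c in cats) + smoothing * vocab_size
        return num / den

    weights = {}
    for cat in counts:
        others = [c for c in counts if c != cat]
        weights[cat] = {w: prob(w, [cat]) / prob(w, others) for w in vocab}
    return weights


if __name__ == "__main__":
    toy = {
        "sports": [["goal", "match"], ["goal"]],
        "tech": [["code", "match"]],
    }
    for cat, ws in category_word_weights(toy).items():
        print(cat, sorted(ws.items()))
```

A weight above 1 marks a word that is more typical of the category than of the rest; a weight below 1 marks the opposite.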
Clustering short text using Ncut-weighted non-negative matrix factorization BIBAFull-Text 2259-2262
  Xiaohui Yan; Jiafeng Guo; Shenghua Liu; Xue-qi Cheng; Yanfeng Wang
Non-negative matrix factorization (NMF) has been successfully applied in document clustering. However, experiments on short texts, such as microblogs, Q&A documents and news titles, suggest unsatisfactory performance of NMF. A major reason is that the traditional term weighting schemes, like binary weight and tf-idf, cannot capture the terms' discriminative power and importance in short texts well, due to the sparsity of data. To tackle this problem, we propose a novel term weighting scheme for NMF, derived from the Normalized Cut (Ncut) problem on the term affinity graph. Different from idf, which emphasizes discriminability on the document level, the Ncut weighting measures terms' discriminability on the term level. Experiments on two data sets show our weighting scheme significantly boosts NMF's performance on short text clustering.
Polygene-based evolution: a novel framework for evolutionary algorithms BIBAFull-Text 2263-2266
  Shuaiqiang Wang; Byron J. Gao; Shuangling Wang; Guibao Cao; Yilong Yin
In this paper, we introduce polygene-based evolution, a novel framework for evolutionary algorithms (EAs) that features distinctive operations in the evolution process. In traditional EAs, the primitive evolution unit is gene, where genes are independent components during evolution. In polygene-based evolutionary algorithms (PGEAs), the evolution unit is polygene, i.e., a set of co-regulated genes. Discovering and maintaining quality polygenes can play an effective role in evolving quality individuals. Polygenes generalize genes, and PGEAs generalize EAs. Implementing the PGEA framework involves three phases: polygene discovery, polygene planting, and polygene-compatible evolution. Extensive experiments on function optimization benchmarks in comparison with the conventional and state-of-the-art EAs demonstrate the potential of the approach in accuracy and efficiency improvement.
A tensor encoding model for semantic processing BIBAFull-Text 2267-2270
  Michael Symonds; Peter D. Bruza; Laurianne Sitbon; Ian Turner
This paper develops and evaluates an enhanced corpus based approach for semantic processing. Corpus based models that build representations of words directly from text do not require pre-existing linguistic knowledge, and have demonstrated psychologically relevant performance on a number of cognitive tasks. However, they have been criticised in the past for not incorporating sufficient structural information. Using ideas underpinning recent attempts to overcome this weakness, we develop an enhanced tensor encoding model to build representations of word meaning for semantic processing. Our enhanced model demonstrates superior performance when compared to a robust baseline model on a number of semantic processing tasks.
Accelerating locality preserving nonnegative matrix factorization BIBAFull-Text 2271-2274
  Guanhong Yao; Cai Deng
Matrix factorization techniques have been frequently applied in information retrieval, computer vision and pattern recognition. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. Locality Preserving Non-negative Matrix Factorization (LPNMF) is a recently proposed graph-based NMF extension which tries to preserve the intrinsic geometric structure of the data. Compared with the original NMF, LPNMF has more discriminating power on data representation thanks to its geometrical interpretation and outstanding ability to discover the hidden topics. However, the computational complexity of LPNMF is O(n³), where n is the number of samples. In this paper, we propose a novel approach called Accelerated LPNMF (A-LPNMF) to solve the computational issue of LPNMF. Specifically, A-LPNMF selects p (p ≪ n) landmark points from the data and represents all the samples as sparse linear combinations of these landmarks. The non-negative factors which incorporate the geometric structure can then be efficiently computed. Experimental results on real data sets demonstrate the effectiveness and efficiency of our proposed method.
The twitaholic next door: scalable friend recommender system using a concept-sensitive hash function BIBAFull-Text 2275-2278
  Patrick Bamba; Julien Subercaze; Christophe Gravier; Nabil Benmira; Jimi Fontaine
In this paper we present a Friend Recommender System for micro-blogging. Traditional batch processing of massive amounts of data makes it difficult to provide a near-real time friend recommender system or even a system that can properly scale to millions of users. In order to overcome these issues, we have designed a solution that represents user-generated micro posts as a set of pseudo-cliques. These graphs are assigned a hash value using an original Concept-Sensitive Hash function, a new sub-kind of Locally-Sensitive Hash functions. Finally, since the user profiles are represented as a binary footprint, the pairwise comparison of footprints using the Hamming distance provides scalability to the recommender system. The paper is accompanied by an online application relying on a large Twitter dataset, so that the reader can freely experiment with the system.
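The final comparison step in this abstract, matching binary footprints by Hamming distance, can be sketched in a few lines of Python. This shows only that step: the Concept-Sensitive Hash itself is not reproduced, and the 8-bit footprints, user names, and function names below are made up for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded footprints differ."""
    return bin(a ^ b).count("1")


def nearest_users(target, footprints, k=3):
    """Return the k users whose footprints are closest to the target's."""
    scored = sorted(
        (hamming_distance(footprints[target], fp), user)
        for user, fp in footprints.items()
        if user != target
    )
    return [user for _, user in scored[:k]]


if __name__ == "__main__":
    # Hypothetical footprints; a real system would derive them from the hash.
    footprints = {
        "alice": 0b10110010,
        "bob": 0b10110011,
        "carol": 0b01001101,
    }
    print(nearest_users("alice", footprints, k=2))
```

Because the distance reduces to an XOR and a bit count, each pairwise comparison is cheap, which is the property the abstract relies on for scalability.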
Information propagation in social rating networks BIBAFull-Text 2279-2282
  Priyanka Garg; Irwin King; Michael R. Lyu
The polarity of opinion is a crucial part of information, and ignoring the asymmetry between negative and positive opinions can potentially result in an inaccurate estimation of the number of product adoptions and incorrect recommendations. We analyze the propagation patterns of the negative and positive opinions on two real world datasets, Flixster and Epinions, and observe that the presence of negative opinions significantly reduces the number of expressed opinions. To account for the asymmetry between the two kinds of opinions, we propose extensions of the two most popular information propagation models, the Independent Cascade and Linear Threshold models. The proposed extensions give a tractable influence problem and improve the prediction accuracy of future opinions by more than 3% on the Flixster dataset and 5% on the Epinions dataset.
Maximizing revenue from strategic recommendations under decaying trust BIBAFull-Text 2283-2286
  Paul Dütting; Monika Henzinger; Ingmar Weber
Suppose your sole interest in recommending a product to me is to maximize the amount paid to you by the seller for a sequence of recommendations. How should you recommend optimally if I become more inclined to ignore you with each irrelevant recommendation you make? Finding an answer to this question is a key challenge in all forms of marketing that rely on and explore social ties; ranging from personal recommendations to viral marketing.
   We prove that even if the recommendee regains her initial trust on each successful recommendation, the expected revenue the recommender can make over an infinite period due to payments by the seller is bounded. This can only be overcome when the recommendee also incrementally regains trust during periods without any recommendation. Here, we see a connection to "banner blindness," suggesting that showing fewer ads can lead to a higher long-term revenue.
Weighted linear kernel with tree transformed features for malware detection BIBAFull-Text 2287-2290
  Prakash Mandayam Comar; Lei Liu; Sabyasachi Saha; Antonio Nucci; Pang-Ning Tan
Malware detection from network traffic flows is a challenging problem due to data irregularity issues such as imbalanced class distribution, noise, missing values, and heterogeneous types of features. To address these challenges, this paper presents a two-stage classification approach for malware detection. The framework initially employs random forest as a macro-level classifier to separate the malicious from non-malicious network flows, followed by a collection of one-class support vector machine classifiers to identify the specific type of malware. A novel tree-based feature construction approach is proposed to deal with data imperfection issues. As the performance of the support vector machine classifier often depends on the kernel function used to compute the similarity between every pair of data points, designing an appropriate kernel is essential for accurate identification of malware classes. We present a simple algorithm to construct a weighted linear kernel on the tree transformed features and demonstrate its effectiveness in detecting malware from real network traffic data.
Learning to predict the cost-per-click for your ad words BIBAFull-Text 2291-2294
  Chieh-Jen Wang; Hsin-Hsi Chen
In an Internet ad campaign, the ranking of an ad on search result pages depends on the cost-per-click (CPC) of ad words offered by an advertiser and a quality score estimated by a search engine. Bidding for ad words with a higher CPC is more competitive than bidding for the same ad words with a lower CPC in the ad ranking competition. However, offering a higher CPC increases the burden on advertisers. In contrast, offering a lower CPC may decrease the exposure rate of their ads. Thus, selecting an appropriate CPC for ad words is essential for advertisers. In this paper, we extract different semantic levels of features, such as named entities, topic terminologies, and individual words from a large-scale real-world ad words corpus, and explore various learning based prediction algorithms. The thorough experimental results show that the CPC prediction models considering more ad words semantics achieve better prediction performance, and the prediction model using support vector regression (SVR) and features from all semantic levels performs the best.
Dual word and document seed selection for semi-supervised sentiment classification BIBAFull-Text 2295-2298
  Shengfeng Ju; Shoushan Li; Yan Su; Guodong Zhou; Yu Hong; Xiaojun Li
Semi-supervised sentiment classification aims to train a classifier with a small number of labeled data (called seed data) and a large amount of unlabeled data. A big advantage of this approach is its saving of annotation effort by using the unlabeled data, which is usually freely available. In this paper, we propose an approach to further minimize the annotation effort of semi-supervised sentiment classification by actively selecting the seed data. Specifically, a novel selection strategy is proposed to simultaneously select good words and documents for manual annotation by considering both their annotation costs and informativeness. Experimental results demonstrate the effectiveness of our approach.
On empirical tradeoffs in large scale hierarchical classification BIBAFull-Text 2299-2302
  Rohit Babbar; Ioannis Partalas; Eric Gaussier; Cecile Amblard
While multi-class categorization of documents has been of research interest for over a decade, relatively few approaches have been proposed for large scale taxonomies in which the number of classes ranges from hundreds of thousands, as in Directory Mozilla, to over a million in Wikipedia. As a result of the ever increasing number of text documents and images from various sources, there is an immense need for automatic classification of documents in such large hierarchies. In this paper, we analyze the tradeoffs between the important characteristics of different classifiers employed in a top-down fashion. The properties for relative comparison of these classifiers include (i) accuracy on test instances, (ii) training time, (iii) size of the model, and (iv) test time required for prediction. Our analysis is motivated by the well known error bounds from learning theory, and is further reinforced by empirical observations on the publicly available data from the Large Scale Hierarchical Text Classification Challenge. We show that by exploiting the data heterogeneity across the large scale hierarchies, one can build an overall classification system which is approximately 4 times faster for prediction and 3 times faster to train, while sacrificing only 1 percentage point in accuracy.
An interaction framework of service-oriented ontology learning BIBAFull-Text 2303-2306
  Jingsong Zhang; Yinglin Wang; Hao Wei
Ontology plays a very important role in supporting knowledge-based applications. In cloud computing, ontology learning technology is facing new challenges in dealing with heterogeneous data sources from different domains and researchers, which may contain various particular concepts and relations. Traditional ontology learning frameworks usually focus only on the extraction of concepts and taxonomic relations from the multi-structured corpus. However, prior research has rarely studied the interactions among different researchers during the ontology learning process. Lack of interaction among people who build ontologies in different domains may cause inconsistent ontologies. Besides, lack of incentive during the ontology building process will also result in low efficiency. To address these challenges, this paper specifies a novel solution to perform ontology learning. The solution includes a service-oriented ontology interaction framework and a service-oriented ontology learning strategy. A number of experiments in a demo system show that it advances ontology learning to a higher level of performance and portability.
Infobox suggestion for Wikipedia entities BIBAFull-Text 2307-2310
  Afroza Sultana; Quazi Mainul Hasan; Ashis Kumer Biswas; Soumyava Das; Habibur Rahman; Chris Ding; Chengkai Li
Given the sheer amount of work and expertise required in authoring Wikipedia articles, automatic tools that help Wikipedia contributors in generating and improving content are valuable. This paper presents our initial step towards building a full-fledged author assistant, particularly for suggesting infobox templates for articles. We build SVM classifiers to suggest infobox template types, among a large number of possible types, to Wikipedia articles without infoboxes. Different from prior works on Wikipedia article classification, which deal with only a few label classes for named entity recognition, the much larger 337-class setup in our study is geared towards realistic deployment of an infobox suggestion tool. We also emphasize testing on articles without infoboxes, because labeled and unlabeled data exhibit different distributions of features, which departs from the typical assumption that they are drawn from the same underlying population.
Time feature selection for identifying active household members BIBAFull-Text 2311-2314
  Pedro G. Campos; Alejandro Bellogin; Fernando Díez; Iván Cantador
Popular online rental services such as Netflix and MoviePilot often manage household accounts. A household account is usually shared by various users who live in the same house, but in general does not provide a mechanism by which current active users are identified, and thus leads to considerable difficulties for making effective personalized recommendations. The identification of the active household members, defined as the discrimination of the users from a given household who are interacting with a system (e.g. an on-demand video service), is thus an interesting challenge for the recommender systems research community. In this paper, we formulate the above task as a classification problem, and address it by means of global and local feature selection methods and classifiers that only exploit time features from past item consumption records. The results obtained from a series of experiments on a real dataset show that some of the proposed methods are able to select relevant time features, which allow simple classifiers to accurately identify active members of household accounts.
Text classification with relatively small positive documents and unlabeled data BIBAFull-Text 2315-2318
  Fumiyo Fukumoto; Takeshi Yamamoto; Suguru Matsuyoshi; Yoshimi Suzuki
This paper addresses the problem of dealing with a collection of negative training documents which is suitable for a relatively small number of positive documents, and presents a method for eliminating the need for manually collecting negative training documents based on supervised machine learning techniques. We applied an error correction technique to the results of negative training data obtained by Positive Example Based Learning (PEBL). Moreover, we used a boosting technique to learn a set of negative data to train classifiers. The results using Japanese newspaper documents showed that the method contributes to reducing the cost of manual collection of negative training documents.
On compressing weighted time-evolving graphs BIBAFull-Text 2319-2322
  Wei Liu; Andrey Kan; Jeffrey Chan; James Bailey; Christopher Leckie; Jian Pei; Ramamohanarao Kotagiri
Existing graph compression techniques mostly focus on static graphs. However for many practical graphs such as social networks the edge weights frequently change over time. This phenomenon raises the question of how to compress dynamic graphs while maintaining most of their intrinsic structural patterns at each time snapshot. In this paper we show that the encoding cost of a dynamic graph is proportional to the heterogeneity of a three dimensional tensor that represents the dynamic graph. We propose an effective algorithm that compresses a dynamic graph by reducing the heterogeneity of its tensor representation, and at the same time also maintains a maximum lossy compression error at any time stamp of the dynamic graph. The bounded compression error benefits compressed graphs in that they retain good approximations of the original edge weights, and hence properties of the original graph (such as shortest paths) are well preserved. To the best of our knowledge, this is the first work that compresses weighted dynamic graphs with bounded lossy compression error at any time snapshot of the graph.
Graph-based collective classification for tweets BIBAFull-Text 2323-2326
  Yajuan Duan; Furu Wei; Ming Zhou; Heung-Yeung Shum
In this paper, we address the problem of classifying tweets into topical categories. Because of the short, noisy and ambiguous nature of tweets, we propose to conduct the classification collectively by exploiting context information (i.e. related tweets), rather than individually as in conventional text classification methods. In particular, we augment the content-based representation of text with tweets sharing the same #hashtag or URL, which results in a tweet graph. We then formulate the tweet classification task under a graph optimization framework. We investigate three popular approaches, namely, Loopy Belief Propagation (LBP), Relaxation Labeling (RL), and Iterative Classification Algorithm (ICA). Extensive experimental results show that the graph-based tweet classification approach remarkably improves the performance, while the ICA model with the relationship of sharing the same #hashtag gives the best result on a separate tweet graph.
A word-order based graph representation for relevance identification BIBAFull-Text 2327-2330
  Lakshmi Ramachandran; Edward F. Gehringer
In this paper we propose a new word-order based graph representation for text. In our graph representation vertices represent words or phrases and edges represent relations between contiguous words or phrases. The graph representation also includes dependency information. Our text representation is suitable for applications involving the identification of relevance or paraphrases across texts, where word-order information would be useful. We show that this word-order based graph representation performs better than a dependency tree representation while identifying the relevance of one piece of text to another.
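The core idea of this abstract, vertices for words and edges between contiguous words, can be sketched in Python as follows. This is a simplified illustration only: it omits phrases and dependency information, and the Jaccard overlap of edge sets used for comparison is an assumption chosen for the example, not the relevance measure of the paper. The function names are hypothetical.

```python
def word_order_graph(text):
    """Build a directed graph whose edges link each word to the next one."""
    tokens = text.lower().split()
    vertices = set(tokens)
    edges = set(zip(tokens, tokens[1:]))  # (word, following word) pairs
    return vertices, edges


def edge_overlap(text_a, text_b):
    """Jaccard overlap of the two edge sets (illustrative similarity only)."""
    _, edges_a = word_order_graph(text_a)
    _, edges_b = word_order_graph(text_b)
    if not edges_a or not edges_b:
        return 0.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


if __name__ == "__main__":
    print(edge_overlap("the cat sat", "the cat ran"))
    print(edge_overlap("the cat sat", "sat cat the"))
```

The second comparison uses the same bag of words in reversed order and shares no edges with the first text, which shows how edge direction captures word-order information that a bag-of-words comparison would miss.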
Tracing clusters in evolving graphs with node attributes BIBAFull-Text 2331-2334
  Brigitte Boden; Stephan Günnemann; Thomas Seidl
Data sources representing social networks with additional attribute information about the nodes are widely available in today's applications. Recently, combined clustering methods were introduced that consider graph information and attribute information simultaneously to detect meaningful clusters in such networks. In many cases, such attributed graphs also evolve over time. Therefore, there is a need for clustering methods that are able to trace clusters over different time steps and analyze their evolution over time. In this paper, we extend our combined clustering method DB-CSC to the analysis of evolving combined clusters.
Prediction of retweet cascade size over time BIBAFull-Text 2335-2338
  Andrey Kupavskii; Liudmila Ostroumova; Alexey Umnov; Svyatoslav Usachev; Pavel Serdyukov; Gleb Gusev; Andrey Kustarev
Retweet cascades play an essential role in information diffusion in Twitter. Popular tweets reflect the current trends in Twitter, while Twitter itself is one of the most important online media. Thus, understanding the reasons why a tweet becomes popular is of great interest for sociologists, marketers and social media researchers. What is even more important is the possibility to make a prognosis of a tweet's future popularity. Besides the scientific significance of such a possibility, this sort of prediction has many practical applications, such as breaking news detection and viral marketing. In this paper we try to forecast how many retweets a given tweet will gain during a fixed time period. We train an algorithm that predicts the number of retweets during time T since the initial moment. In addition to a standard set of features we utilize several new ones. One of the most important features is the flow of the cascade. Another one is PageRank on the retweet graph, which can be considered as a measure of the influence of users.
An efficient and simple under-sampling technique for imbalanced time series classification BIBAFull-Text 2339-2342
  Guohua Liang; Chengqi Zhang
Imbalanced time series classification (TSC) involving many real-world applications has increasingly captured attention of researchers. Previous work has proposed an intelligent-structure preserving over-sampling method (SPO), which the authors claimed achieved better performance than other existing over-sampling and state-of-the-art methods in TSC. The main disadvantage of over-sampling methods is that they significantly increase the computational cost of training a classification model due to the addition of new minority class instances to balance data-sets with high dimensional features. These challenging issues have motivated us to find a simple and efficient solution for imbalanced TSC. Statistical tests are applied to validate our conclusions. The experimental results demonstrate that this proposed simple random under-sampling technique with SVM is efficient and can achieve results that compare favorably with the existing complicated SPO method for imbalanced TSC.
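Random under-sampling, the technique named in this abstract, can be sketched with the Python standard library. This is a minimal illustration that balances every class down to the size of the smallest one; the function name, the `seed` parameter, and the balancing target are assumptions, and the SVM that the abstract pairs with the sampling step is omitted.

```python
import random
from collections import defaultdict


def random_under_sample(series, labels, seed=0):
    """Randomly drop instances so every class has as many as the smallest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))  # sample without replacement
    keep.sort()  # preserve the original ordering of the kept instances
    return [series[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    # Ten toy time series: eight majority-class, two minority-class.
    series = [[float(i), float(i + 1), float(i + 2)] for i in range(10)]
    labels = [0] * 8 + [1] * 2
    balanced_series, balanced_labels = random_under_sample(series, labels)
    print(balanced_labels)
```

Unlike over-sampling, this shrinks the training set rather than growing it, which is the source of the computational savings the abstract refers to.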
Top-N recommendation through belief propagation BIBAFull-Text 2343-2346
  Jiwoon Ha; Soon-Hyoung Kwon; Sang-Wook Kim; Christos Faloutsos; Sunju Park
The top-n recommendation focuses on finding the top-n items that the target user is likely to purchase rather than predicting his/her ratings on individual items. In this paper, we propose a novel method that provides top-n recommendation by probabilistically determining the target user's preference on items. This method models the purchasing relationships between users and items as a bipartite graph and employs Belief Propagation to compute the preference of the target user on items. We analyze the proposed method in detail by examining the changes in recommendation accuracy under different parameter settings. We also show that the proposed method is up to 40% more accurate than an existing method by comparing it with an RWR-based method via extensive experiments.
Mining advices from weblogs BIBAFull-Text 2347-2350
  Alfan Farizki Wicaksono; Sung-Hyon Myaeng
Weblogs, one of the fastest growing forms of user-generated content, often contain key learnings gleaned from people's past experiences which are well worth presenting to other people. One of the key learnings contained in weblogs is often expressed in the form of advice. In this paper, we aim to provide a methodology to extract sentences that reveal advice in weblogs. We examined our data to discover the characteristics of advice contained in weblogs. Based on this observation, we define our task as a classification problem using various linguistic features. We show that our proposed method significantly outperforms the baseline. The presence or absence of an imperative mood expression appears to be the most important feature in this task. It is also worth noting that the work presented in this paper is the first attempt at mining advice from English data.
Parallel proximal support vector machine for high-dimensional pattern classification BIBAFull-Text 2351-2354
  Zhenfeng Zhu; Xingquan Zhu; Yangdong Ye; Yue-Fei Guo; Xiangyang Xue
Proximal support vector machine (PSVM) is a simple but effective classifier, especially for solving large-scale data classification problems. An inherent deficiency of PSVM lies in its inefficiency in dealing with high-dimensional data. In this paper, we propose a parallel version of PSVM (PPSVM). Based on random dimensionality partitioning, PPSVM can obtain partitioned local model parameters in parallel, which are combined to form the final global solution. In fact, PPSVM enjoys two properties: 1) it can calculate model parameters in parallel and is therefore a fast learning method with theoretically proved convergence; and 2) it can avoid the inversion of a large matrix, which makes it suitable for high-dimensional data. In the paper, we also propose a random PPSVM with randomly partitioned data in each iteration to improve the performance of PSVM. Experimental results on real-world data demonstrate that the proposed methods can obtain similar or even better prediction accuracy than PSVM with much better runtime efficiency.
On using category experts for improving the performance and accuracy in recommender systems BIBAFull-Text 2355-2358
  Won-Seok Hwang; Ho-Jong Lee; Sang-Wook Kim; Minsoo Lee
A variety of recommendation methods have been proposed to achieve both high performance and high accuracy; however, it is fairly difficult to satisfy both because there is a trade-off between them. In this paper, we introduce the notion of category experts and propose a recommendation method that exploits the ratings of category experts instead of those of the users similar to a target user. We also extend the method to use both the category preference of a target user and his/her similarity to category experts. We show through extensive experiments with real-world data that our method significantly outperforms the existing methods in terms of performance and accuracy.
Finding influential products on social domination game BIBAFull-Text 2359-2362
  Jinyoung Yeo; Jin-woo Park; Seung-won Hwang
In this paper, we propose a new type of market model called the social domination game model. Given a set C of customers and a set P of products, this model simulates market competition among P and estimates market shares, considering both the dominance relation between C and P and the influence relation among the members of C. With this model, we propose a greedy product positioning algorithm for designing a new product that approximately maximizes market share. Our experimental results show that the proposed algorithm creates a new product that gains up to 97.5% of the market share of the best product obtained by the exact method, while significantly outperforming the exact method in terms of running time, i.e., by up to two orders of magnitude.
Entity resolution using search engine results BIBAFull-Text 2363-2366
  Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
Given a set of automatically extracted entities E of size n, we would like to cluster together all the various names referring to the same canonical entity. The variations of each entity include acronyms, full names, and informal naming conventions. We propose using search engine results to cluster variations of each entity based on the URLs appearing in those results. We create a cluster C for each top search result returned by querying for the entity e ∈ E, assigning e to the cluster C. Our experiments on a manually created dataset show that our approach achieves higher precision and recall than string-matching and hierarchical-clustering-based disambiguation methods.
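As a rough illustration of the clustering step described in this abstract, the sketch below groups names by the URL of their top search result. It assumes a single top URL per name, which is a simplification. The `top_results` mapping is hypothetical stand-in data rather than real search engine output, and `cluster_by_top_url` is an illustrative name, not the authors' implementation.

```python
from collections import defaultdict


def cluster_by_top_url(top_results):
    """Group entity names that share the same top search-result URL.

    top_results: dict mapping an entity name to the URL of its top
    search result (hypothetical data standing in for a search engine).
    """
    clusters = defaultdict(list)
    for name, url in top_results.items():
        # Each distinct URL defines one cluster; the name joins it.
        clusters[url].append(name)
    return dict(clusters)


# Hypothetical search results, for illustration only.
top_results = {
    "IBM": "https://www.ibm.com/",
    "International Business Machines": "https://www.ibm.com/",
    "Big Blue": "https://www.ibm.com/",
    "MIT": "https://www.mit.edu/",
}
clusters = cluster_by_top_url(top_results)
```

Names whose top result is the same URL end up in one cluster, so acronyms and full names can be merged without any string similarity between them.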
Tweet classification based on their lifetime duration BIBAFull-Text 2367-2370
  Hikaru Takemura; Keishi Tajima
Many microblog messages remain useful only within a short time, and users often find such a message after its informational value has vanished. Users also sometimes miss old but still useful messages buried among outdated ones. To solve these problems, we develop a method of classifying messages into the following three categories: (1) messages that users should read now because their value will diminish soon, (2) messages that users may read later because their value will not change much soon, and (3) messages that are no longer useful because their value has vanished. Our method uses an error-correcting output code consisting of binary classifiers, each of which determines whether a given message has value at a specific time point. Our experiments on Twitter data confirmed that it outperforms naive methods.
Scalable collaborative filtering using incremental update and local link prediction BIBAFull-Text 2371-2374
  Xiao Yang; Zhaoxin Zhang; Ke Wang
Traditional collaborative filtering approaches have been shown to suffer from two fundamental problems: data sparsity and poor scalability. To address these problems, we present a novel scalable item-based collaborative filtering method that uses incremental update and local link prediction. By subdividing the computations and analyzing the factors in different cases of item-to-item similarity, we design incremental update strategies for item-based CF, which can make the recommender system more efficient and scalable. Based on the transitive structure of the item similarity graph, we use the local link prediction method to find implicit candidates to alleviate the lack of neighbors in predictions and recommendations caused by the sparsity of data. The experimental results validate that our algorithm can improve the performance of traditional CF, and can increase the efficiency of recommendations.
Composing activity groups in social networks BIBAFull-Text 2375-2378
  Cheng-Te Li; Man-Kwan Shan
One important function of current social networking services is allowing users to initiate different kinds of activity groups (e.g. study group, cocktail party, and group buying) and invite friends to attend in either a manual or a collaborative manner. However, this process of group formation is tedious, and could either include inappropriate group members or miss relevant ones. This work proposes to automatically compose the activity groups in a social network according to user-specified activity information. Given the activity host, a set of labels representing the activity's subjects, the desired group size, and a set of persons who must be included, we aim to find a set of individuals as the activity group, in which members are required not only to be familiar with the host but also to communicate well with each other. We devise an approximation algorithm to greedily solve the group composing problem. Experiments on a real social network show the promising effectiveness of the proposed approach, and a human subjective study yields satisfactory results.
A co-training based method for Chinese patent semantic annotation BIBAFull-Text 2379-2382
  Xu Chen; Zhiyong Peng; Cheng Zeng
Patents are public scientific literature protected by law, and their abstracts contain highly valuable information. Semantic annotation of patents can effectively protect intellectual property rights and promote corporations' scientific research innovation. Currently, automatic patent annotation mainly uses supervised machine learning algorithms, which require abundant and expensive labeled patent data. Due to the lack of sufficient labeled Chinese patent data, this paper adopts a semi-supervised machine learning method named co-training, which starts from a small amount of labeled data. This method combines keyword extraction with list extraction, and incrementally annotates functional clauses in patent abstracts. Experimental results indicate that this method can gradually improve recall without sacrificing precision.
Automatic labeling hierarchical topics BIBAFull-Text 2383-2386
  Xian-Ling Mao; Zhao-Yan Ming; Zheng-Jun Zha; Tat-Seng Chua; Hongfei Yan; Xiaoming Li
Recently, statistical topic modeling has been widely applied in text mining and knowledge management due to its powerful ability. A topic, as a probability distribution over words, is usually difficult to understand. A common, major challenge in applying such topic models to other knowledge management problems is to accurately interpret the meaning of each topic. Topic labeling, as a major interpreting method, has attracted significant attention recently. However, previous works simply treat topics individually without considering the hierarchical relation among topics, and less attention has been paid to creating good hierarchical topic descriptors for a hierarchy of topics. In this paper, we propose two effective algorithms that automatically assign concise labels to each topic in a hierarchy by exploiting sibling and parent-child relations among topics. The experimental results show that the inter-topic relation is effective in boosting topic labeling accuracy and the proposed algorithms can generate meaningful topic labels that are useful for interpreting the hierarchical topics.
An unsupervised method for author extraction from web pages containing user-generated content BIBAFull-Text 2387-2390
  Jing Liu; Xinying Song; Jingtian Jiang; Chin-Yew Lin
In this paper, we address the problem of author extraction (AE) from user generated content (UGC) pages. Most existing solutions for web information extraction, including AE, adopt supervised approaches, which require expensive manual annotation. We propose a novel unsupervised approach for automatically collecting and labeling training data based on two key observations of author names: (1) people tend to use a single name across sites if their preferred names are available; (2) people tend to create unique usernames to easily distinguish themselves from others, e.g. travelbug61. Our AE solution only requires features extracted from a single UGC page instead of relying on clues from multiple UGC pages. We conducted extensive experiments. (1) The evaluation of automatically labeled author field data shows 95.0% precision. (2) Our method achieves an F1 score of 96.1%, which significantly outperforms a state-of-the-art supervised approach with single page features (F1 score: 68.4%) and has a comparable performance to its multiple page solution (F1 score: 95.4%). (3) We also examine the robustness of our approach on various UGC pages from forums and review sites, and achieve promising results as well.
Hierarchical target type identification for entity-oriented queries BIBAFull-Text 2391-2394
  Krisztian Balog; Robert Neumayer
A significant portion of information needs in web search target entities. These may come in different forms or flavours, ranging from short keyword queries to more verbose requests, expressed in natural language. We address the task of automatically annotating queries with target types from an ontology. The identified types can subsequently be used, e.g., for creating semantically more informed query and retrieval models, filtering results, or directing the requests to specific verticals. Our study makes the following contributions. First, we formalise the task of hierarchical target type identification, argue that it is best viewed as a ranking problem, and propose multiple evaluation metrics. Second, we develop a purpose-built test collection by hand-annotating over 300 queries, from various recent entity search benchmarking campaigns, with target types from the DBpedia ontology. Finally, we introduce and examine two baseline models, inspired by federated search techniques. We show that these methods perform surprisingly well when target types are limited to a flat list of top level categories; finding the right level of granularity in the hierarchy, however, is particularly challenging and requires further investigation.
Dictionary based sparse representation for domain adaptation BIBAFull-Text 2395-2398
  Rishabh Mehrotra; Rushabh Agrawal; Syed Aqueel Haider
Machine Learning algorithms are often only as good as the data they can learn from. An enormous amount of unlabeled data is readily available, and the ability to use such data efficiently holds significant promise for increasing the performance of various learning tasks. We consider the task of supervised Domain Adaptation and present a Self-Taught learning based framework which makes use of the K-SVD algorithm for learning sparse representation of data in an unsupervised manner. To the best of our knowledge this is the first work that integrates the K-SVD algorithm into the self-taught learning framework. The K-SVD algorithm iteratively alternates between sparse coding of the instances based on the current dictionary and a process of updating/adapting the dictionary to better fit the data so as to achieve a sparse representation under strict sparsity constraints. Using the learnt dictionary, a rich feature representation of the few labeled instances is obtained which is fed to a classifier along with class labels to build the model. We evaluate our framework on the task of domain adaptation for sentiment classification. Both self-domain (requiring very few domain-specific training instances) and cross-domain classification (requiring 0 labeled instances of target domain and very few labeled instances of source domain) are performed. Empirical comparisons of self-domain and cross-domain results establish the efficacy of the proposed framework.

Information retrieval poster session

Selecting expansion terms as a set via integer linear programming BIBAFull-Text 2399-2402
  Qi Zhang; Yan Wu; Xuanjing Huang
Pseudo-relevance feedback via query expansion has been widely studied from various perspectives in the past decades. Its value in improving retrieval effectiveness has been shown in many tasks. A variety of criteria have been proposed to select additional terms for the original queries. However, most of the existing methods weight and select terms individually and do not consider the impact of term-to-term relationships. In this paper, we first examine the influence of combinations of terms through data analysis, which demonstrates the significant effect of term-to-term relationships on retrieval effectiveness. Then, to address this problem, we formalize the query expansion task as an integer linear programming (ILP) problem. The model combines the weights learned from a supervised method for individual terms, and integrates constraints to capture relations between terms. Finally, three standard TREC collections are used to evaluate the proposed method. Experimental results demonstrate that the proposed method can significantly improve the effectiveness of retrieval.
An evaluation and enhancement of densitometric fragmentation for content slicing reuse BIBAFull-Text 2403-2406
  Killian Levacher; Seamus Lawless; Vincent Wade
Content slicing addresses the need of adaptive systems to reuse open corpus material by converting it into re-composable information objects. However, this conversion is highly dependent upon the ability to correctly fragment pages into structurally sound atomic pieces. A recently suggested approach to fragmentation, which relies on densitometric page representation, claims to achieve high accuracy and time performance. Although it has been well received within the research community, a full evaluation of this approach and identification of its strengths and weaknesses across a range of characteristics has not been performed. This paper proposes an independent evaluation of the approach with respect to granularity control, accuracy, time performance, content diversity and linguistic dependency. Moreover, this paper also provides a significant contribution to address important weaknesses discovered during the analysis, in order to improve the suitability and impact of the original algorithm within the context of content slicing.
Mathematical equation retrieval using plain words as a query BIBAFull-Text 2407-2410
  Shinil Kim; Seon Yang; Youngjoong Ko
This paper proposes a method to effectively retrieve mathematical equations when plain words are given as a query. The proposed system requires no complicated mathematical symbols, no particular input tool and no constraints on the query. Users can enter a query in plain words, as in traditional Information Retrieval. For this, we extract features from plain texts that are converted from real math equations. Experimental results show outstanding performance, with an MRR of 0.6585.
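For readers unfamiliar with the metric reported in this abstract, mean reciprocal rank (MRR) averages, over all queries, the reciprocal of the rank at which the first relevant result appears. The sketch below shows the standard computation on made-up ranks; the function name and the sample data are illustrative assumptions and have no connection to the paper's experiments.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Compute MRR from the 1-based rank of the first relevant result
    for each query. Use None when a query returned no relevant result,
    which contributes a reciprocal rank of 0."""
    if not first_relevant_ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Made-up ranks for four queries: (1 + 1/2 + 0 + 1/4) / 4 = 0.4375
ranks = [1, 2, None, 4]
mrr = mean_reciprocal_rank(ranks)
```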
Serial position effects of clicking behavior on result pages returned by search engines BIBAFull-Text 2411-2414
  Mingda Wu; Shan Jiang; Yan Zhang
Under the joint influence of the presentation of search results and users' browsing and clicking habits, the click probability distribution does not merely obey a monotonically decreasing Zipf function. In this paper, we present evidence that the click behavior on the entries of search engines' result pages is influenced by the Serial Position Effect, which is independent of how these entries are ordered, and introduce a new function to characterize the click probability distribution.
Towards measuring the visualness of a concept BIBAFull-Text 2415-2418
  Jin-Woo Jeong; Xin-Jing Wang; Dong-Ho Lee
In this paper, we propose a new method to measure the visualness of a concept. The visualness of a concept is generally defined as the extent to which a concept has visual characteristics. Even though the visualness of a concept is important and useful for various image search tasks, it has not yet received much attention. In this work, we focus especially on how to measure the visualness of a complex concept such as "round table" or "dry bed" rather than a simple concept like "ball" or "apple". To measure the visualness, we first collect sample images of a complex concept using web image search engines, and then group the images based on their visual features. Finally, we compute the visual purity and weighted entropy of the clusters, which act as a visualness score for the concept. Through various experiments, we show and discuss interesting results about the visualness of a concept.
Fast candidate generation for two-phase document ranking: postings list intersection with bloom filters BIBAFull-Text 2419-2422
  Nima Asadi; Jimmy Lin
Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a "cheap" but low-quality scoring function, which is then reranked by an "expensive" but high-quality method (usually machine-learned). This paper focuses on the problem of candidate generation for conjunctive query processing in this context. We describe and evaluate fast, approximate postings list intersection algorithms based on Bloom filters. Due to the power of modern learning-to-rank techniques and emphasis on early precision, significant speedups can be achieved without loss of end-to-end retrieval effectiveness. Explorations reveal a rich design space where effectiveness and efficiency can be balanced in response to specific hardware configurations and application scenarios.
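The general idea of approximate postings list intersection with a Bloom filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithms evaluated in the paper: the `BloomFilter` class, its SHA-256 based hashing, the filter size, and the sample postings lists are all hypothetical choices made for the example.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: membership tests may return false
    positives but never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def approximate_intersection(short_postings, long_filter):
    """Keep doc ids from the short list that the filter says may also
    occur in the long list. The result is a superset of the exact
    intersection."""
    return [doc_id for doc_id in short_postings if doc_id in long_filter]


# Hypothetical postings lists (document ids) for two query terms.
long_postings = list(range(0, 1000, 3))
short_postings = [3, 7, 9, 300, 301, 999]

bloom = BloomFilter(num_bits=8192, num_hashes=4)
for doc_id in long_postings:
    bloom.add(doc_id)

candidates = approximate_intersection(short_postings, bloom)
exact = sorted(set(short_postings) & set(long_postings))
```

Because a Bloom filter has no false negatives, every document in the exact intersection survives as a candidate; any false positives are extra candidates that a second-phase ranker could still score low.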
Semantically coherent image annotation with a learning-based keyword propagation strategy BIBAFull-Text 2423-2426
  Chaoran Cui; Jun Ma; Shuaiqiang Wang; Shuai Gao; Tao Lian
Automatic image annotation plays an important role in modern keyword-based image retrieval systems. Recently, many neighbor-based methods have been proposed and achieved good performance for image annotation. However, existing work mainly focused on exploring a distance metric learning algorithm to determine the neighbors of an image, and neglected the subsequent keyword propagation process. They usually used some simple heuristic propagation rules, and propagated each keyword independently without considering the inherent semantic coherence among keywords. In this paper, we propose a novel learning-based keyword propagation strategy and incorporate it into the neighbor-based method framework. In particular, we employ the structural SVM to learn a scoring function which can evaluate different candidate keyword sets for a test image. Moreover, we explicitly enforce the semantic coherence constraint for the propagated keywords in our approach. The annotation of the test image is propagated as a whole rather than separate keywords. Experiments on two benchmark data sets demonstrate the effectiveness of our approach for image annotation and ranked retrieval.
Language processing for Arabic microblog retrieval BIBAFull-Text 2427-2430
  Kareem Darwish; Walid Magdy; Ahmed Mourad
The use of social media has profoundly affected social and political dynamics in the Arab world. In this paper, we explore Arabic microblog retrieval. We illustrate some of the challenges associated with Arabic microblog retrieval, which mainly stem from the use of different Arabic dialects that vary in lexical selection, morphology, and phonetics and lack orthographic and spelling conventions. We present some of the processing required for effective retrieval, such as improved letter normalization, elongated word handling, stopword removal, and stemming.
Hierarchical image annotation using semantic hierarchies BIBAFull-Text 2431-2434
  Hichem Bannour; Céline Hudelot
Semantic hierarchies have been introduced recently to improve image annotation. They have been used as a framework for hierarchical image classification, and thus to improve classifier accuracy and reduce the complexity of managing large-scale data. In this paper, we investigate the contribution of semantic hierarchies to hierarchical image classification. We first propose a new method based on the hierarchy structure to train hierarchical classifiers efficiently. Our method, named One-Versus-Opposite-Nodes, allows the problem to be decomposed into several independent tasks and therefore scales well to large databases. We also propose two methods for computing a hierarchical decision function that serves to annotate new image samples. The first is performed by top-down classifier voting, while the second is based on bottom-up score fusion. The experiments on the Pascal VOC'2010 dataset showed that our methods improve the image annotation results.
On the inference of average precision from score distributions BIBAFull-Text 2435-2438
  Ronan Cummins
Modelling the document scores returned from an IR system for a given query using parameterised score distributions is an area of research that has become more popular in recent years. Score distribution (SD) models are useful for a number of IR tasks. These include data fusion, query performance prediction, determining thresholds in filtering applications, and tasks in the area of distributed retrieval. The inference of performance metrics, such as average precision, from these SD models is an important consideration. In this paper, we study the accuracy of a number of methods of inferring average precision from an SD model.
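As background for the metric this abstract refers to, the sketch below computes average precision directly from a ranked list of binary relevance labels. This is the conventional ranked-list calculation, not the score-distribution inference methods studied in the paper. The function name and sample labels are illustrative, and the example assumes that every relevant document appears somewhere in the ranked list unless `num_relevant` is given.

```python
def average_precision(relevance, num_relevant=None):
    """Average precision for one ranked list.

    relevance: sequence of 0/1 labels in rank order (best rank first).
    num_relevant: total number of relevant documents for the query;
    defaults to the number of relevant documents in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    denominator = hits if num_relevant is None else num_relevant
    return precision_sum / denominator if denominator else 0.0


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
```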
An evaluation of corpus-driven measures of medical concept similarity for information retrieval BIBAFull-Text 2439-2442
  Bevan Koopman; Guido Zuccon; Peter Bruza; Laurianne Sitbon; Michael Lawley
Measures of semantic similarity between medical concepts are central to a number of techniques in medical informatics, including query expansion in medical information retrieval. Previous work has mainly considered thesaurus-based path measures of semantic similarity and has not compared different corpus-driven approaches in depth. We evaluate the effectiveness of eight common corpus-driven measures in capturing semantic relatedness and compare these against human judged concept pairs assessed by medical professionals. Our results show that certain corpus-driven measures correlate strongly (approx 0.8) with human judgements. An important finding is that performance was significantly affected by the choice of corpus used in priming the measure, i.e., used as evidence from which corpus-driven similarities are drawn. This paper provides guidelines for the implementation of semantic similarity measures for medical informatics and concludes with implications for medical information retrieval.
A constraint to automatically regulate document-length normalisation BIBAFull-Text 2443-2446
  Ronan Cummins; Colm O'Riordan
Retrieval functions in information retrieval (IR) are fundamental to the effectiveness of search systems. However, considerable parameter tuning is often needed to increase the effectiveness of the retrieval. Document length normalisation is one such aspect that requires tuning on a per-query and per-collection basis for many retrieval functions.
   In this paper, we develop an approach that regularises the level of normalisation to apply on a per-query basis. We formally describe the interaction between query-terms and document length normalisation using a constraint. We then develop a general pre-retrieval approach to adapt a number of state-of-the-art ranking functions so that they adhere to the constraint.
   Finally, we empirically demonstrate that the adapted retrieval functions outperform default versions of the original retrieval functions, and perform at least comparably to tuned versions of the original functions, on a number of datasets. Essentially this regulates the normalisation parameter in a number of retrieval functions on a per-query basis in a principled manner.
Bridging offline and online social graph dynamics BIBAFull-Text 2447-2450
  Manuel Gomez Rodriguez; Monica Rogati
The online and offline worlds are converging. Location-based services, ubiquitous mobile devices and on-the-go social network accessibility are blurring the distinction between in-person activities and their virtual counterpart. An important effect of this convergence is the rapid and powerful impact of offline events (meetings, conferences) on the evolution and temporal dynamics of the online connectivity between members of social and professional networks. However, these effects have been largely unexplored.
   We study these effects by using data from LinkedIn, a popular professional social networking site. We find that offline events may induce connectivity changes in the online network -- there is a dramatic increase in the number of connections between event attendees shortly after the date of the event. Building on these insights, we describe a non-supervised method that exploits connectivity changes temporally correlated to real world events to successfully infer more than 40% of specific event attendees. Finally, we revisit the link prediction problem by including user contributed information about off-line events to achieve higher link prediction performance.
Predicting the performance of passage retrieval for question answering BIBAFull-Text 2451-2454
  Eyal Krikon; David Carmel; Oren Kurland
We present a novel approach to predicting the performance of passage retrieval for question answering. That is, estimating the effectiveness, for answer extraction, of a list of passages retrieved in response to a question when relevance judgments are not available. Our prediction model integrates two types of estimates. The first estimates the probability that the information need expressed by the question is satisfied by the passages. This estimate is devised by adapting query-performance predictors developed for the document retrieval task. The second type estimates the probability that the passages contain the answers. This estimate relies on the occurrences of named entities that are likely to answer the question. Empirical evaluation demonstrates the merits of our prediction approach. For example, the prediction quality is much better than that of the only previous prediction method devised for the task at hand.
Coarse-to-fine sentence-level emotion classification based on the intra-sentence features and sentential context BIBAFull-Text 2455-2458
  Jun Xu; Ruifeng Xu; Qin Lu; Xiaolong Wang
This paper proposes a novel approach using a coarse-to-fine analysis strategy for sentence-level emotion classification which takes into consideration similarities to sentences in the training set as well as adjacent sentences in the context. First, we use intra-sentence based features to determine the emotion label set of a target sentence coarsely through the statistical information gained from the label sets of the k most similar sentences in the training data. Then, we use the emotion transfer probabilities between neighboring sentences to refine the emotion labels of the target sentences. Such iterative refinements terminate when the emotion classification converges. The proposed algorithm is evaluated on Ren-CECps, a Chinese blog emotion corpus. Experimental results show that the coarse-to-fine emotion classification algorithm improves sentence-level emotion classification by 19.11% on the average precision metric, outperforming the baseline methods.
Query-performance prediction and cluster ranking: two sides of the same coin BIBAFull-Text 2459-2462
  Oren Kurland; Fiana Raiber; Anna Shtok
We show that two tasks which were independently addressed in the information retrieval literature actually amount to the exact same task. The first is query performance prediction; i.e., estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments. The second task is cluster ranking, that is, ranking clusters of similar documents by their presumed effectiveness (i.e., relevance) with respect to the query. Furthermore, we show that several state-of-the-art methods that were independently devised for each of the two tasks are based on the same principles. Finally, we empirically demonstrate that using insights gained in work on query-performance prediction can help, in many cases, to improve the performance of a previously proposed cluster ranking method.
Learning to rank search results for time-sensitive queries BIBAFull-Text 2463-2466
  Nattiya Kanhabua; Kjetil Nørvåg
Retrieval effectiveness of temporal queries can be improved by taking into account the time dimension. Existing temporal ranking models follow one of two main approaches: 1) a mixture model linearly combining textual similarity and temporal similarity, and 2) a probabilistic model generating a query from the textual and temporal parts of a document independently. In this paper, we propose a novel time-aware ranking model based on learning-to-rank techniques. We employ two classes of features for learning a ranking model, entity-based and temporal features, which are derived from annotation data. Entity-based features are aimed at capturing the semantic similarity between a query and a document, whereas temporal features measure the temporal similarity. Through extensive experiments we show that our ranking model significantly improves the retrieval effectiveness over existing time-aware ranking models.
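The first of the two existing approaches mentioned in this abstract, a mixture model that linearly combines textual and temporal similarity, can be illustrated with a minimal sketch. The weighting parameter `alpha`, the function name, and the similarity values below are assumptions made for the example; the abstract does not specify how the similarities are computed or how the weight is set.

```python
def mixture_score(text_sim, time_sim, alpha=0.5):
    """Linear mixture of textual and temporal similarity.

    alpha is a hypothetical interpolation weight in [0, 1]:
    0 ranks by text only, 1 ranks by time only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * text_sim + alpha * time_sim


# Made-up (textual, temporal) similarities for three documents.
docs = {"d1": (0.9, 0.1), "d2": (0.6, 0.8), "d3": (0.3, 0.9)}

ranking = sorted(
    docs,
    key=lambda doc_id: mixture_score(*docs[doc_id], alpha=0.5),
    reverse=True,
)
```

With equal weighting, the document that is moderately good on both dimensions (`d2`) outranks the ones that are strong on only one.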
On active learning in hierarchical classification BIBAFull-Text 2467-2470
  Yu Cheng; Kunpeng Zhang; Yusheng Xie; Ankit Agrawal; Alok Choudhary
Most of the existing active learning algorithms assume all the category labels to be independent or consider them in a "flat" structure. However, in reality, there are many applications in which the set of possible labels is organized in a hierarchical structure. In this paper, we consider the problem of active learning when the categories are represented as a tree. Our goal is to exploit the structure information of the label tree in active learning to select the most informative samples to be labeled. We propose an algorithm that estimates the semantic space, embedding the category hierarchy. In this space, each category label is represented as a prototype and the uncertainty is measured in a variance-based fashion. We also demonstrate notable performance improvement with the proposed approach on synthetic and real datasets.
Question-answer topic model for question retrieval in community question answering BIBAFull-Text 2471-2474
  Zongcheng Ji; Fei Xu; Bin Wang; Ben He
The major challenge for Question Retrieval (QR) in Community Question Answering (CQA) is the lexical gap between the queried question and the historical questions. This paper proposes a novel Question-Answer Topic Model (QATM) to learn the latent topics aligned across the question-answer pairs to alleviate the lexical gap problem, with the assumption that a question and its paired answer share the same topic distribution. Experiments conducted on a real-world CQA dataset from Yahoo! Answers show that properly combining both parts captures more knowledge than using either part alone or simply mixing the two. They also show that combining our QATM with the state-of-the-art translation-based language model, where the topic and translation information is learned from the question-answer pairs at two different semantic granularities, can significantly improve QR performance.
How do humans distinguish different people with identical names on the web? BIBAFull-Text 2475-2478
  Harumi Murakami; Yuki Miyake
This research investigates how humans distinguish different people with identical names on the web to improve web people search. We asked subjects to classify 20 pages of web people-search results for each of 20 person names and analyzed their decision processes through questionnaire, protocol analysis, and interview. We found that keywords, vocations, works (for a real person, works are those made by the individual and, for a fictional person, works are those in which the individual appears), facial images, and the names of related people are important for distinguishing individuals. We proposed a model for distinguishing individuals and a knowledge-structure model based on the experiment's results.
Enhancing product search by best-selling prediction in e-commerce BIBAFull-Text 2479-2482
  Bo Long; Jiang Bian; Anlei Dong; Yi Chang
With the rapid growth of E-Commerce on the Internet, online product search service has emerged as a popular and effective paradigm for customers to find desired products and select transactions. Most product search engines today are based on adaptations of relevance models devised for information retrieval. However, there is still a big gap between the mechanism of finding products that customers really desire to purchase and that of retrieving products of high relevance to customers' queries. In this paper, we address this problem by proposing a new ranking framework for enhancing product search based on dynamic best-selling prediction in E-Commerce. Specifically, we first develop an effective algorithm to predict the dynamic best-selling, i.e. the volume of sales, for each product item based on its transaction history. By incorporating such best-selling prediction with relevance, we propose a new ranking model for product search, in which we rank higher the product items that are not only relevant to the customer's need but also more likely to be purchased by the customer. Results of a large-scale evaluation, conducted over the dataset from a commercial product search engine, demonstrate that our new ranking method is more effective for locating those product items that customers really desire to buy at higher rank positions without hurting the search relevance.
Survival analysis for freshness in microblogging search BIBAFull-Text 2483-2486
  Gianni Amati; Giuseppe Amodeo; Carlo Gaibisso
Freshness of information in real-time search is central in social networks, news, blogs and micro-blogs. Nevertheless, there is no clear experimental evidence showing which principled approach effectively combines time and content.
   We introduce a novel approach to model freshness using a survival analysis of relevance over time. In such models, freshness is measured by the tail probability of relevance over time. We also assume that the probability distributions for freshness are heavy-tailed. The heavy-tailed models of freshness are shown to be highly effective on the micro-blogging test collection of TREC 2011. The improvements over the state-of-the-art time-based models are statistically significant or moderately significant.
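The tail-probability idea can be illustrated with a minimal sketch. It scores freshness with the survival function of a Lomax (Pareto type II) distribution, one common heavy-tailed family. The choice of distribution, the parameter values, and the function names are assumptions made for illustration; the abstract does not say which heavy-tailed models the authors fit.

```python
import math


def freshness(age_hours: float, shape: float = 1.5, scale: float = 24.0) -> float:
    """Survival function S(t) = P(T > t) of a Lomax distribution.

    `age_hours` is the document age at query time. `shape` and `scale`
    are hypothetical parameters, not values taken from the paper.
    """
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    return (1.0 + age_hours / scale) ** (-shape)


def exponential_freshness(age_hours: float, scale: float = 24.0) -> float:
    """Light-tailed baseline for comparison: S(t) = exp(-t / scale)."""
    return math.exp(-age_hours / scale)


if __name__ == "__main__":
    for age in (0, 24, 240):
        print(age, freshness(age), exponential_freshness(age))
```

Under these assumed parameters, an old document keeps a small but non-negligible freshness score with the heavy-tailed model, whereas the exponential baseline drives it to almost zero.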
Information preservation in static index pruning BIBAFull-Text 2487-2490
  Ruey-Cheng Chen; Chia-Jung Lee; Chiung-Min Tsai; Jieh Hsiang
We develop a new static index pruning criterion based on the notion of information preservation. This idea is motivated by the fact that model degeneration, of which static index pruning is an instance, inevitably reduces the predictive power of the resulting model. We model this loss in predictive power using conditional entropy and show that the decision in static index pruning can therefore be optimized to preserve as much information as possible. We evaluated the proposed approach on three different test corpora, and the results show that our approach is comparable in retrieval performance to state-of-the-art methods. When efficiency is a concern, our method has some advantages over the reference methods and is therefore suggested for Web retrieval settings.
Temporal models for microblogs BIBAFull-Text 2491-2494
  Jaeho Choi; W. Bruce Croft
Time information impacts relevance in retrieval for queries that are sensitive to trends and events. Microblog services focus particularly on recent news and events, so dealing with the temporal aspects of microblogs is essential for providing effective retrieval. Recent work on time-based retrieval has shown that selecting the relevant time period for query expansion is promising. In this paper, we suggest a method for selecting the time period for query expansion based on a user behavior (i.e., retweets) that can be collected easily. We then use these time periods for query expansion in a pseudo-relevance feedback setting. More specifically, we use the difference in the temporal distribution between the top retrieved documents and retweets. The experimental results based on the TREC Microblog collection show that our method for selecting periods for query expansion improves retrieval performance compared to another approach.
I want what i need!: analyzing subjectivity of online forum threads BIBAFull-Text 2495-2498
  Prakhar Biyani; Cornelia Caragea; Amit Singh; Prasenjit Mitra
Online forums have become a popular source of information due to the unique nature of the information they contain. Internet users use these forums to get other people's opinions on issues and to find factual answers to specific questions. Topics discussed in online forum threads can be subjective, seeking personal opinions, or non-subjective, seeking factual information. Hence, knowing the subjectivity orientation of threads would help forum search engines satisfy users' information needs more effectively by matching the subjectivities of the user's query and the topics discussed in the threads, in addition to the lexical match between the two. We study methods to analyze the subjectivity of online forum threads. Experimental results on a popular online forum demonstrate the effectiveness of our methods.
Improving the performance of the reinforcement learning model for answering complex questions BIBAFull-Text 2499-2502
  Yllias Chali; Sadid A. Hasan; Kaisar Imam
This paper addresses the task of answering complex questions using a multi-document summarization approach within a reinforcement learning setting. Given a set of complex questions, a list of relevant documents per question, and the corresponding human-generated summaries (i.e. answers to the questions) as training data, the reinforcement learning module iteratively learns a number of feature weights in order to facilitate the automatic generation of summaries, i.e. answers to unseen complex questions. Previous work on this task has utilized a fully automatic reinforcement learning framework that selects document sentences as potential candidate (i.e. machine-generated) summary sentences by exploiting a relatedness measure with the available human-written summaries. In this paper, we propose an extension to this model that incorporates user interaction into the reinforcement learner to guide the candidate summary sentence selection process. Experimental results reveal the effectiveness of the user interaction component in the reinforcement learning framework.
Relation regularized subspace recommending for related scientific articles BIBAFull-Text 2503-2506
  Qing Zhang; Jianwu Li; Zhiping Zhang; Li Wang
Recommending related scientific articles to a researcher is important and useful in practice but is also challenging due to the complex latent semantic relations among scientific publications. To deal with these challenges, this paper proposes a novel framework with link-missing data adaptation, which casts the recommendation task as subspace embedding and similarity ranking problems. The relation regularized subspace in this framework is constructed via Relation Regularized Matrix Factorization (RRMF) to model both content and link structure simultaneously. However, the link structure for an article is not always available in practical recommendation. To solve this problem, we further propose two alternative approaches based on Latent Dirichlet Allocation (LDA) for recommending link-missing articles, as an extension of RRMF. Experiments on the CiteSeer dataset demonstrate that our method is more effective than some state-of-the-art approaches and is able to handle the link-missing case, which link-based methods cannot.
Exploring the cluster hypothesis, and cluster-based retrieval, over the web BIBAFull-Text 2507-2510
  Fiana Raiber; Oren Kurland
We present a study of the cluster hypothesis, and of the performance of cluster-based retrieval methods, performed over large-scale Web collections. Among the findings we present are (i) the cluster hypothesis can hold, as determined by a specific test, for large-scale Web corpora to the same extent it does for newswire corpora; (ii) while spam documents do not affect the extent to which the cluster hypothesis holds, they considerably affect the performance of cluster-based, as well as that of document-based, retrieval methods; and, (iii) as is the case for newswire corpora, cluster-based methods can yield better performance than document-based methods for Web corpora.
A picture paints a thousand words: a method of generating image-text timelines BIBAFull-Text 2511-2514
  Shize Xu; Liang Kong; Yan Zhang
Manual timelines have greatly helped us to keep pace with the big world. In this paper, we introduce a novel solution which generates image-text timelines for news events based on Evolutionary Image-Text Summarization, which is an important and challenging problem. We first extract each image's semantic information using a translation model, and then fuse the high-quality images with the text timeline using an image assignment algorithm that optimizes the global coordination of the final timeline. The experimental results show that news readers are more satisfied with the image-text timelines we generate.
Short-text domain specific key terms/phrases extraction using an n-gram model with wikipedia BIBAFull-Text 2515-2518
  M. Atif Qureshi; Colm O'Riordan; Gabriella Pasi
Finding domain specific key terms/phrases from a given set of documents is a challenging task. A domain may be defined as an area of interest over a collection of documents which may not be explicitly defined but implicitly observable in those documents. When considering a collection of documents related to academic research, examples of key terms/phrases may be "Information Retrieval", "Marine Biology", etc. In this paper a technique for extracting important key terms/phrases in a considered topical domain is proposed using external evidence from the titles of Wikipedia articles and the Wikipedia category graph. We performed some experiments over the document collection of Web sites of different post-graduate schools. Our preliminary evaluations show promising results for the detection of domain specific key terms/phrases from the given set of domain focused Web pages.
A new probabilistic model for top-k ranking problem BIBAFull-Text 2519-2522
  Shuzi Niu; Yanyan Lan; Jiafeng Guo; Xueqi Cheng
This paper is concerned with the top-k ranking problem, which reflects the fact that people pay more attention to the top-ranked objects in real ranking applications such as information retrieval. A popular approach to the top-k ranking problem is based on probabilistic models, such as the Luce model and the Mallows model. However, whether the sequential generative process described in these models is a suitable way to model top-k ranking remains a question. Following the riffled independence factorization proposed in recent literature, which is a natural structural assumption on top-k ranking, we propose a new generative process for top-k ranking data. Our approach decomposes distributions over the top-k ranking into two layers: the first layer describes the relative ordering between the top k objects and the remaining n-k objects, and the second layer describes the full ordering of the top k objects. On this basis, we propose a new probabilistic model for the top-k ranking problem, called the hierarchical ordering model. Specifically, we use three different probabilistic models to describe different generative processes for the first layer, and the Luce model to describe the sequential generative process of the second layer, and thus obtain three different specific hierarchical ordering models. We also conduct extensive experiments on benchmark datasets to show that our proposed models can outperform previous models significantly.
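To make the second layer concrete, here is a small sketch of the standard Luce (Plackett-Luce) model, in which items are placed one position at a time with probability proportional to a positive weight among the items not yet placed. The function name and the example weights are hypothetical, and the three first-layer models mentioned in the abstract are not shown because the abstract does not specify them.

```python
from itertools import permutations


def luce_probability(ordering, weights):
    """Probability of `ordering` under the Luce model.

    `ordering` lists item identifiers from first to last position and
    `weights` maps each item to a positive weight. At every position the
    next item is chosen with probability weight / (sum of remaining weights).
    """
    w = [weights[item] for item in ordering]
    if any(x <= 0 for x in w):
        raise ValueError("Luce weights must be positive")
    prob = 1.0
    for i, x in enumerate(w):
        prob *= x / sum(w[i:])
    return prob


if __name__ == "__main__":
    example = {"a": 3.0, "b": 2.0, "c": 1.0}  # illustrative weights only
    for perm in permutations(example):
        print(perm, round(luce_probability(perm, example), 4))
```

In a two-layer scheme of the kind described above, a function like this would only be applied to the k objects that the first layer has already placed in the top group.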
Large scale analysis of changes in English vocabulary over recent time BIBAFull-Text 2523-2526
  Adam Jatowt; Katsumi Tanaka
Recently many historical texts have become digitized and made accessible for search and browsing. As human language is subject to constant evolution, these texts pose varying challenges to current users. In this paper we report the results of large-scale studies on the usage of words and the evolution of English language vocabulary over the last two centuries to help with understanding its impact on readability and retrieval of historical documents. We perform analysis of several lexical factors which may influence accessibility and readability of historical texts based on two large scale lexical corpora: the Corpus of Historical American English and Google Books 1-gram.
Climbing the app wall: enabling mobile app discovery through context-aware recommendations BIBAFull-Text 2527-2530
  Alexandros Karatzoglou; Linas Baltrunas; Karen Church; Matthias Böhmer
The explosive growth of the mobile application (app) market has made it difficult for users to find the most interesting and relevant apps from the hundreds of thousands that exist today. Context is key in the mobile space and so too are proactive services that ease user input and facilitate effective interaction. We believe that to enable truly novel mobile app recommendation and discovery, we need to support real context-aware recommendation that utilizes the diverse range of implicit mobile data available in a fast and scalable manner. In this paper we introduce the Djinn model, a novel context-aware collaborative filtering algorithm for implicit feedback data that is based on tensor factorization. We evaluate our approach using a dataset from an Android mobile app recommendation service called appazaar. Our results show that our approach compares favorably with state-of-the-art collaborative filtering methods.
TwiSent: a multistage system for analyzing sentiment in Twitter BIBAFull-Text 2531-2534
  Subhabrata Mukherjee; Akshat Malu; A Balamurali A. R.; Pushpak Bhattacharyya
In this paper, we present TwiSent, a sentiment analysis system for Twitter. Based on the topic searched, TwiSent collects tweets pertaining to it and categorizes them into the polarity classes positive, negative and objective. However, analyzing micro-blog posts poses many inherent challenges compared to other text genres. Through TwiSent, we address the problems of 1) spam pertaining to sentiment analysis in Twitter, 2) structural anomalies in the text in the form of incorrect spellings, nonstandard abbreviations, slang etc., 3) entity specificity in the context of the topic searched and 4) pragmatics embedded in text. The system performance is evaluated on manually annotated gold standard data and on an automatically annotated tweet set based on hashtags. It is common practice to show the efficacy of a supervised system on an automatically annotated dataset. However, we show that such a system achieves lower classification accuracy when tested on a generic Twitter dataset. We also show that our system performs much better than an existing system.
Twitter hyperlink recommendation with user-tweet-hyperlink three-way clustering BIBAFull-Text 2535-2538
  Dehong Gao; Renxian Zhang; Wenjie Li; Yuexian Hou
Twitter, the most famous micro-blogging service and online social network, collects millions of tweets every day. Due to the length limitation, users usually need to explore other ways to enrich the content of their tweets. Some studies have provided findings to suggest that users can benefit from added hyperlinks in tweets. In this paper, we focus on the hyperlinks in Twitter and propose a new application, called hyperlink recommendation in Twitter. We expect that the recommended hyperlinks can be used to enrich the information of user tweets. A three-way tensor is used to model the user-tweet-hyperlink collaborative relations. Two tensor-based clustering approaches, tensor decomposition-based clustering (TDC) and tensor approximation-based clustering (TAC), are developed to group the users, tweets and hyperlinks with similar interests or similar contexts. Recommendation is then made based on the reconstructed tensor using cluster information. The evaluation results in terms of Mean Absolute Error (MAE) show the advantages of both the TDC and TAC approaches over a baseline recommendation approach, i.e., memory-based collaborative filtering. Comparatively, the TAC approach achieves better performance than the TDC approach.
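For reference, the evaluation metric named above, Mean Absolute Error, is the average absolute difference between predicted and observed values. The sketch below uses made-up numbers purely for illustration; the function name and inputs are not taken from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted_i - actual_i| over paired values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical reconstructed tensor entries vs. held-out observations.
    print(mean_absolute_error([1.0, 0.0, 0.5], [1.0, 1.0, 0.0]))
```

A lower MAE means the reconstructed entries are closer to the held-out observations.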
Concavity in IR models BIBAFull-Text 2539-2542
  Stéphane Clinchant
We study the impact of concavity in IR models and propose to use a generalized logarithm function, the n-logarithm, to weight words in documents. We extend the family of information-based Information Retrieval (IR) models with this function. We show that concavity is indeed an important property of IR models. Experiments conducted for IR tasks, Latent Semantic Indexing and Text Categorization show improvements.
Extracting interesting association rules from toolbar data BIBAFull-Text 2543-2546
  Ilaria Bordino; Debora Donato; Barbara Poblete
Toolbar navigation logs provide rich data for enhancing information discovery on the Web. The value of this data resides in its scope, which goes beyond that of traditional query-mining data sources, such as search-engine logs. In this paper we present a methodology for extracting relevant association rules for queries, based on historic user navigational data. In addition, we propose a graph-based approach for extracting related queries and URLs for a given query.
Predicting CTR of new ads via click prediction BIBAFull-Text 2547-2550
  Alexander Kolesnikov; Yury Logachev; Valeriy Topinskiy
Predicting the CTR of ads on the search result page is an urgent topic, because choosing the right advertisement greatly affects the revenue of the search engine and advertisers as well as user satisfaction. For ads with a large click history it is quite clear how to predict CTR by utilizing statistical data. But for new ads with a poor click history such an approach is not robust and reliable. We suggest a model for predicting the CTR of such new ads. Contrary to previous models for predicting the CTR of new ads, our model uses events (clicks and skips) instead of the observed CTR. In addition, we have implemented several novel features that resulted in an increase in the performance of our model. Offline and online experiments on a real search engine system demonstrated that our model outperforms the baseline and the approaches suggested in previous papers.
An examination of content farms in web search using crowdsourcing BIBAFull-Text 2551-2554
  Richard McCreadie; Craig Macdonald; Iadh Ounis; Jim Giles; Ferris Jabr
On the Web, content farms produce articles engineered such that search engines rank them highly, in order to turn a profit from online advertising. Recently, content farms have increasingly been the target of demotion strategies by Web search engines, since content farm articles are often considered to be of suspect quality. In this paper, we study the prevalence of content farms in the results returned by three major Web search engines over time. In particular, we develop a crowdsourced approach to identify content farm articles from the results returned by these search engines. Our results show that between March and August 2011, the number of content farm articles observed on a number of indicative queries was reduced by up to 55% in the top ranks.
Demographic context in web search re-ranking BIBAFull-Text 2555-2558
  Eugene Kharitonov; Pavel Serdyukov
In this paper we study the usefulness of users' demographic context for improving the ranking of ambiguous queries. A context-aware relevance model is learnt from implicit user behaviour by using a simple yet general modification of a state-of-the-art click model that is capable of capturing dependences on the search context. After that, the machine-learned click model is used in an offline re-ranking experiment, and it is demonstrated that the demographic context ranking features provide improvements in ranking quality. Further, we perform a study to investigate the impact of different facets of demographic features (gender, age, and income) on search ranking performance and manually analyse queries which exhibit strong context dependences to get an additional understanding of the model behaviour.
On the usefulness of query features for learning to rank BIBAFull-Text 2559-2562
  Craig Macdonald; Rodrygo L. T. Santos; Iadh Ounis
Learning to rank studies have mostly focused on query-dependent and query-independent document features, which enable the learning of ranking models of increased effectiveness. Modern learning to rank techniques based on regression trees can support query features, which are document-independent, and hence have the same values for all documents being ranked for a query. In doing so, such techniques are able to learn sub-trees that are specific to certain types of query. However, it is unclear which classes of features are useful for learning to rank, as previous studies leveraged anonymised features. In this work, we examine the usefulness of four classes of query features, based on topic classification, the history of the query in a query log, the predicted performance of the query, and the presence of concepts such as persons and organisations in the query. Through experiments on the ClueWeb09 collection, our results using a state-of-the-art learning to rank technique based on regression trees show that all four classes of query features can significantly improve upon an effective learned model that does not use any query feature.
Session-based query performance prediction BIBAFull-Text 2563-2566
  Andrey Kustarev; Yury Ustinovskiy; Anna Mazur; Pavel Serdyukov
Search sessions are known to be a rich source of diverse valuable information for individual query analysis. In this paper, we address the problem of query performance prediction by utilizing the entire logical search sessions containing the given query. Guided by intuitions based on observations from an analysis of the search sessions' properties and the performance of the queries they contain, we propose a number of features that significantly advance the existing query performance prediction models. Some of them specifically make it possible to focus on tail queries with sparse click-through statistics.
A latent pairwise preference learning approach for recommendation from implicit feedback BIBAFull-Text 2567-2570
  Yi Fang; Luo Si
Most of the current recommender systems heavily rely on explicit user feedback such as ratings on items to model users' interests. However, in many applications, it is very hard to collect the explicit feedback, while implicit feedback such as user clicks may be more available. Furthermore, it is often more suitable for many recommender systems to address a ranking problem than a rating predicting problem. This paper proposes a latent pairwise preference learning (LPPL) approach for recommendation with implicit feedback. LPPL directly models user preferences with respect to a set of items rather than the rating scores on individual items, which are modeled with a set of features by analyzing clickthrough data available in many real-world recommender systems. The LPPL approach models both the latent variables of group structure of users and the pairwise preferences simultaneously. We conduct experiments on the testbed from a real-world recommender system and demonstrate that the proposed approach can effectively improve the recommendation performance against several baseline algorithms.
Topic based pose relevance learning in dance archives BIBAFull-Text 2571-2574
  Reede Ren; John Collomosse; Joemon Jose
This paper improves the spatial pyramid kernel (SPK) and proposes a relevance learning approach to compare performers' poses in a large dance archive, the NRCD collection. Domain knowledge of Choreutics is exploited to define pose topics and a selection operator is developed for pose topic matching. The visual structure descriptor of self similarity (SSF) is extended to hierarchical self similarity (HSSF) to keep shape context. The framework of Bag-of-Visual Words (BOVW) is applied to encode as well as to speed up the matching on pose topics/topic combinations. This alleviates the complexity in limb allocation which is infeasible in our data. Extensive experiments show that the new approach outperforms the original SPK in both precision and robustness.
PhotoFall: discovering weblog stories through photographs BIBAFull-Text 2575-2578
  Christopher Wienberg; Andrew S. Gordon
An effective means of retrieving relevant photographs from the web is to search for terms that would likely appear in the surrounding text in multimedia documents. In this paper, we investigate the complementary search strategy, where relevant multimedia documents are retrieved using the photographs they contain. We concentrate our efforts on the retrieval of large numbers of personal stories posted to Internet weblogs that are relevant to a particular search topic. Photographs are often included in posts of this sort, typically taken by the author during the course of the narrated events of the story. We describe a new story search tool, PhotoFall, which allows users to quickly find stories related to their topic of interest by judging the relevance of the photographs extracted from top search results. We evaluate the accuracy of relevance judgments made using this interface, and discuss the implications of the results for improving topic-based searches of multimedia content.
RESQ: rank-energy selective query forwarding for distributed search systems BIBAFull-Text 2579-2582
  Amin Teymorian; Xiao Qin; Ophir Frieder
Selective query forwarding is a promising technique to help scale high-quality and cost-efficient query evaluation in distributed search systems. The basic idea is simple. After a local site receives a query, it determines non-local sites to forward the query to and returns an aggregation of local and non-local results. We introduce "RESQ", a hybrid rank-energy selective query forwarding model. The novel contribution of RESQ is to simultaneously consider both ranking quality and energy costs when making forwarding decisions. Using a large-scale query log and publicly-available energy price time series, we demonstrate the ability of RESQ forwarding to achieve favorable tradeoffs between the possibility of returning high ranking query results and savings in temporally- and spatially-varying energy prices.
The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy BIBAFull-Text 2583-2586
  Gabriella Kazai; Jaap Kamps; Natasa Milic-Frayling
Information retrieval systems require human-contributed relevance labels for their training and evaluation. Increasingly, such labels are collected under the anonymous, uncontrolled conditions of crowdsourcing, leading to varied output quality. While a range of quality assurance and control techniques have now been developed to reduce noise during or after task completion, little is known about the workers themselves and possible relationships between workers' characteristics and the quality of their work. In this paper, we ask what relatively well- or poorly-performing crowds, working under specific task conditions, actually look like in terms of worker characteristics, such as demographics or personality traits. Our findings show that the face of a crowd is in fact indicative of the quality of their work.
Data filtering in humor generation: comparative analysis of hit rate and co-occurrence rankings as a method to choose usable pun candidates BIBAFull-Text 2587-2590
  Pawel Dybala; Rafal Rzepka; Kenji Araki; Kohichi Sayama
In this paper we propose a method of filtering excessive amounts of textual data acquired from the Internet. In our research on pun generation in Japanese we experienced problems with excessively long data processing times, caused by the number of phonetic candidates (i.e. phrases that can be used to generate actual puns) generated by our system. A simple, naive approach in which we take into consideration only phrases with the highest occurrence on the Internet can result in the deletion of candidates that are actually usable. Thus, we propose a data filtering method in which we compare two Internet-based rankings, a co-occurrence ranking and a hit rate ranking, and select only candidates which occupy the same or similar positions in these rankings. In this work we analyze the effects of such data reduction, considering five cases: when the candidates are in exactly the same positions in both rankings, and when their positions differ by 1, 2, 3 and 4. The analysis is conducted on data acquired by comparing pun candidates generated by the system (and filtered with our method) with phrases that were actually used in puns created by humans. The results show that the proposed method can be used to filter excessive amounts of textual data acquired from the Internet.
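The rank-agreement filter described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name, the list-based representation of the two rankings, and the toy candidates are all assumptions.

```python
def filter_by_rank_agreement(cooccurrence_ranking, hit_rate_ranking, max_diff=0):
    """Keep candidates whose positions in the two rankings differ by at most `max_diff`.

    Both arguments are lists of candidate phrases ordered from best to worst.
    `max_diff=0` keeps only candidates in exactly the same position; values
    of 1 to 4 correspond to the looser cases mentioned in the abstract.
    """
    hit_position = {cand: pos for pos, cand in enumerate(hit_rate_ranking)}
    return [
        cand
        for pos, cand in enumerate(cooccurrence_ranking)
        if cand in hit_position and abs(pos - hit_position[cand]) <= max_diff
    ]


if __name__ == "__main__":
    cooc = ["a", "b", "c", "d"]   # hypothetical co-occurrence ranking
    hits = ["a", "c", "b", "d"]   # hypothetical hit rate ranking
    print(filter_by_rank_agreement(cooc, hits, max_diff=0))
    print(filter_by_rank_agreement(cooc, hits, max_diff=1))
```

Raising `max_diff` trades a larger surviving candidate set against weaker agreement between the two rankings.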
Predicting primary categories of business listings for local search BIBAFull-Text 2591-2594
  Changsung Kang; Jeehaeng Lee; Yi Chang
We consider the problem of identifying primary categories of a business listing among the categories provided by the owner of the business. The category information submitted by business owners cannot be trusted with absolute certainty since they may purposefully add some secondary or irrelevant categories to increase recall in local search results, which makes category search very challenging for local search engines. Thus, identifying primary categories of a business is a crucial problem in local search. This problem can be cast as a multi-label classification problem with a large number of categories. However, the large scale of the problem makes it infeasible to use conventional supervised-learning-based text categorization approaches.
   We propose a large-scale classification framework that leverages multiple types of classification labels to produce a highly accurate classifier with fast training time. We effectively combine the complementary label sources to refine prediction. The experimental results indicate that our framework achieves very high precision and recall and outperforms a Centroid-based method.
Where do the query terms come from?: an analysis of query reformulation in collaborative web search