ACM Transactions on Information Systems 30

Editors: Jamie Callan
Standard No: ISSN 1046-8188; HF S548.125 A33
Links: Table of Contents
  1. TOIS 2012-02 Volume 30 Issue 1
  2. TOIS 2012-05 Volume 30 Issue 2
  3. TOIS 2012-08 Volume 30 Issue 3
  4. TOIS 2012-11 Volume 30 Issue 4

TOIS 2012-02 Volume 30 Issue 1

Word-based self-indexes for natural language text 1
  Antonio Fariña; Nieves R. Brisaboa; Gonzalo Navarro; Francisco Claude; Ángeles S. Places; Eduardo Rodríguez
The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases.
   We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
Static index pruning in web search engines: Combining term and document popularities with query views 2
  Ismail S. Altingovde; Rifat Ozcan; Özgür Ulusoy
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access this particular document, that is, retrieve this document among their top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts, especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality, especially for conjunctive query processing, which is the default and most common search mode of a search engine.
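The query-view definition in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names `build_query_views` and `prune_postings`, the toy query log, and the rule of keeping only postings whose term lies in the document's query view are all assumptions made for illustration.

```python
from collections import defaultdict


def build_query_views(top_results):
    """Map each document to the set of query terms that retrieved it.

    top_results: dict mapping a query string to the list of document ids
    returned among its top results (a hypothetical, pre-computed query log).
    """
    views = defaultdict(set)
    for query, docs in top_results.items():
        terms = query.lower().split()
        for doc in docs:
            views[doc].update(terms)
    return dict(views)


def prune_postings(index, views):
    """Keep a posting (term, doc) only if term is in the query view of doc.

    index: dict mapping a term to the list of document ids containing it.
    This is an illustrative pruning rule, not one of the paper's strategies.
    """
    return {
        term: [d for d in docs if term in views.get(d, set())]
        for term, docs in index.items()
    }


# Toy data (hypothetical): two logged queries and a three-term index.
log = {"cheap flights": ["d1", "d2"], "flights paris": ["d2"]}
views = build_query_views(log)
index = {"cheap": ["d1", "d2", "d3"], "paris": ["d1", "d2"], "hotel": ["d3"]}
pruned = prune_postings(index, views)
```

In this toy run, the posting of "paris" for d1 is dropped because no logged query containing "paris" returned d1 among its top results.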
Summarizing figures, tables, and algorithms in scientific publications to augment search results 3
  Sumit Bhatia; Prasenjit Mitra
Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context as in a document-element-search-engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short "synopsis" of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document as examining the synopsis takes a shorter time than finding information about a document element by downloading, opening and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. 
   We also investigate the problem of choosing the optimum synopsis-size that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.
Multiple testing in statistical analysis of systems-based information retrieval experiments 4
  Benjamin A. Carterette
High-quality reusable test collections and formal statistical hypothesis testing together support a rigorous experimental environment for information retrieval research. But as Armstrong et al. [2009b] recently argued, global analysis of experiments suggests that there has actually been little real improvement in ad hoc retrieval effectiveness over time. We investigate this phenomenon in the context of simultaneous testing of many hypotheses using a fixed set of data. We argue that the most common approaches to significance testing ignore a great deal of information about the world. Taking into account even a fairly small amount of this information can lead to very different conclusions about systems than those that have appeared in published literature. We demonstrate how to model a set of IR experiments for analysis both mathematically and practically, and show that doing so can cause p-values from statistical hypothesis tests to increase by orders of magnitude. This has major consequences on the interpretation of experimental results using reusable test collections: it is very difficult to conclude that anything is significant once we have modeled many of the sources of randomness in experimental design and analysis.
High-performance processing of text queries with tunable pruned term and term pair indexes 5
  Andreas Broschart; Ralf Schenkel
Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This article introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance under controlled result quality, given an upper bound for the index size. The framework allows lists for pairs to be selectively materialized based on a query log, further reducing index size. Extensive experiments with two large text collections demonstrate runtime improvements of more than one order of magnitude over existing text-based processing techniques with reasonable index sizes.
Large-scale validation and analysis of interleaved search evaluation 6
  Olivier Chapelle; Thorsten Joachims; Filip Radlinski; Yisong Yue
Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.

TOIS 2012-05 Volume 30 Issue 2

Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives 7
  Xin Cao; Gao Cong; Bin Cui; Christian S. Jensen; Quan Yuan
Community Question Answering (CQA) is a popular type of service where users ask questions and where answers are obtained from other users or from historical question-answer pairs. CQA archives contain large volumes of questions organized into a hierarchy of categories. As an essential function of CQA services, question retrieval in a CQA archive aims to retrieve historical question-answer pairs that are relevant to a query question. This article presents several new approaches to exploiting the category information of questions for improving the performance of question retrieval, and it applies these approaches to existing question retrieval models, including a state-of-the-art question retrieval model. Experiments conducted on real CQA data demonstrate that the proposed techniques are effective and efficient and are capable of outperforming a variety of baseline methods significantly.
A Probabilistic Model to Combine Tags and Acoustic Similarity for Music Retrieval 8
  Riccardo Miotto; Nicola Orio
The rise of the Internet has led the music industry to a transition from physical media to online products and services. As a consequence, current online music collections store millions of songs and are constantly being enriched with new content. This has created a need for music technologies that allow users to interact with these extensive collections efficiently and effectively. Music search and discovery may be carried out using tags, matching user interests and exploiting content-based acoustic similarity. One major issue in music information retrieval is how to combine such noisy and heterogeneous information sources in order to improve retrieval effectiveness. With this aim in mind, the article explores a novel music retrieval framework based on combining tags and acoustic similarity through a probabilistic graph-based representation of a collection of songs. The retrieval function highlights the path across the graph that most likely observes a user query and is used to improve state-of-the-art music search and discovery engines by delivering more relevant ranking lists. Indeed, by means of an empirical evaluation, we show how the proposed approach leads to better performance than retrieval strategies which rank songs according to individual information sources alone or which use a combination of them.
Peer-to-Peer Information Retrieval: An Overview 9
  Almer S. Tigelaar; Djoerd Hiemstra; Dolf Trieschnigg
Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these has seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralized solutions. In this article we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralized client-server solutions in terms of scalability, performance, user satisfaction, and freedom.
Exploring Question Selection Bias to Identify Experts and Potential Experts in Community Question Answering 10
  Aditya Pal; F. Maxwell Harper; Joseph A. Konstan
Community Question Answering (CQA) services enable their users to exchange knowledge in the form of questions and answers. These communities thrive as a result of a small number of highly active users, typically called experts, who provide a large number of high-quality useful answers. Expert identification techniques enable community managers to take measures to retain the experts in the community. There is further value in identifying the experts during the first few weeks of their participation as it would allow measures to nurture and retain them. In this article we address two problems: (a) How to identify current experts in CQA? and (b) How to identify users who have the potential to become experts in the future (potential experts)? In particular, we propose a probabilistic model that captures the selection preferences of users based on the questions they choose for answering. The probabilistic model allows us to run machine learning methods for identifying experts and potential experts. Our results over several popular CQA datasets indicate that experts differ considerably from ordinary users in their selection preferences, enabling us to predict experts with higher accuracy than several baseline models. We show that selection preferences can be combined with baseline measures to improve the predictive performance even further.
Predicting Query Performance by Query-Drift Estimation 11
  Anna Shtok; Oren Kurland; David Carmel; Fiana Raiber; Gad Markovits
Predicting query performance, that is, the effectiveness of a search performed in response to a query, is a highly important and challenging problem. We present a novel approach to this task that is based on measuring the standard deviation of retrieval scores in the result list of the documents most highly ranked. We argue that for retrieval methods that are based on document-query surface-level similarities, the standard deviation can serve as a surrogate for estimating the presumed amount of query drift in the result list, that is, the presence (and dominance) of aspects or topics not related to the query in documents in the list. Empirical evaluation demonstrates the prediction effectiveness of our approach for several retrieval models. Specifically, the prediction quality often transcends that of current state-of-the-art prediction methods.
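As a rough illustration of the core measurement described in this abstract, the sketch below computes the standard deviation of the retrieval scores of the k most highly ranked documents. It is a minimal sketch, not the authors' predictor: the function name `score_std_predictor`, the default cutoff `k=100`, and the absence of any normalization are assumptions, since the abstract does not specify them.

```python
from statistics import pstdev


def score_std_predictor(scores, k=100):
    """Return the population standard deviation of the k highest scores.

    scores: retrieval scores of the documents in a result list.
    k: number of top-ranked documents to consider (hypothetical default).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    top = sorted(scores, reverse=True)[:k]
    return pstdev(top)


# Toy result lists (hypothetical scores): one flat, one spread out.
flat = score_std_predictor([2.0, 2.0, 2.0, 2.0])
spread = score_std_predictor([4.0, 3.0, 1.0, 0.0])
```

A flat score list yields a value of zero, while a list whose top scores are well separated yields a larger value.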
Authorship Attribution Based on Specific Vocabulary 12
  Jacques Savoy
In this article we propose a technique for computing a standardized Z score capable of defining the specific vocabulary found in a text (or part thereof) compared to that of an entire corpus. Assuming that the term occurrence follows a binomial distribution, this method is then applied to weight terms (words and punctuation symbols in the current study), representing the lexical specificity of the underlying text. In a final stage, to define an author profile we suggest averaging these text representations and then applying them along with a distance measure to derive a simple and efficient authorship attribution scheme. To evaluate this algorithm and demonstrate its effectiveness, we develop two experiments, the first based on 5,408 newspaper articles (Glasgow Herald) written in English by 20 distinct authors and the second on 4,326 newspaper articles (La Stampa) written in Italian by 20 distinct authors. These experiments demonstrate that the suggested classification scheme tends to perform better than the Delta rule method based on the most frequent words, better than the chi-square distance based on word profiles and punctuation marks, better than the KLD scheme based on a predefined set of words, and better than the naïve Bayes approach.
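The standardized score mentioned in this abstract can be sketched with the usual standardization of a binomial count, which has mean n*p and variance n*p*(1-p). This is an illustration under stated assumptions, not the paper's exact formulation: the function name `z_score` and the way the probability `p` is obtained (for example, the term's relative frequency in the corpus) are hypothetical.

```python
import math


def z_score(tf, n, p):
    """Standardized score of observing a term tf times in a text of n tokens.

    Under a binomial model with occurrence probability p, the count has
    mean n*p and variance n*p*(1-p). How p is estimated is an assumption
    of this sketch (e.g. relative frequency of the term in the corpus).
    """
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    return (tf - n * p) / math.sqrt(n * p * (1.0 - p))


# Toy values (hypothetical): a term expected 10 times in 100 tokens.
overused = z_score(tf=19, n=100, p=0.1)
underused = z_score(tf=1, n=100, p=0.1)
```

Positive values mark terms used more often than the corpus-wide rate would suggest, and negative values mark terms used less often.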
Oracle in Image Search: A Content-Based Approach to Performance Prediction 13
  Liqiang Nie; Meng Wang; Zheng-Jun Zha; Tat-Seng Chua
This article studies a novel problem in image search. Given a text query and the image ranking list returned by an image search system, we propose an approach to automatically predict the search performance. We demonstrate that, in order to estimate the mathematical expectations of Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG), we only need to predict the relevance probability of each image. We accomplish the task with a query-adaptive graph-based learning based on the images' ranking order and visual content. We validate our approach with a large-scale dataset that contains the image search results of 1,165 queries from 4 popular image search engines. Empirical studies demonstrate that our approach is able to generate predictions that are highly correlated with the real search performance. Based on the proposed image search performance prediction scheme, we introduce three applications: image metasearch, multilingual image search, and Boolean image search. Comprehensive experiments are conducted to validate our approach.
A Measurement Framework for Evaluating Emulators for Digital Preservation 14
  Mark Guttenbrunner; Andreas Rauber
Accessible emulation is often the method of choice for maintaining digital objects, specifically complex ones such as applications, business processes, or electronic art. However, validating the emulator's ability to faithfully reproduce the original behavior of digital objects is complicated.
   This article presents an evaluation framework and a set of tests that allow assessment of the degree to which system emulation preserves original characteristics and thus significant properties of digital artifacts. The original system, hardware, and software properties are described. An identical environment is then recreated via emulation. Automated user input is used to eliminate potential confounders. The properties of a rendered form of the object are then extracted automatically or manually either in a target state, a series of states, or as a continuous stream. The concepts described in this article enable preservation planners to evaluate how emulation affects the behavior of digital objects compared to their behavior in the original environment. We also review how these principles can and should be applied to the evaluation of migration and other preservation strategies as a general principle of evaluating the invocation and faithful rendering of digital objects and systems. The article concludes with design requirements for emulators developed for digital preservation tasks.

TOIS 2012-08 Volume 30 Issue 3

Special issue on searching speech 15
  Martha Larson; Franciska de Jong; Wessel Kraaij; Steve Renals
Direct posterior confidence for out-of-vocabulary spoken term detection 16
  Dong Wang; Simon King; Joe Frankel; Ravichander Vipperla; Nicholas Evans; Raphaël Troncy
Spoken term detection (STD) is a key technology for spoken information retrieval. As compared to the conventional speech transcription and keyword spotting, STD is an open-vocabulary task and has to address out-of-vocabulary (OOV) terms. Approaches based on subword units, for example phones, are widely used to solve the OOV issue; however, performance on OOV terms is still substantially inferior to that of in-vocabulary (INV) terms. The performance degradation on OOV terms can be attributed to a multitude of factors. One particular factor we address in this article is the unreliable confidence estimation caused by weak acoustic and language modeling due to the absence of OOV terms in the training corpora. We propose a direct posterior confidence derived from a discriminative model, such as multilayer perceptron (MLP). The new confidence considers a wide-range acoustic context which is usually important for speech recognition and retrieval; moreover, it localizes on detected speech segments and therefore avoids the impact of long-span word context which is usually unreliable for OOV term detection.
   In this article, we first develop an extensive discussion about the modeling weakness problem associated with OOV terms, and then propose our approach to address this problem based on direct posterior confidence. Our experiments, carried out on spontaneous and conversational multiparty meeting speech, demonstrate that the proposed technique provides a significant improvement in STD performance as compared to conventional lattice-based confidence, in particular for OOV terms. Furthermore, the new confidence estimation approach is fused with other advanced techniques for OOV treatment, such as stochastic pronunciation modeling and discriminative confidence normalization. This leads to an integrated solution for OOV term detection that results in a large performance improvement.
The nonverbal structure of patient case discussions in multidisciplinary medical team meetings 17
  Saturnino Luz
Meeting analysis has a long theoretical tradition in social psychology, with established practical ramifications in computer science, especially in computer supported cooperative work. More recently, a good deal of research has focused on the issues of indexing and browsing multimedia records of meetings. Most research in this area, however, is still based on data collected in laboratories, under somewhat artificial conditions. This article presents an analysis of the discourse structure and spontaneous interactions at real-life multidisciplinary medical team meetings held as part of the work routine in a major hospital. It is hypothesized that the conversational structure of these meetings, as indicated by sequencing and duration of vocalizations, enables segmentation into individual patient case discussions. The task of segmenting audio-visual records of multidisciplinary medical team meetings is described as a topic segmentation task, and a method for automatic segmentation is proposed. An empirical evaluation based on hand labelled data is presented, which determines the optimal length of vocalization sequences for segmentation, and establishes the competitiveness of the method with approaches based on more complex knowledge sources. The effectiveness of Bayesian classification as a segmentation method, and its applicability to meeting segmentation in other domains are discussed.
Comparison of methods for language-dependent and language-independent query-by-example spoken term detection 18
  Javier Tejedor; Michal Fapšo; Igor Szöke; Jan "Honza" Černocký; František Grézl
This article investigates query-by-example (QbE) spoken term detection (STD), in which the query is not entered as text, but selected in speech data or spoken. Two feature extractors based on neural networks (NN) are introduced: the first producing phone-state posteriors and the second making use of a compressive NN layer. They are combined with three different QbE detectors: while the Gaussian mixture model/hidden Markov model (GMM/HMM) and dynamic time warping (DTW) both work on continuous feature vectors, the third one, based on weighted finite-state transducers (WFST), processes phone lattices. QbE STD is compared to two standard STD systems with text queries: acoustic keyword spotting and WFST-based search of phone strings in phone lattices. The results are reported on four languages (Czech, English, Hungarian, and Levantine Arabic) using standard metrics: equal error rate (EER) and two versions of popular figure-of-merit (FOM). Language-dependent and language-independent cases are investigated; the latter being particularly interesting for scenarios lacking standard resources to train speech recognition systems. While the DTW and GMM/HMM approaches produce the best results for a language-dependent setup depending on the target language, the GMM/HMM approach performs the best dealing with a language-independent setup. As far as WFSTs are concerned, they are promising as they allow for indexing and fast search.
Sibyl, a factoid question-answering system for spoken documents 19
  Pere R. Comas; Jordi Turmo; Lluís Màrquez
In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, comparing manual with automatic transcripts obtained by three different automatic speech recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts, unless the ASR quality is very low. At the same time, our experiments on coreference resolution reveal that the state-of-the-art technology is not mature enough to be effectively exploited for QA with spoken documents. Overall, the performance of Sibyl is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

TOIS 2012-11 Volume 30 Issue 4

An Online Learning Framework for Refining Recency Search Results with User Click Feedback 20
  Taesup Moon; Wei Chu; Lihong Li; Zhaohui Zheng; Yi Chang
Traditional machine-learned ranking systems for Web search are often trained to capture stationary relevance of documents to queries, and thus have limited ability to track nonstationary user intention in a timely manner. In recency search, for instance, the relevance of documents to a query on breaking news often changes significantly over time, requiring effective adaptation to user intention. In this article, we focus on recency search and study a number of algorithms to improve ranking results by leveraging user click feedback. Our contributions are threefold. First, we use commercial search engine sessions collected in a random exploration bucket for reliable offline evaluation of these algorithms, which provides an unbiased comparison across algorithms without online bucket tests. Second, we propose an online learning approach that reranks and improves the search results for recency queries in near real time based on user clicks. This approach is very general and can be combined with sophisticated click models. Third, our empirical comparison of a dozen algorithms on real-world search data suggests the importance of a few algorithmic choices in these applications, including generalization across different query-document pairs, specialization to popular queries, and near real-time adaptation of user clicks for reranking.
Detecting and Tracking Topics and Events from Web Search Logs 21
  Hongyan Liu; Jun He; Yingqin Gu; Hui Xiong; Xiaoyong Du
Recent years have witnessed increased efforts on detecting topics and events from Web search logs, since this kind of data not only captures Web content but also reflects the users' activities. However, the majority of existing work is focused on exploiting clustering techniques for topic and event detection. Due to the huge size and the evolving nature of Web data, existing clustering approaches are limited in their ability to meet the real-time demand. To that end, in this article, we propose a method called LETD to detect evolving topics in a timely manner. Also, we design techniques to extract events from topics and to infer the evolving relationship among the events. For topic detection, we first provide a measurement to select the important URLs, which are most likely to describe a real-life topic. Then, starting from these selected URLs, we exploit the local expansion method to find other topic-related URLs. Moreover, in the LETD framework, we design algorithms based on Random Walk and Markov Random Fields (MRF), respectively. Because the LETD method exploits a divide-and-conquer strategy to process the data, it is more efficient than existing methods based on clustering techniques. To better illustrate the LETD framework, we develop a demo system, StoryTeller, which can discover hot topics and events, infer the evolving relationships among events, and visualize information in a storytelling way. This demo system can provide a global view of the topic development and help users target the interesting events more conveniently. Finally, experimental results on real-world Microsoft click-through data show that StoryTeller can find real-life hot topics and meaningful evolving relationships among events, and demonstrate the efficiency and effectiveness of the LETD method.
Detecting Fake Medical Web Sites Using Recursive Trust Labeling 22
  Ahmed Abbasi; Fatemeh "Mariam" Zahedi; Siddharth Kaza
Fake medical Web sites have become increasingly prevalent. Consequently, much of the health-related information and advice available online is inaccurate and/or misleading. Scores of medical institution Web sites are for organizations that do not exist and more than 90% of online pharmacy Web sites are fraudulent. In addition to monetary losses exacted on unsuspecting users, these fake medical Web sites have severe public safety ramifications. According to a World Health Organization report, approximately half the drugs sold on the Web are counterfeit, resulting in thousands of deaths. In this study, we propose an adaptive learning algorithm called recursive trust labeling (RTL). RTL uses underlying content and graph-based classifiers, coupled with a recursive labeling mechanism, for enhanced detection of fake medical Web sites. The proposed method was evaluated on a test bed encompassing nearly 100 million links between 930,000 Web sites, including 1,000 known legitimate and fake medical sites. The experimental results revealed that RTL was able to significantly improve fake medical Web site detection performance over 19 comparison content and graph-based methods, various meta-learning techniques, and existing adaptive learning approaches, with an overall accuracy of over 94%. Moreover, RTL was able to attain high performance levels even when the training dataset was composed of as few as 30 Web sites. With the increased popularity of eHealth and Health 2.0, the results have important implications for online trust, security, and public safety.
Stability of Recommendation Algorithms BIBAFull-Text 23
  Gediminas Adomavicius; Jingjing Zhang
The article explores stability as a new measure of recommender system performance. Stability is defined to measure the extent to which a recommendation algorithm provides predictions that are consistent with each other. Specifically, for a stable algorithm, adding some of the algorithm's own predictions to the algorithm's training data (for example, if these predictions were confirmed as accurate by users) would not invalidate or change the other predictions. While stability is an interesting theoretical property that can provide additional understanding about recommendation algorithms, we believe stability to be a desired practical property for recommender system designers as well, because unstable recommendations can potentially decrease users' trust in recommender systems and, as a result, reduce users' acceptance of recommendations. In this article, we also provide an extensive empirical evaluation of stability for six popular recommendation algorithms on four real-world datasets. Our results suggest that stability performance of individual recommendation algorithms is consistent across a variety of datasets and settings. In particular, we find that model-based recommendation algorithms consistently demonstrate higher stability than neighborhood-based collaborative filtering techniques. In addition, we perform a comprehensive empirical analysis of many important factors (e.g., the sparsity of original rating data, normalization of input data, the number of new incoming ratings, the distribution of incoming ratings, and the distribution of evaluation data) and report the impact they have on recommendation stability.
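The stability notion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy item-mean predictor, and the use of a mean absolute difference between the two rounds of predictions are all assumptions made for illustration. The sketch trains a predictor, adds some of its own predictions to the training data, retrains, and measures how much the remaining predictions moved.

```python
from statistics import mean

def item_mean_predictor(ratings):
    """Toy predictor (an assumption, not from the article): predict an item's
    mean rating, falling back to the global mean for unseen items."""
    global_mean = mean(ratings.values())
    by_item = {}
    for (_user, item), rating in ratings.items():
        by_item.setdefault(item, []).append(rating)
    item_means = {item: mean(rs) for item, rs in by_item.items()}
    return lambda user, item: item_means.get(item, global_mean)

def prediction_shift(train, unknown_pairs, added_pairs, fit=item_mean_predictor):
    """Mean absolute change in predictions after the algorithm's own
    predictions for `added_pairs` are added to the training data.
    A value of 0 means the remaining predictions did not change."""
    predict_before = fit(train)
    first = {pair: predict_before(*pair) for pair in unknown_pairs}

    # Treat the chosen predictions as if users had confirmed them.
    augmented = dict(train)
    augmented.update({pair: first[pair] for pair in added_pairs})
    predict_after = fit(augmented)

    remaining = [pair for pair in unknown_pairs if pair not in added_pairs]
    return mean(abs(predict_after(*pair) - first[pair]) for pair in remaining)

# Hypothetical data: keys are (user, item), values are ratings.
train = {("u1", "a"): 4, ("u2", "a"): 2, ("u1", "b"): 5}
unknown = [("u2", "b"), ("u3", "a"), ("u3", "b")]
shift = prediction_shift(train, unknown, added_pairs=[("u2", "b")])
```

Any callable with the same shape as `item_mean_predictor` can be passed as `fit`, so different algorithms can be compared under this sketch on the same data.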
Sentimental Spidering: Leveraging Opinion Information in Focused Crawlers BIBAFull-Text 24
  Tianjun Fu; Ahmed Abbasi; Daniel Zeng; Hsinchun Chen
Despite the increased prevalence of sentiment-related information on the Web, there has been limited work on focused crawlers capable of effectively collecting not only topic-relevant but also sentiment-relevant content. In this article, we propose a novel focused crawler that incorporates topic and sentiment information as well as a graph-based tunneling mechanism for enhanced collection of opinion-rich Web content regarding a particular topic. The graph-based sentiment (GBS) crawler uses a text classifier that employs both topic and sentiment categorization modules to assess the relevance of candidate pages. This information is also used to label nodes in web graphs that are employed by the tunneling mechanism to improve collection recall. Experimental results on two test beds revealed that GBS was able to provide better precision and recall than seven comparison crawlers. Moreover, GBS was able to collect a large proportion of the relevant content after traversing far fewer pages than comparison methods. GBS outperformed comparison methods on various categories of Web pages in the test beds, including collection of blogs, Web forums, and social networking Web site content. Further analysis revealed that both the sentiment classification module and graph-based tunneling mechanism played an integral role in the overall effectiveness of the GBS crawler.
Efficient Entity Translation Mining: A Parallelized Graph Alignment Approach BIBAFull-Text 25
  Gae-Won You; Seung-Won Hwang; Young-In Song; Long Jiang; Zaiqing Nie
This article studies the problem of mining entity translations, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) transliteration-based approaches that leverage phonetic similarity and (b) corpus-based approaches that exploit bilingual cooccurrences. These approaches suffer from inaccuracy and scarcity, respectively. In clear contrast, we use under-leveraged resources of monolingual entity cooccurrences crawled from entity search engines, which are represented as two entity-relationship graphs extracted from two language corpora, respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach to exploiting both transliteration similarity and monolingual cooccurrences. This approach, which builds upon monolingual corpora, complements existing corpus-based work requiring scarce resources of parallel or comparable corpora while significantly boosting the accuracy of transliteration-based work. In addition, by parallelizing the mapping process on multicore architectures, we speed up the computation by more than 10 times per unit accuracy. We validated the effectiveness and efficiency of our proposed approach using real-life datasets.
Aggregation Methods for Proximity-Based Opinion Retrieval BIBAFull-Text 26
  Shima Gerani; Mark Carman; Fabio Crestani
The enormous amount of user-generated data available on the Web provides a great opportunity to understand, analyze, and exploit people's opinions on different topics. Traditional Information Retrieval methods consider the relevance of documents to a topic but are unable to differentiate between subjective and objective documents. Opinion retrieval is a retrieval task in which not only the relevance of a document to the topic is important but also the amount of opinion expressed in the document about the topic. In this article, we address the blog post opinion retrieval task and propose methods that rank blog posts according to their relevance and opinionatedness toward a topic. We propose estimating the opinion density at each position in a document using a general opinion lexicon and kernel density functions. We propose and investigate different models for aggregating the opinion density at query term positions to estimate the opinion score of every document. We then combine the opinion score with the relevance score based on a probabilistic justification. Experimental results on the BLOG06 dataset show that the proposed method provides significant improvement over the standard TREC baselines. The proposed models also achieve much higher performance compared to all state-of-the-art methods.
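The proximity-based idea in this abstract can be illustrated with a rough sketch. It is not the authors' model: the Gaussian kernel, the `sigma` bandwidth, the non-negative lexicon weights, and the two aggregation choices (average and maximum) are assumptions chosen for illustration. Each opinion word spreads its weight over nearby positions, and the densities found at query term positions are aggregated into a document-level opinion score.

```python
import math

def gaussian_kernel(distance, sigma):
    """Weight that decays with distance between two token positions."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))

def opinion_density(tokens, lexicon, position, sigma=2.0):
    """Sum of lexicon weights of all opinion words, each discounted by its
    distance from `position`."""
    return sum(
        lexicon[token] * gaussian_kernel(position - j, sigma)
        for j, token in enumerate(tokens)
        if token in lexicon
    )

def opinion_score(tokens, query_terms, lexicon, sigma=2.0, aggregate="avg"):
    """Aggregate the opinion density over the positions of query terms.
    Returns 0.0 when the document contains no query term."""
    positions = [i for i, token in enumerate(tokens) if token in query_terms]
    if not positions:
        return 0.0
    densities = [opinion_density(tokens, lexicon, i, sigma) for i in positions]
    if aggregate == "max":
        return max(densities)
    return sum(densities) / len(densities)

# Hypothetical document and lexicon (weights are opinion strength, not polarity).
tokens = "the camera is great but the battery is awful".split()
lexicon = {"great": 1.0, "awful": 1.0}
camera_score = opinion_score(tokens, {"camera"}, lexicon)
```

In this sketch a query term that sits close to opinion words scores higher than one far from them, which is the intuition behind using proximity rather than a document-wide opinion count.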