
ACM Transactions on Information Systems 25

Editors: Gary Marchionini
Standard No: ISSN 1046-8188; HF S548.125 A33
Links: Table of Contents
  1. TOIS 2007 Volume 25 Issue 1
  2. TOIS 2007 Volume 25 Issue 2
  3. TOIS 2007 Volume 25 Issue 3
  4. TOIS 2007 Volume 25 Issue 4

TOIS 2007 Volume 25 Issue 1

Precision recall with user modeling (PRUM): Application to structured information retrieval BIBAFull-Text 1
  B. Piwowarski; P. Gallinari; G. Dupret
Standard Information Retrieval (IR) metrics are not well suited for new paradigms like XML or Web IR in which retrievable information units are document elements and/or sets of related documents. Part of the problem stems from classical hypotheses about user models: they do not take into account the structural or logical context of document elements or the possibility of navigation between units. This article proposes an explicit and formal user model that encompasses a large variety of user behaviors. Based on this model, we extend the probabilistic precision-recall metric to deal with the new IR paradigms.
Named entity translation matching and learning: With application for mining unseen translations BIBAFull-Text 2
  Wai Lam; Shing-Kit Chan; Ruizhang Huang
This article introduces a named entity matching model that makes use of both semantic and phonetic evidence. The matching of semantic and phonetic information is captured by a unified framework via a bipartite graph model. By considering various technical challenges of the problem, including order insensitivity and partial matching, this approach is less rigid than existing approaches and highly robust. One major component is a phonetic matching model which exploits similarity at the phoneme level. Two learning algorithms for learning the similarity information of basic phonemic matching units based on training examples are investigated. By applying the proposed named entity matching model, a mining system is developed for discovering new named entity translations from daily Web news. The system is able to discover new name translations that cannot be found in the existing bilingual dictionary.
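The order-insensitive, partial matching over a bipartite graph described above can be sketched roughly as follows. This is a hedged illustration, not the paper's algorithm: it uses a greedy assignment rather than an optimal bipartite matching, and the function names (`greedy_bipartite_score`, `sim`) and the toy phoneme similarity are invented for the example.

```python
def greedy_bipartite_score(units_a, units_b, sim):
    """Greedily pair phonetic units from two names, ignoring unit order,
    and return the total similarity of the matching, normalized by the
    longer name so that partial matches are penalized."""
    candidates = sorted(
        ((sim(a, b), i, j)
         for i, a in enumerate(units_a)
         for j, b in enumerate(units_b)),
        reverse=True)
    used_a, used_b, total = set(), set(), 0.0
    for s, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += s
    return total / max(len(units_a), len(units_b))

# Toy phoneme similarity: 1 for identical units, 0.5 for units sharing
# a first letter, 0 otherwise. The article learns these weights from
# training examples instead of fixing them by hand.
def sim(a, b):
    return 1.0 if a == b else (0.5 if a[0] == b[0] else 0.0)
```

Because the candidate pairs are ranked globally before assignment, the score is the same whichever name supplies the left side of the bipartite graph, which is the order-insensitivity property the abstract mentions.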
An empirical investigation of user term feedback in text-based targeted image search BIBAFull-Text 3
  Joyce Y. Chai; Chen Zhang; Rong Jin
Text queries are natural and intuitive for users to describe their information needs. However, text-based image retrieval faces many challenges. Traditional text retrieval techniques on image descriptions have not been very successful. This is mainly due to the inconsistent textual descriptions and the discrepancies between user queries and terms in the descriptions. To investigate strategies to alleviate this vocabulary problem, this article examines the role of user term feedback in targeted image search that is based on text-based image retrieval. Term feedback refers to the feedback from a user on specific terms regarding their relevance to a target image. Previous studies have indicated the effectiveness of term feedback in interactive text retrieval. However, in our experiments on text-based image retrieval, the term feedback has not been shown to be effective. Our results indicate that, although term feedback has a positive effect by allowing users to identify more relevant terms, it also has a strong negative effect by providing more opportunities for users to specify irrelevant terms. To understand these different effects and their implications, this article further analyzes important factors that contribute to the utility of term feedback and discusses the outlook of term feedback in interactive text-based image retrieval.
Creating and exploiting a comparable corpus in cross-language information retrieval BIBAFull-Text 4
  Tuomas Talvensaari; Jorma Laurikkala; Kalervo Järvelin; Martti Juhola; Heikki Keskustalo
We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin. In this study, we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with best resolution power were extracted from the documents of one collection, the source collection, by using the relative average term frequency (RATF) value. The keys were translated into the language of the other collection, the target collection, with a dictionary-based query translation program. The translated queries were run against the target collection and an alignment pair was made if the retrieved documents matched given date and similarity score criteria. The resulting comparable collection was used as a similarity thesaurus to translate queries along with a dictionary-based translator. The combined approaches outperformed translation schemes where dictionary-based translation or corpus translation was used alone.
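Key extraction by resolution power can be sketched as below. The exact RATF formula is not given in the abstract, so `ratf_like` is a simplified stand-in (collection frequency over document frequency, damped by a log of document frequency); the `pivot` constant and both function names are assumptions for illustration only.

```python
import math

def ratf_like(cf, df, pivot=100.0):
    """Simplified stand-in for a relative-average-term-frequency score:
    terms that occur often in the collection (high cf) but in few
    documents (low df) get high resolution power; the log damps the
    score for very common terms. The paper's actual RATF formula
    may differ in its constants."""
    return (cf / df) * 1000.0 / math.log(df + pivot)

def top_keys(term_stats, n=3):
    """Pick the n best source-collection keys to translate and run
    against the target collection. term_stats maps
    term -> (collection_freq, document_freq)."""
    return sorted(term_stats,
                  key=lambda t: ratf_like(*term_stats[t]),
                  reverse=True)[:n]
```

In the alignment step these top keys would be dictionary-translated, run as queries against the target collection, and paired with retrieved documents that satisfy the date and similarity-score criteria.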
Interest-based personalized search BIBAFull-Text 5
  Zhongming Ma; Gautam Pant; Olivia R. Liu Sheng
Web search engines typically provide search results without considering user interests or context. We propose a personalized search approach that can easily extend a conventional search engine on the client side. Our mapping framework automatically maps a set of known user interests onto a group of categories in the Open Directory Project (ODP) and takes advantage of manually edited data available in ODP for training text classifiers that correspond to those interests and can therefore categorize and personalize search results accordingly. In two sets of controlled experiments, we compare our personalized categorization system (PCAT) with a list interface system (LIST) that mimics a typical search engine and with a nonpersonalized categorization system (CAT). In both experiments, we analyze system performance on the basis of the type of task and query length. We find that PCAT is preferable to LIST for information gathering types of tasks and for searches with short queries, and PCAT outperforms CAT in both information gathering and finding types of tasks, and for searches associated with free-form queries. From the subjects' answers to a questionnaire, we find that PCAT is perceived as a system that can find relevant Web pages more quickly and easily than LIST and CAT.

TOIS 2007 Volume 25 Issue 2

An exploration of the principles underlying redundancy-based factoid question answering BIBAFull-Text 6
  Jimmy Lin
The so-called "redundancy-based" approach to question answering represents a successful strategy for mining answers to factoid questions such as "Who shot Abraham Lincoln?" from the World Wide Web. Through contrastive and ablation experiments with Aranea, a system that has performed well in several TREC QA evaluations, this work examines the underlying assumptions and principles behind redundancy-based techniques. Specifically, we develop two theses: that stable characteristics of data redundancy allow factoid systems to rely on external "black box" components, and that despite embodying a data-driven approach, redundancy-based methods encode a substantial amount of knowledge in the form of heuristics. Overall, this work attempts to address the broader question of "what really matters" and to provide guidance for future researchers.
Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search BIBAFull-Text 7
  Thorsten Joachims; Laura Granka; Bing Pan; Helene Hembrooke; Filip Radlinski; Geri Gay
This article examines the reliability of implicit feedback generated from clickthrough data and query reformulations in World Wide Web (WWW) search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average. We find that such relative preferences are accurate not only between results from an individual query, but across multiple sets of results within chains of query reformulations.
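One of the relative-preference strategies of this kind, often described as "click > skip above", can be sketched as follows. The function name and data shapes are illustrative; the article evaluates several such strategies against eyetracking and manual judgments.

```python
def skip_above_preferences(ranking, clicked):
    """Derive relative relevance preferences from clickthrough data:
    a clicked result is inferred to be preferred over every
    higher-ranked result that was presented but skipped (not clicked).
    Returns (preferred, less_preferred) document pairs."""
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for skipped in ranking[:pos]:
                if skipped not in clicked:
                    prefs.append((doc, skipped))
    return prefs
```

The point of extracting *pairs* rather than absolute labels is exactly the finding above: clicks are biased as absolute judgments (users trust higher ranks), but the relative preference "clicked over skipped-above" is reasonably accurate on average.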
Soft pattern matching models for definitional question answering BIBAFull-Text 8
  Hang Cui; Min-Yen Kan; Tat-Seng Chua
We explore probabilistic lexico-syntactic pattern matching, also known as soft pattern matching, in a definitional question answering system. Most current systems use regular expression-based hard matching patterns to identify definition sentences. Such rigid surface matching often fares poorly when faced with language variations. We propose two soft matching models to address this problem: one based on bigrams and the other on the Profile Hidden Markov Model (PHMM). Both models provide a theoretically sound method to model pattern matching as a probabilistic process that generates token sequences. We demonstrate the effectiveness of the models on definition sentence retrieval for definitional question answering. We show that both models significantly outperform the state-of-the-art manually constructed hard matching patterns on recent TREC data.
   A critical difference between the two models is that the PHMM has a more complex topology. We experimentally show that the PHMM can handle language variations more effectively but requires more training data to converge.
   While we evaluate soft pattern models only on definitional question answering, we believe that both models are generic and can be extended to other areas where lexico-syntactic pattern matching can be applied.
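The bigram variant of soft pattern matching can be sketched as a small generative scorer. This is a hedged simplification: the class name, the interpolation weight, and the `<DT>` slot token are assumptions, and the paper's actual model includes details (e.g., slot handling and smoothing) omitted here.

```python
from collections import defaultdict

class BigramSoftPattern:
    """Minimal bigram soft-pattern scorer: training instances (token
    sequences around a <DT> definition-term slot) induce unigram and
    bigram counts; a candidate sequence is scored by interpolated
    bigram probability, so near-miss variants still score > 0."""

    def __init__(self, lam=0.7):
        self.lam = lam                      # bigram/unigram interpolation
        self.uni = defaultdict(int)
        self.bi = defaultdict(int)
        self.total = 0

    def train(self, sequences):
        for seq in sequences:
            for i, tok in enumerate(seq):
                self.uni[tok] += 1
                self.total += 1
                if i > 0:
                    self.bi[(seq[i - 1], tok)] += 1

    def prob(self, prev, tok):
        p_uni = self.uni[tok] / self.total if self.total else 0.0
        p_bi = self.bi[(prev, tok)] / self.uni[prev] if self.uni[prev] else 0.0
        return self.lam * p_bi + (1 - self.lam) * p_uni

    def score(self, seq):
        p = self.uni[seq[0]] / self.total if self.total else 0.0
        for prev, tok in zip(seq, seq[1:]):
            p *= self.prob(prev, tok)
        return p
```

Unlike a hard regular-expression pattern, a sequence that deviates from the training patterns is not rejected outright; it simply receives a lower generative probability, which is what makes the matching "soft".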
Automatic classification of Web queries using very large unlabeled query logs BIBAFull-Text 9
  Steven M. Beitzel; Eric C. Jensen; David D. Lewis; Abdur Chowdhury; Ophir Frieder
Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.
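The shape of such a classifier combination can be sketched as a simple cascade. This is only an illustration under assumptions: the article combines its components with supervised weighting rather than a strict back-off, and all names here (`classify_query`, the toy databases) are hypothetical.

```python
def classify_query(query, exact_db, partial_db, learned):
    """Hedged sketch of combining query classifiers: exact lookup in a
    table of manually classified frequent queries first, then partial
    (substring) matching, then a learned fallback classifier trained
    from the query log."""
    if query in exact_db:                      # exact match lookup
        return exact_db[query]
    for phrase, label in partial_db.items():   # partial matching
        if phrase in query:
            return label
    return learned(query)                      # learned model fallback
```

The cascade mirrors the trade-off in the abstract: the precise hand-labeled tables have high precision but low coverage, while the learned components recover recall on the long tail of unseen queries.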

TOIS 2007 Volume 25 Issue 3

Answering XML queries by means of data summaries BIBAFull-Text 10
  Elena Baralis; Paolo Garza; Elisa Quintarelli; Letizia Tanca
XML is a rather verbose representation of semistructured data, which may require huge amounts of storage space. We propose a summarized representation of XML data, based on the concept of instance pattern, which can both provide succinct information and be directly queried. The physical representation of instance patterns exploits itemsets or association rules to summarize the content of XML datasets. Instance patterns may be used for (possibly partially) answering queries, either when fast and approximate answers are required, or when the actual dataset is not available, for example, because it is temporarily unreachable. Experiments on large XML documents show that instance patterns allow a significant reduction in storage space, while preserving almost entirely the completeness of the query result. Furthermore, they provide fast query answers and show good scalability on the size of the dataset, thus overcoming the document size limitation of most current XQuery engines.
Online supervised spam filter evaluation BIBAFull-Text 11
  Gordon V. Cormack; Thomas R. Lynam
Eleven variants of six widely used open-source spam filters are tested on a chronological sequence of 49086 e-mail messages received by an individual from August 2003 through March 2004. Our approach differs from those previously reported in that the test set is large, comprises uncensored raw messages, and is presented to each filter sequentially with incremental feedback. Misclassification rates and Receiver Operating Characteristic Curve measurements are reported, with statistical confidence intervals. Quantitative results indicate that content-based filters can eliminate 98% of spam while incurring 0.1% legitimate email loss. Qualitative results indicate that the risk of loss depends on the nature of the message, and that messages likely to be lost may be those that are less critical. More generally, our methodology has been encapsulated in a free software toolkit, which may be used to conduct similar experiments.
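The online evaluation protocol (classify each message in chronological order, then reveal the gold label as incremental feedback) can be sketched as below. The toy `MemoFilter` and all names are illustrative, not part of the article's toolkit.

```python
def online_evaluate(filter_, messages):
    """Present messages chronologically; the filter classifies each one,
    then immediately receives the gold label as incremental feedback.
    Returns (ham misclassification rate, spam misclassification rate)."""
    ham_err = spam_err = ham = spam = 0
    for text, gold in messages:           # gold is "ham" or "spam"
        guess = filter_.classify(text)
        if gold == "ham":
            ham += 1
            ham_err += guess != "ham"
        else:
            spam += 1
            spam_err += guess != "spam"
        filter_.train(text, gold)         # incremental feedback
    return ham_err / max(ham, 1), spam_err / max(spam, 1)

class MemoFilter:
    """Toy incremental filter for the sketch: flags a message as spam
    if it contains any token previously seen in a spam message."""
    def __init__(self):
        self.spam_tokens = set()

    def classify(self, text):
        return "spam" if self.spam_tokens & set(text.split()) else "ham"

    def train(self, text, gold):
        if gold == "spam":
            self.spam_tokens |= set(text.split())
```

Sequential presentation with feedback matters because, unlike batch cross-validation, it reproduces the deployment condition: every message is classified using only information available before it arrived.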
Discovering personally meaningful places: An interactive clustering approach BIBAFull-Text 12
  Changqing Zhou; Dan Frankowski; Pamela Ludford; Shashi Shekhar; Loren Terveen
Discovering a person's meaningful places involves obtaining the physical locations, and the labels, of the places that matter to that person's daily life and routines. This problem is driven by the requirements of emerging location-aware applications, which allow a user to pose queries and obtain information in reference to places, for example, "home", "work" or "Northwest Health Club". It is a challenge to map from physical locations to personally meaningful places due to a limited understanding of what constitutes users' personally meaningful places. Previous work has explored algorithms to discover personal places from location data. However, we know of no systematic empirical evaluations of these algorithms, leaving designers of location-aware applications in the dark about their choices.
   Our work remedies this situation. We extended a clustering algorithm to discover places. We also defined a set of essential evaluation metrics and an interactive evaluation framework. We then conducted a large-scale experiment that collected real users' location data and personally meaningful places, and illustrated the utility of our evaluation framework. Our results establish a baseline that future work can measure itself against. They also demonstrate that our algorithm discovers places with reasonable accuracy and outperforms the well-known K-Means clustering algorithm for place discovery. Finally, we provide evidence that shapes more complex than "points" are required to represent the full range of people's everyday places.
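Place discovery from raw location fixes can be sketched as a small density-style clustering. This is a hedged stand-in for the article's extended algorithm, not a reimplementation of it: the greedy centroid test, the `eps` and `min_pts` names, and the degree-based distance are all simplifying assumptions.

```python
def discover_places(points, eps=0.001, min_pts=5):
    """Naive place discovery: greedily grow clusters of (lat, lon) GPS
    fixes that lie within `eps` degrees of a cluster's running centroid;
    clusters with fewer than `min_pts` fixes are discarded as travel
    noise rather than places."""
    clusters = []
    for lat, lon in points:
        for c in clusters:
            clat = sum(p[0] for p in c) / len(c)
            clon = sum(p[1] for p in c) / len(c)
            if abs(lat - clat) <= eps and abs(lon - clon) <= eps:
                c.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])
    return [c for c in clusters if len(c) >= min_pts]
```

Note that the output clusters are point sets rather than single coordinates; the finding above that "points" are insufficient suggests a real system should retain such extents (or richer shapes) when representing a place.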
On setting the hyper-parameters of term frequency normalization for information retrieval BIBAFull-Text 13
  Ben He; Iadh Ounis
The setting of the term frequency normalization hyper-parameter suffers from the query dependence and collection dependence problems, which can significantly hurt the robustness of retrieval performance. Our study in this article investigates three term frequency normalization methods, namely normalization 2, BM25's normalization and the Dirichlet Priors normalization. We tackle the query dependence problem by modifying the query term weight using a Divergence From Randomness term weighting model, and tackle the collection dependence problem by measuring the correlation of the normalized term frequency with the document length. Our research hypotheses for the two problems, as well as an automatic hyper-parameter setting methodology, are extensively validated and evaluated on four Text REtrieval Conference (TREC) collections.
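To make the hyper-parameter concrete, here is the standard BM25 term-frequency component, one of the three normalization methods named above; the function name and default constants are illustrative, but the formula itself is the well-known BM25 saturation term.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25's term-frequency normalization component.
    The hyper-parameter b in [0, 1] controls how strongly raw tf is
    normalized by document length relative to the collection average;
    choosing b per query and per collection is exactly the tuning
    problem the article studies."""
    K = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + K)
```

With b = 0 document length is ignored entirely, while b = 1 fully normalizes tf by relative document length; the "right" value sits in between and shifts with the query and the collection, which is why a fixed default is fragile.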
Temporal profiles of queries BIBAFull-Text 14
  Rosie Jones; Fernando Diaz
Documents with timestamps, such as email and news, can be placed along a timeline. The timeline for a set of documents returned in response to a query gives an indication of how documents relevant to that query are distributed in time. Examining the timeline of a query result set allows us to characterize both how temporally dependent the topic is, as well as how relevant the results are likely to be. We outline characteristic patterns in query result set timelines, and show experimentally that we can automatically classify documents into these classes. We also show that properties of the query result set timeline can help predict the mean average precision of a query. These results show that meta-features associated with a query can be combined with text retrieval techniques to improve our understanding and treatment of text search on documents with timestamps.
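Building a query's timeline from the timestamps of its result set can be sketched as below. The bucketing, the `peakiness` statistic, and all names are illustrative assumptions; the article derives richer timeline features for its classification and precision-prediction experiments.

```python
from collections import Counter

def temporal_profile(timestamps, bucket_days=1):
    """Bucket the timestamps (days since some epoch) of the top
    retrieved documents into a normalized timeline: the query's
    temporal profile. A spiky profile suggests a temporally
    dependent topic; a flat one suggests an atemporal topic."""
    counts = Counter(t // bucket_days for t in timestamps)
    total = sum(counts.values())
    return {day: n / total for day, n in sorted(counts.items())}

def peakiness(profile):
    """Fraction of probability mass in the single busiest bucket --
    a crude stand-in for the article's timeline shape features."""
    return max(profile.values())
```

A feature like `peakiness`, computed per query, is the kind of meta-feature that can then be fed alongside standard retrieval scores to predict a query's mean average precision.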

TOIS 2007 Volume 25 Issue 4

Adaptive hypermedia through contextualized open hypermedia structures BIBAFull-Text 16
  Christopher Bailey; Wendy Hall; David E. Millard; Mark J. Weal
This article examines adaptive hypermedia (AH) from a contextually-aware open hypermedia (OH) perspective. We believe that a wide range of AH techniques can be supported with a small number of OH structures, which can be combined to create more complex effects, possibly simplifying the development of new AH systems.
   In this work we reexamine Brusilovsky's taxonomy of AH techniques from a structural OH perspective. We also show that it is possible to identify and model common structures across the taxonomy of adaptive techniques. An agent-based adaptive hypermedia system called HA3L is presented, which uses these OH structures to provide a straightforward implementation of a variety of adaptive hypermedia techniques. This enables us to reflect on the structural equivalence of many of the techniques, demonstrates the advantages of the OH approach, and can inform the design of future adaptive hypermedia systems.
ServiceFinder: A method towards enhancing service portals BIBAFull-Text 17
  Xiao Fang; Olivia R. Liu Sheng; Michael Chau
The rapid advancement of Internet technologies enables more and more educational institutes, companies, and government agencies to provide services, termed online services, through web portals. With hundreds of online services provided through a web portal, it is critical to design web portals, termed service portals, through which online services can be easily accessed by their consumers. This article addresses this critical issue from the perspective of service selection, that is, how to select a small number of service-links (i.e., hyperlinks pointing to online services) to be featured on the homepage of a service portal so that users can find the online services they seek most effectively. We propose a mathematically formulated metric to measure the effectiveness of the selected service-links in directing users to locate their desired online services and formally define the service selection problem. A solution method, ServiceFinder, is then proposed. Using real-world data obtained from the Utah State Government service portal, we show that ServiceFinder outperforms both the current practice of service selection and previous algorithms for adaptive website design. We also show that the performance of ServiceFinder is close to that of the optimal solution resulting from exhaustive search.
YASS: Yet another suffix stripper BIBAFull-Text 18
  Prasenjit Majumder; Mandar Mitra; Swapan K. Parui; Gobinda Kole; Pabitra Mitra; Kalyankumar Datta
Stemmers attempt to reduce a word to its stem or root form and are used widely in information retrieval tasks to increase the recall rate. Most popular stemmers encode a large number of language-specific rules built up over time. Such stemmers with comprehensive rules are available only for a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using the distance measures to identify these equivalence classes. The proposed approach is compared with the Porter and Lovins stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of the Porter and Lovins stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
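The clustering-by-string-distance idea can be sketched in the spirit of this approach. Heavy hedging applies: the article defines its own family of distances (D1-D4) and clustering procedure, whereas `prefix_distance`, the threshold, and the single-pass clustering below are simplified assumptions for illustration.

```python
def prefix_distance(a, b):
    """Illustrative string distance: words sharing a long common prefix
    relative to their lengths are close; words with no common prefix
    are infinitely far apart and never cluster together."""
    m = 0
    while m < min(len(a), len(b)) and a[m] == b[m]:
        m += 1
    if m == 0:
        return float("inf")
    return (max(len(a), len(b)) - m) / m

def cluster_lexicon(words, threshold=1.0):
    """Single-pass clustering of a lexicon into equivalence classes of
    putative morphological variants; each word is then mapped to the
    shortest member of its class as its stem."""
    classes = []
    for w in sorted(words):
        for c in classes:
            if prefix_distance(w, c[0]) <= threshold:
                c.append(w)
                break
        else:
            classes.append([w])
    return {w: min(c, key=len) for c in classes for w in c}
```

Because the method only needs the collection's own lexicon and a string distance, no hand-written language-specific rules are required, which is what makes the approach attractive for resource-poor languages.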
A novel XML music information retrieval method using graph invariants BIBAFull-Text 19
  Alberto Pinto; Goffredo Haus
The increasing diffusion of XML languages for the encoding of domain-specific multimedia information raises the need for new information retrieval models that can fully exploit structural information. An XML language specifically designed for music, such as MX, allows queries to be made directly on the thematic material. The main advantage of such a system is that it can handle symbolic, notational, and audio objects at the same time through a multilayered structure. On the model side, common music information retrieval methods do not take into account the inner structure of melodic themes and the metric relationships between notes.
   In this article we deal with two main topics: a novel architecture based on a new XML language for music and a new model of melodic themes based on graph theory.
   This model takes advantage of particular graph invariants that can be linked to melodic themes as metadata in order to characterize all their possible modifications through specific transformations and that can be exploited in filtering algorithms. We provide a similarity function and show through an evaluation stage how it improves existing methods, particularly in the case of same-structured themes.
Reducing human interactions in Web directory searches BIBAFull-Text 20
  Ori Gerstel; Shay Kutten; Eduardo Sany Laber; Rachel Matichin; David Peleg; Artur Alves Pessoa; Criston Souza
Consider a website containing a collection of webpages with data such as Yahoo! or the Open Directory Project. Each page is associated with a weight representing the frequency with which that page is accessed by users. In the tree hierarchy representation, accessing each page requires the user to travel along the path leading to it from the root. By enhancing the index tree with additional edges (hotlinks) one may reduce the access cost of the system. In other words, the hotlinks reduce the expected number of steps needed to reach a leaf page from the tree root, assuming that the user knows which hotlinks to take. The hotlink enhancement problem involves finding a set of hotlinks minimizing this cost.
   This article proposes the first exact algorithm for the hotlink enhancement problem. This algorithm runs in polynomial time for trees with logarithmic depth. Experiments conducted with real data show that significant improvement in the expected number of accesses per search can be achieved in websites using this algorithm. These experiments also suggest that the simple and much faster heuristic proposed previously by Czyzowicz et al. [2003] creates hotlinks that are nearly optimal in the time savings they provide to the user.
The version of the hotlink enhancement problem in which the weight distribution on the leaves is unknown is discussed as well. We present a polynomial-time algorithm that is optimal for trees of any depth.
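The cost function being minimized can be sketched directly from the problem statement above: the expected number of clicks to reach a weighted leaf, assuming the user knows which hotlinks to take. The function name and data shapes are illustrative; the article's exact algorithm for *choosing* the hotlinks is not reproduced here.

```python
from collections import deque

def expected_access_cost(children, weights, root, hotlinks=()):
    """Expected number of steps to reach a leaf from the root.
    `children` maps node -> list of children, `weights` maps
    leaf -> access frequency, and `hotlinks` is a set of extra (u, v)
    shortcut edges. Since the user is assumed to know which hotlinks
    to take, a leaf's cost is its shortest distance in the augmented
    graph, found by breadth-first search."""
    adj = {u: list(cs) for u, cs in children.items()}
    for u, v in hotlinks:
        adj.setdefault(u, []).append(v)
    dist, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(weights.values())
    return sum(w * dist[leaf] for leaf, w in weights.items()) / total
```

Evaluating candidate hotlink sets with such a cost function is the inner loop of both the exact algorithm and the Czyzowicz et al. heuristic compared in the experiments.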