Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Fullname: Proceedings of the 26th International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: Jamie Callan; David Hawking; Alan Smeaton
Location: Toronto, Canada
Dates: 2003-Jul-28 to 2003-Aug-01
Standard No: ISBN 1-58113-646-3; ACM Order Number: 534032; hcibib: IR03
  1. Keynote Address
  2. Salton Award Lecture
  3. Retrieval models
  4. Question answering
  5. Web
  6. Human interaction
  7. Text categorization
  8. Multimedia information retrieval
  9. Structured documents
  10. Text representation
  11. Text categorization
  12. Human interaction
  13. IR theory
  14. Filtering and retrieval models
  15. Clustering
  16. Distributed information retrieval
  17. Novelty and topic change
  18. Cross-lingual information retrieval
  19. Posters
  20. Demos

Keynote Address

Exploring, modeling, and using the web graph (p. 1)
  Andrei Broder
The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists.
   Recent results range from theoretical (e.g., models for the graph, semi-external algorithms), to experimental (e.g., new insights regarding the rate of change of pages, new data on the distribution of degrees), to practical (e.g., improvements in crawling technology).
   The goal of this talk is to convey an introduction to the state of the art in this area and to sketch the current issues in collecting, representing, analyzing, and modeling this graph. Although graph analytic methods are essential tools in the Web IR arsenal, they are well known to the SIGIR community and will not be discussed here in any detail; instead, we will explore some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph, in particular in dealing with legitimate and "spam" perturbations of the "natural" process of birth and death of nodes and links, and conversely, the challenges and opportunities of using graph methods in support of IR on the Web and in the enterprise.

Salton Award Lecture

Information retrieval and computer science: an evolving relationship (pp. 2-3)
  W. Bruce Croft
Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship to other fields. Although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS has been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR. Early ACM curriculum recommendations for CS contained courses on information retrieval, and encyclopedias described IR and database systems as different aspects of the same field.

Retrieval models

Bayesian extension to the language model for ad hoc information retrieval (pp. 4-9)
  Hugo Zaragoza; Djoerd Hiemstra; Michael Tipping
We propose a Bayesian extension to the ad-hoc Language Model. Many smoothed estimators used for the multinomial query model in ad-hoc Language Models (including Laplace and Bayes-smoothing) are approximations to the Bayesian predictive distribution. In this paper we derive the full predictive distribution in a form amenable to implementation by classical IR models, and then compare it to other currently used estimators. In our experiments the proposed model outperforms Bayes-smoothing, and its combination with linear interpolation smoothing outperforms all other estimators.
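The two baseline estimators named in the abstract above can be illustrated with a minimal Python sketch. It assumes that "Bayes-smoothing" refers to the usual Dirichlet-prior estimate; it does not implement the paper's full predictive distribution. The function names, the collection model `coll_prob`, and the prior weight `mu` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def laplace_estimate(term, doc_counts, doc_len, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(term | document)."""
    return (doc_counts.get(term, 0) + 1) / (doc_len + vocab_size)


def bayes_smoothed_estimate(term, doc_counts, doc_len, coll_prob, mu=1000.0):
    """Dirichlet-prior estimate of P(term | document).

    coll_prob maps each term to its probability in the whole collection;
    mu is the (assumed) weight given to that collection model.
    """
    prior = mu * coll_prob.get(term, 0.0)
    return (doc_counts.get(term, 0) + prior) / (doc_len + mu)


def query_likelihood(query_terms, doc_tokens, coll_prob, mu=1000.0):
    """Score a document by the product of smoothed query-term probabilities."""
    counts = Counter(doc_tokens)
    score = 1.0
    for term in query_terms:
        score *= bayes_smoothed_estimate(term, counts, len(doc_tokens), coll_prob, mu)
    return score


# Toy example: a four-token document and a two-term collection model.
toy_doc = ["web", "web", "graph", "web"]
toy_collection = {"web": 0.5, "graph": 0.5}
toy_score = query_likelihood(["web", "graph"], toy_doc, toy_collection, mu=4.0)
```

With a small `mu` the estimate stays close to the raw document frequencies; a larger `mu` pulls it toward the collection model.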
Beyond independent relevance: methods and evaluation metrics for subtopic retrieval (pp. 10-17)
  Cheng Xiang Zhai; William W. Cohen; John Lafferty
We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which is assumed in most traditional retrieval methods. Subtopic retrieval poses challenges for evaluating performance, as well as for developing effective algorithms. We propose a framework for evaluating subtopic retrieval which generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents. We propose and systematically evaluate several methods for performing subtopic retrieval using statistical language models and a maximal marginal relevance (MMR) ranking strategy. A mixture model combined with query likelihood relevance ranking is shown to modestly outperform a baseline relevance ranking on a data set used in the TREC interactive track.
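The abstract above names maximal marginal relevance (MMR) as a ranking strategy. As a rough illustration only, the following Python sketch shows the generic greedy MMR rule, which trades a document's relevance against its similarity to documents already selected. It is not the paper's language-model formulation; the `relevance` scores, the `similarity` function, the toy subtopic sets, and the weight `lam` are all hypothetical.

```python
def mmr_rank(candidates, relevance, similarity, lam=0.7, k=None):
    """Greedy maximal marginal relevance re-ranking.

    candidates: iterable of document ids
    relevance:  dict mapping document id -> relevance score
    similarity: function (doc_a, doc_b) -> similarity in [0, 1]
    lam:        weight on relevance; (1 - lam) penalizes redundancy
    """
    remaining = list(candidates)
    selected = []
    limit = len(remaining) if k is None else k
    while remaining and len(selected) < limit:

        def mmr_score(doc):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: documents "a" and "b" cover the same subtopic, "c" a different one.
toy_subtopics = {"a": {1}, "b": {1}, "c": {2}}
toy_relevance = {"a": 0.9, "b": 0.85, "c": 0.5}


def toy_similarity(x, y):
    """Jaccard overlap of the (hypothetical) subtopic sets."""
    sx, sy = toy_subtopics[x], toy_subtopics[y]
    return len(sx & sy) / len(sx | sy)


toy_ranking = mmr_rank(["a", "b", "c"], toy_relevance, toy_similarity, lam=0.5)
```

In the toy data, the less relevant document "c" is promoted above "b" because "b" repeats the subtopic already covered by "a". This is the dependence between ranked documents that the abstract describes.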
Empirical development of an exponential probabilistic model for text retrieval: using textual analysis to build a better model (pp. 18-25)
  Jaime Teevan; David R. Karger
Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.

Question answering

Question classification using support vector machines (pp. 26-32)
  Dell Zhang; Wee Sun Lee
Question classification is very important for question answering. This paper presents our research work on automatic question classification through machine learning approaches. We have experimented with five machine learning algorithms: Nearest Neighbors (NN), Naive Bayes (NB), Decision Tree (DT), Sparse Network of Winnows (SNoW), and Support Vector Machines (SVM) using two kinds of features: bag-of-words and bag-of-ngrams. The experimental results show that with only surface text features the SVM outperforms the other four methods for this task. Further, we propose to use a special kernel function called the tree kernel to enable the SVM to take advantage of the syntactic structures of questions. We describe how the tree kernel can be computed efficiently by dynamic programming. The performance of our approach is promising when tested on the questions from the TREC QA track.
Structured use of external knowledge for event-based open domain question answering (pp. 33-40)
  Hui Yang; Tat-Seng Chua; Shuguang Wang; Chun-Keat Koh
One of the major problems in question answering (QA) is that the queries are either too brief or often do not contain the most relevant terms in the target corpus. In order to overcome this problem, our earlier work integrates external knowledge extracted from the Web and WordNet to perform Event-based QA on the TREC-11 task. This paper extends our approach to perform event-based QA by uncovering the structure within the external knowledge. The knowledge structure loosely models different facets of QA events, and is used in conjunction with a successive constraint relaxation algorithm to achieve effective QA. Our results obtained on the TREC-11 QA corpus indicate that the new approach is more effective and is able to attain a confidence-weighted score of above 80%.
Quantitative evaluation of passage retrieval algorithms for question answering (pp. 41-47)
  Stefanie Tellex; Boris Katz; Jimmy Lin; Aaron Fernandes; Gregory Marton
Passage retrieval is an important component common to many question answering systems. Because most evaluations of question answering systems focus on end-to-end performance, comparison of common components becomes difficult. To address this shortcoming, we present a quantitative evaluation of various passage retrieval algorithms for question answering, implemented in a framework called Pauchok. We present three important findings: (1) Boolean querying schemes perform well in the question answering task; (2) the performance differences between various passage retrieval algorithms vary with the choice of document retriever, which suggests significant interactions between document retrieval and passage retrieval; and (3) the best algorithms in our evaluation employ density-based measures for scoring query terms. Our results reveal future directions for passage retrieval and question answering.


Web

Building a web thesaurus from web link structure (pp. 48-55)
  Zheng Chen; Shengping Liu; Liu Wenyin; Geguang Pu; Wei-Ying Ma
Thesauri have been widely used in many applications, including information retrieval, natural language processing, and question answering. In this paper, we propose a novel approach to automatically constructing a domain-specific thesaurus from the Web using link structure information. The proposed approach is able to identify new terms and reflect the latest relationship between terms as the Web evolves. First, a set of high quality and representative websites of a specific domain is selected. After filtering out navigational links, link analysis is applied to each website to obtain its content structure. Finally, the thesaurus is constructed by merging the content structures of the selected websites. The experimental results on automatic query expansion based on our constructed thesaurus show a 20% improvement in search precision compared to the baseline.
Implicit link analysis for small web search (pp. 56-63)
  Gui-Rong Xue; Hua-Jun Zeng; Zheng Chen; Wei-Ying Ma; Hong-Jiang Zhang; Chao-Jun Lu
Current Web search engines generally impose link analysis-based re-ranking on web-page retrieval. However, the same techniques, when applied directly to small web search such as intranet and site search, cannot achieve the same performance because the link structures of small webs differ from that of the global Web. In this paper, we propose an approach to constructing implicit links by mining users' access patterns, and then apply a modified PageRank algorithm to re-rank web-pages for small web search. Our experimental results indicate that the proposed method outperforms the content-based method by 16%, explicit link-based PageRank by 20%, and DirectHit by 14%.
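For orientation, the following Python sketch shows standard PageRank computed by power iteration over a weighted link graph. It is a generic illustration, not the modified algorithm the abstract refers to. The assumption that implicit links carry weights (for example, counts derived from access logs), the damping factor, the iteration count, and the toy graph are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    links: dict mapping a page to a dict {target page: link weight}.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, {})
            total_weight = sum(targets.values())
            if total_weight == 0:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
            else:
                # Distribute rank in proportion to link weight.
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[node] * weight / total_weight
        rank = new_rank
    return rank


# Toy site: a home page linking to two pages that both link back.
toy_links = {"home": {"a": 1, "b": 1}, "a": {"home": 1}, "b": {"home": 1}}
toy_ranks = pagerank(toy_links)
```

The resulting ranks could then be combined with a content-based score to re-rank results. How the paper combines them is not stated in the abstract.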
Query type classification for web document retrieval (pp. 64-71)
  In-Ho Kang; GilChang Kim
The heterogeneous Web exacerbates IR problems, and short user queries make them worse. The contents of web documents alone are not enough to find good answer documents. Link information and URL information compensate for the insufficiencies of content information. However, a static combination of multiple sources of evidence may lower retrieval performance. We need different strategies to find target documents according to the query type. User queries can be classified into three categories: the topic relevance task, the homepage finding task, and the service finding task. In this paper, a user query classification scheme is proposed. This scheme uses the difference of distribution, mutual information, the usage rate as anchor texts, and POS information for the classification. After classifying a user query, we apply different algorithms and information to obtain better results. For the topic relevance task, we emphasize the content information; for the homepage finding task, on the other hand, we emphasize the link information and the URL information. The best performance was obtained when our proposed classification method was used with the OKAPI scoring algorithm.

Human interaction

Stuff I've seen: a system for personal information retrieval and re-use (pp. 72-79)
  Susan Dumais; Edward Cutrell; JJ Cadiz; Gavin Jancke; Raman Sarin; Daniel C. Robbins
Most information retrieval technologies are designed to facilitate information discovery. However, much knowledge work involves finding and re-using previously seen information. We describe the design and evaluation of a system, called Stuff I've Seen (SIS), that facilitates information re-use. This is accomplished in two ways. First, the system provides a unified index of information that a person has seen, whether it was seen as email, web page, document, appointment, etc. Second, because the information has been seen before, rich contextual cues can be used in the search interface. The system has been used internally by more than 230 employees. We report on both qualitative and quantitative aspects of system use. Initial findings show that time and people are important retrieval cues. Users find information more easily using SIS, and use other search tools less frequently after installation.
Search strategies in content-based image retrieval (pp. 80-87)
  Sharon McDonald; John Tait
This paper describes two studies that looked at users' ability to formulate visual queries with a Content-Based Image Retrieval system that uses dominant image colour as the primary indexing key. The first experiment examined users' performance with two visual search tools, a sketch tool and a structured browsing tool, with different types of image query. The results showed that while users were able to successfully search on the basis of colour, and were able to formulate visual queries, their ability to do so was affected by search task type. Search task type was also shown to be related to search tool choice. However, the results of study two showed that while users were able to complete all of the tasks, there was evidence to suggest that a degree of compromise was present in the users' choice of image that was largely due to problems relating to query formulation.
Using terminological feedback for web search refinement: a log-based study (pp. 88-95)
  Peter Anick
Although interactive query reformulation has been actively studied in the laboratory, little is known about the actual behavior of web searchers who are offered terminological feedback along with their search results. We analyze log sessions for two groups of users interacting with variants of the AltaVista search engine -- a baseline group given no terminological feedback and a feedback group to whom twelve refinement terms are offered along with the search results. We examine uptake, refinement effectiveness, conditions of use, and refinement type preferences. Although our measure of overall session "success" shows no difference between outcomes for the two groups, we find evidence that a subset of those users presented with terminological feedback do make effective use of it on a continuing basis.

Text categorization

A scalability analysis of classifiers in text categorization (pp. 96-103)
  Yiming Yang; Jian Zhang; Bryan Kisiel
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least square fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation on the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported on, as concrete examples.
A repetition based measure for verification of text collections and for text categorization (pp. 104-110)
  Dmitry V. Khmelev; William J. Teahan
We suggest a way for locating duplicates and plagiarisms in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. Another reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization, and interestingly, that results correlate strongly with compression-based methods.
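The R-measure described in the abstract above can be sketched naively in Python. For every suffix of a text, the sketch finds the longest prefix of that suffix that also occurs in some other document, sums these lengths, and normalizes. The normalization by n(n+1)/2 (so that a fully duplicated text scores 1) is an assumption made here, and the quadratic substring search stands in for the suffix array the abstract mentions. The function names are hypothetical.

```python
def longest_repeated_prefix(suffix, others):
    """Length of the longest prefix of `suffix` found in any other document."""
    best = 0
    for other in others:
        length = 0
        # Extend the prefix one character at a time while it still occurs.
        while length < len(suffix) and suffix[:length + 1] in other:
            length += 1
        best = max(best, length)
    return best


def r_measure(text, others):
    """Normalized sum of repeated-suffix lengths (assumed normalization).

    Returns a value in [0, 1]: 0 if no character of `text` occurs in the
    other documents, 1 if `text` occurs verbatim in one of them.
    """
    n = len(text)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], others) for i in range(n))
    return 2.0 * total / (n * (n + 1))


# Toy usage: an exact duplicate, an unrelated text, and a partial overlap.
duplicate_score = r_measure("abcd", ["xxabcdxx"])
unrelated_score = r_measure("abcd", ["zzzz"])
partial_score = r_measure("abxy", ["ab"])
```

A high score flags a document whose content is largely repeated elsewhere in the collection. This naive version is only practical for short texts.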
Using asymmetric distributions to improve text classifier probability estimates (pp. 111-118)
  Paul N. Bennett
Text classifiers that give probability estimates are more readily applicable in a variety of scenarios. For example, rather than choosing one set decision threshold, they can be used in a Bayesian risk model to issue a run-time decision which minimizes a user-specified cost function dynamically chosen at prediction time. However, the quality of the probability estimates is crucial. We review a variety of standard approaches to converting scores (and poor probability estimates) from text classifiers to high quality estimates and introduce new models motivated by the intuition that the empirical score distributions for the "extremely irrelevant", "hard to discriminate", and "obviously relevant" items are often significantly different. Finally, we analyze the experimental performance of these models over the outputs of two text classifiers. The analysis demonstrates that one of these models is theoretically attractive (introducing few new parameters while increasing flexibility), computationally efficient, and empirically preferable.

Multimedia information retrieval

Automatic image annotation and retrieval using cross-media relevance models (pp. 119-126)
  J. Jeon; V. Lavrenko; R. Manmatha
Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor intensive procedure and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs. Blobs are generated from image features using clustering. Given a training set of images with annotations, we show that probabilistic models allow us to predict the probability of generating a word given the blobs in an image. This may be used to automatically annotate and retrieve images given a word as a query. We show that relevance models allow us to derive these probabilities in a natural way. Experiments show that the annotation performance of this cross-media relevance model is almost six times as good (in terms of mean precision) as a model based on word-blob co-occurrence and twice as good as a state of the art model derived from machine translation. Our approach shows the usefulness of using formal information retrieval models for the task of image annotation and retrieval.
Modeling annotated data (pp. 127-134)
  David M. Blei; Michael I. Jordan
We consider the problem of modeling annotated data -- data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). We describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. We conduct experiments on the Corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval.
Experimental result analysis for a generative probabilistic image retrieval model (pp. 135-142)
  Thijs Westerveld; Arjen P. de Vries
The main conclusion from the metrics-based evaluation of video retrieval systems at TREC's video track is that non-interactive image retrieval from general collections using visual information only is not yet feasible. We show how a detailed analysis of retrieval results -- looking beyond mean average precision (MAP) scores on topical relevance -- gives significant insight into the main problems with the visual part of the retrieval model under study. Such an analytical approach proves an important addition to standard evaluation measures.

Structured documents

Combining document representations for known-item search (pp. 143-150)
  Paul Ogilvie; Jamie Callan
This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To investigate these hypotheses, we present a mixture-based language model and also examine many of the current meta-search algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low performing document representations can improve performance, but not consistently. We find that the techniques best suited for this task are robust to the inclusion of poorly performing document representations. We also explore the role of variance of results across systems and its impact on the performance of fusion, with the surprising result that the correct documents have higher variance across document representations than highly ranking incorrect documents.
Searching XML documents via XML fragments (pp. 151-158)
  David Carmel; Yoelle S. Maarek; Matan Mandelbrod; Yosi Mass; Aya Soffer
Most of the work on XML query and search has stemmed from the publishing and database communities, mostly for the needs of business applications. Recently, the Information Retrieval community began investigating the XML search issue to answer information discovery needs. Following this trend, we present here an approach where information needs can be expressed in an approximate manner as pieces of XML documents or "XML fragments" of the same nature as the documents that are being searched. We present an extension of the vector space model for searching XML collections via XML fragments and ranking results by relevance. We describe how we have extended a full-text search engine to comply with this model. The value of the proposed method is demonstrated by the relatively high precision of our system, which was among the top performers in the recent INEX workshop. Our results indicate that certain queries are more appropriate than others for the extended vector space model. Specifically, queries with relatively specific contexts but vague information needs are best situated to reap the benefit of this model. Finally, our results show that one method may not fit all types of queries and that it could be worthwhile to use different solutions for different applications.

Text representation

Word sense disambiguation in information retrieval revisited (pp. 159-166)
  Christopher Stokoe; Michael P. Oakes; John Tait
Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguation, the question remains as to whether less than 90% accurate automated word sense disambiguation can lead to improvements in retrieval effectiveness. In this study we explore the development and subsequent evaluation of a statistical word sense disambiguation system which demonstrates increased precision from a sense based vector space retrieval model over traditional TF*IDF techniques.
Probabilistic term variant generator for biomedical terms (pp. 167-173)
  Yoshimasa Tsuruoka; Jun'ichi Tsujii
This paper presents an algorithm to generate possible variants for biomedical terms. The algorithm gives each variant its generation probability representing its plausibility, which is potentially useful for query and dictionary expansions. The probabilistic rules for generating variants are automatically learned from raw texts using an existing abbreviation extraction technique. Our method, therefore, requires no linguistic knowledge or labor-intensive natural language resource. We conducted an experiment using 83,142 MEDLINE abstracts for rule induction and 18,930 abstracts for testing. The results indicate that our method will significantly increase the number of retrieved documents for long biomedical terms.

Text categorization

A maximal figure-of-merit learning approach to text categorization (pp. 174-181)
  Sheng Gao; Wen Wu; Chin-Hui Lee; Tat-Seng Chua
A novel maximal figure-of-merit (MFoM) learning approach to text categorization is proposed. Different from the conventional techniques, the proposed MFoM method attempts to integrate any performance metric of interest (e.g. accuracy, recall, precision, or F1 measure) into the design of any classifier. The corresponding classifier parameters are learned by optimizing an overall objective function of interest. To solve this highly nonlinear optimization problem, we use a generalized probabilistic descent algorithm. The MFoM learning framework is evaluated on the Reuters-21578 task with LSI-based feature extraction and a binary tree classifier. Experimental results indicate that the MFoM classifier gives improved F1 and enhanced robustness over the conventional one. It also outperforms the popular SVM method in micro-averaging F1. Other extensions to design discriminative multiple-category MFoM classifiers for application scenarios with new performance metrics could be envisioned too.
Text categorization by boosting automatically extracted concepts (pp. 182-189)
  Lijuan Cai; Thomas Hofmann
Term-based representations of documents have found wide-spread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage.
   In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The utilized concepts are automatically extracted from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.
Robustness of regularized linear classification methods in text categorization (pp. 190-197)
  Jian Zhang; Yiming Yang
Real-world applications often require the classification of documents in situations with a small number of features, mislabeled documents, and rare positive examples. This paper investigates the robustness of three regularized linear classification methods (SVM, ridge regression and logistic regression) under these conditions. We compare these methods in terms of their loss functions and score distributions, and establish the connection between their optimization problems and generalization error bounds. Several sets of controlled experiments on the Reuters-21578 corpus are conducted to investigate the robustness of these methods. Our results show that ridge regression seems to be the most promising candidate for rare class problems.

Human interaction

Building and applying a concept hierarchy representation of a user profile (pp. 198-204)
  Nikolaos Nanas; Victoria Uren; Anne De Roeck
Term dependence is a natural consequence of language use. Its successful representation has been a long standing goal for Information Retrieval research. We present a methodology for the construction of a concept hierarchy that takes into account the three basic dimensions of term dependence. We also introduce a document evaluation function that allows the use of the concept hierarchy as a user profile for Information Filtering. Initial experimental results indicate that this is a promising approach for incorporating term dependence in the way documents are filtered.
Query length in interactive information retrieval (pp. 205-212)
  N. J. Belkin; D. Kelly; G. Kim; J.-Y. Kim; H.-J. Lee; G. Muresan; M.-C. Tang; X.-J. Yuan; C. Cool
Query length in best-match information retrieval (IR) systems is well known to be positively related to effectiveness in the IR task, when measured in experimental, non-interactive environments. However, in operational, interactive IR systems, query length is quite typically very short, on the order of two to three words. We report on a study which tested the effectiveness of a particular query elicitation technique in increasing initial searcher query length, and which tested the effectiveness of queries elicited using this technique, and the relationship in general between query length and search effectiveness in interactive IR. Results show that the specific technique results in longer queries than a standard query elicitation technique, that this technique is indeed usable, that the technique results in increased user satisfaction with the search, and that query length is positively correlated with user satisfaction with the search.
Re-examining the potential effectiveness of interactive query expansion (pp. 213-220)
  Ian Ruthven
Much attention has been paid to the relative effectiveness of interactive query expansion versus automatic query expansion. Although interactive query expansion has the potential to be an effective means of improving a search, in this paper we show that, on average, human searchers are less likely than systems to make good expansion decisions. To enable good expansion decisions, searchers must have adequate instructions on how to use interactive query expansion functionalities. We show that simple instructions on using interactive query expansion do not necessarily help searchers make good expansion decisions and discuss difficulties found in making query expansion decisions.

IR theory

Latent concepts and the number orthogonal factors in latent semantic analysis BIBAFull-Text 221-226
  Georges Dupret
We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword. This method is demonstrated empirically by duplicating all documents containing a term t, and inserting new documents in the database that replace t with t'. By examining the number of times term t is identified for a search on term t' (precision) using differing ranges of dimensions, we find that lower ranked dimensions identify related terms and higher-ranked dimensions discriminate between the synonyms.
A frequency-based and a poisson-based definition of the probability of being informative BIBAFull-Text 227-234
  Thomas Roelleke
This paper reports on theoretical investigations about the assumptions underlying the inverse document frequency (idf). We show that an intuitive idf-based probability function for the probability of a term being informative assumes disjoint document events. By assuming documents to be independent rather than disjoint, we arrive at a Poisson-based probability of being informative. The framework is useful for understanding and deciding the parameter estimation and combination in probabilistic retrieval models.
Table extraction using conditional random fields BIBAFull-Text 235-242
  David Pinto; Andrew McCallum; Xing Wei; W. Bruce Croft
The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form.
   Their rich combination of formatting and content presents difficulties for traditional language modeling techniques, however. This paper presents the use of conditional random fields (CRFs) for table extraction, and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout and language features, and as a result, they perform significantly better. We show experimental results on plain-text government statistical reports in which tables are located with 92% F1, and their constituent lines are classified into 12 table-related categories with 94% accuracy. We also discuss future work on undirected graphical models for segmenting columns, finding cells, and classifying them as data cells or label cells.

Filtering and retrieval models

Building a filtering test collection for TREC 2002 BIBAFull-Text 243-250
  Ian Soboroff; Stephen Robertson
Test collections for the filtering track in TREC have typically used either past sets of relevance judgments, or categorized collections such as Reuters Corpus Volume 1 or OHSUMED, because filtering systems need relevance judgments during the experiment for training and adaptation. For TREC 2002, we constructed an entirely new set of search topics for the Reuters Corpus for measuring filtering systems. Our method for building the topics involved multiple iterations of feedback from assessors, and fusion of results from multiple search systems using different search algorithms. We also developed a second set of "inexpensive" topics based on categories in the document collection. We found that the initial judgments made for the experiment were sufficient; subsequent pooled judging changed system rankings very little. We also found that systems performed very differently on the category topics than on the assessor-built topics.
An empirical study on retrieval models for different document genres: patents and newspaper articles BIBAFull-Text 251-258
  Makoto Iwayama; Atsushi Fujii; Noriko Kando; Yuzo Marukawa
Reflecting the rapid growth in the utilization of large test collections for information retrieval since the 1990s, extensive comparative experiments have been performed to explore the effectiveness of various retrieval models. However, most collections were intended for retrieving newspaper articles and technical abstracts. In this paper, we describe the process of producing a test collection for patent retrieval, the NTCIR-3 Patent Retrieval Collection, which includes two years of Japanese patent applications and 31 topics produced by professional patent searchers. We also report experimental results obtained by using this collection to re-examine the effectiveness of existing retrieval models in the context of patent retrieval. The relative superiority among existing retrieval models did not significantly differ depending on the document genre, that is, patents and newspaper articles. Issues related to patent retrieval are also discussed.
Collaborative filtering via gaussian probabilistic latent semantic analysis BIBAFull-Text 259-266
  Thomas Hofmann
Collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. In this paper, we describe a new model-based algorithm designed for this task, which is based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. More specifically, we assume that the observed user ratings can be modeled as a mixture of user communities or interest groups, where users may participate probabilistically in one or more groups. Each community is characterized by a Gaussian distribution on the normalized ratings for each item. The normalization of ratings is performed in a user-specific manner to account for variations in absolute shift and variance of ratings. Experiments on the EachMovie data set show that the proposed approach compares favorably with other collaborative filtering techniques.

Clustering

Document clustering based on non-negative matrix factorization BIBAFull-Text 267-273
  Wei Xu; Xin Liu; Yihong Gong
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
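The cluster assignment rule described in this abstract can be illustrated with a minimal sketch. It assumes the standard multiplicative-update form of NMF and a unit-length rescaling of the basis vectors; the function names `nmf` and `cluster_labels`, the iteration count, and the toy matrix are all hypothetical and are not taken from the paper.

```python
import random


def nmf(X, k, iters=300, seed=0, eps=1e-9):
    """Factor a non-negative terms-by-documents matrix X as U * V^T."""
    rng = random.Random(seed)
    n_terms, n_docs = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_docs)]
    for _ in range(iters):
        # Multiplicative update U <- U * (X V) / (U V^T V).
        VtV = [[sum(v[a] * v[b] for v in V) for b in range(k)] for a in range(k)]
        for i in range(n_terms):
            XV = [sum(X[i][j] * V[j][a] for j in range(n_docs)) for a in range(k)]
            UVtV = [sum(U[i][b] * VtV[b][a] for b in range(k)) for a in range(k)]
            U[i] = [U[i][a] * XV[a] / (UVtV[a] + eps) for a in range(k)]
        # Multiplicative update V <- V * (X^T U) / (V U^T U).
        UtU = [[sum(u[a] * u[b] for u in U) for b in range(k)] for a in range(k)]
        for j in range(n_docs):
            XtU = [sum(X[i][j] * U[i][a] for i in range(n_terms)) for a in range(k)]
            VUtU = [sum(V[j][b] * UtU[b][a] for b in range(k)) for a in range(k)]
            V[j] = [V[j][a] * XtU[a] / (VUtU[a] + eps) for a in range(k)]
    # Rescale so each basis vector (column of U) has unit length (assumed convention).
    for a in range(k):
        norm = sum(U[i][a] ** 2 for i in range(n_terms)) ** 0.5 or 1.0
        for i in range(n_terms):
            U[i][a] /= norm
        for j in range(n_docs):
            V[j][a] *= norm
    return U, V


def cluster_labels(V):
    """Assign each document to the axis on which its coordinate is largest."""
    return [max(range(len(row)), key=row.__getitem__) for row in V]


# Toy term-document matrix: documents 0-2 share one vocabulary, 3-5 another.
X = [
    [2, 4, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 2],
    [0, 0, 0, 3, 3, 6],
]
U, V = nmf(X, k=2)
labels = cluster_labels(V)
```

Because every entry of V is non-negative, each document is an additive mixture of base topics, and taking the largest coordinate is enough to read off a cluster without a separate clustering step.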
ReCoM: reinforcement clustering of multi-type interrelated data objects BIBAFull-Text 274-281
  Jidong Wang; Huajun Zeng; Zheng Chen; Hongjun Lu; Li Tao; Wei-Ying Ma
Most existing clustering algorithms cluster highly related data objects such as Web pages and Web users separately. The interrelation among different types of data objects is either not considered, or represented by a static feature space and treated in the same ways as other attributes of the objects. In this paper, we propose a novel clustering approach for clustering multi-type interrelated data objects, ReCoM (Reinforcement Clustering of Multi-type Interrelated data objects). Under this approach, relationships among data objects are used to improve the cluster quality of interrelated data objects through an iterative reinforcement clustering process. At the same time, the link structure derived from relationships of the interrelated data objects is used to differentiate the importance of objects and the learned importance is also used in the clustering process to further improve the clustering results. Experimental results show that the proposed approach not only effectively overcomes the problem of data sparseness caused by the high dimensional relationship space but also significantly improves the clustering accuracy.
A comparative study on content-based music genre classification BIBAFull-Text 282-289
  Tao Li; Mitsunori Ogihara; Qi Li
Content-based music genre classification is a fundamental component of music information retrieval systems and has been gaining importance and enjoying a growing amount of attention with the emergence of digital music on the Internet. Currently little work has been done on automatic music genre classification, and in addition, the reported classification accuracies are relatively low. This paper proposes a new feature extraction method for music genre classification, DWCHs. DWCHs stands for Daubechies Wavelet Coefficient Histograms. DWCHs capture the local and global information of music signals simultaneously by computing histograms on their Daubechies wavelet coefficients. The effectiveness of this new feature and of previously studied features is compared using various machine learning classification algorithms, including Support Vector Machines and Linear Discriminant Analysis. It is demonstrated that the use of DWCHs significantly improves the accuracy of music genre classification.

Distributed information retrieval

Evaluating different methods of estimating retrieval quality for resource selection BIBAFull-Text 290-297
  Henrik Nottelmann; Norbert Fuhr
In a federated digital library system, it is too expensive to query every accessible library. Resource selection is the task of deciding to which libraries a query should be routed. Most existing resource selection algorithms compute a library ranking in a heuristic way. In contrast, the decision-theoretic framework (DTF) follows a different approach on a better theoretic foundation: It computes a selection which minimises the overall costs (e.g. retrieval quality, time, money) of the distributed retrieval. For estimating retrieval quality the recall-precision function is proposed. In this paper, we introduce two new methods: The first one computes the empirical distribution of the probabilities of relevance from a small library sample, and assumes it to be representative for the whole library. The second method assumes that the indexing weights follow a normal distribution, leading to a normal distribution for the document scores. Furthermore, we present the first evaluation of DTF by comparing this theoretical approach with the heuristic state-of-the-art system CORI; here we find that DTF outperforms CORI in most cases.
Relevant document distribution estimation method for resource selection BIBAFull-Text 298-305
  Luo Si; Jamie Callan
Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases. A new resource selection algorithm is proposed that uses information about database sizes as well as database contents. We also show how to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Experiments demonstrate that the database size estimates are more accurate for large databases than estimates produced by a competing method; the new resource ranking algorithm is always at least as effective as the CORI algorithm; and the new algorithm results in better document rankings than the CORI algorithm.
SETS: search enhanced by topic segmentation BIBAFull-Text 306-313
  Mayank Bawa; Gurmeet Singh Manku; Prabhakar Raghavan
We present SETS, an architecture for efficient search in peer-to-peer networks, building upon ideas drawn from machine learning and social network theory. The key idea is to arrange participating sites in a topic-segmented overlay topology in which most connections are short-distance, connecting pairs of sites with similar content. Topically focused sets of sites are then joined together into a single network by long-distance links. Queries are matched and routed to only the topically closest regions. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.

Novelty and topic change

Retrieval and novelty detection at the sentence level BIBAFull-Text 314-321
  James Allan; Courtney Wade; Alvaro Bolivar
Previous research in novelty detection has focused on the task of finding novel material, given a set or stream of documents on a certain topic. This study investigates the more difficult two-part task defined by the TREC 2002 novelty track: given a topic and a group of documents relevant to that topic, 1) find the relevant sentences from the documents, and 2) find the novel sentences from the collection of relevant sentences. Our research shows that the former step appears to be the more difficult part of this task, and that the performance of novelty measures is very sensitive to the presence of non-relevant sentences.
Domain-independent text segmentation using anisotropic diffusion and dynamic programming BIBAFull-Text 322-329
  Xiang Ji; Hongyuan Zha
This paper presents a novel domain-independent text segmentation method, which identifies the boundaries of topic changes in long text documents and/or text streams. The method consists of three components: As a preprocessing step, we eliminate the document-dependent stop words as well as the generic stop words before the sentence similarity is computed. This step assists in the discrimination of the sentence semantic information. Then the cohesion information of sentences in a document or a text stream is captured with a sentence-distance matrix with each entry corresponding to the similarity between a sentence pair. The distance matrix can be represented with a gray-scale image. Thus, a text segmentation problem is converted into an image segmentation problem. We apply the anisotropic diffusion technique to the image representation of the distance matrix to enhance the semantic cohesion of sentence topical groups as well as sharpen topical boundaries. Finally, the dynamic programming technique is adapted to find the optimal topical boundaries and provide a zoom-in and zoom-out mechanism for topic access by segmenting text in variable numbers of sentence topical groups. Our approach involves no domain-specific training, and it can be applied to texts in a variety of domains. The experimental results show that our approach is effective in text segmentation and outperforms several state-of-the-art methods.
A system for new event detection BIBAFull-Text 330-337
  Thorsten Brants; Francine Chen
We present a new method and system for performing the New Event Detection task, i.e., in one or multiple streams of news stories, all stories on a previously unseen (new) event are marked. The method is based on an incremental TF-IDF model. Our extensions include: generation of source-specific models, similarity score normalization based on document-specific averages, similarity score normalization based on source-pair specific averages, term reweighting based on inverse event frequencies, and segmentation of the documents. We also report on extensions that did not improve results. The system performs very well on TDT3 and TDT4 test data and scored second in the TDT-2002 evaluation.
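As a rough illustration of the incremental TF-IDF idea in this abstract, the sketch below updates document frequencies as each story arrives and flags a story as new when it is not similar enough to any earlier story. The class name `NewEventDetector`, the smoothing in the IDF term, the cosine measure, and the 0.2 threshold are assumptions made for this example; none of the extensions listed in the abstract are implemented.

```python
import math
from collections import Counter


class NewEventDetector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold  # hypothetical cutoff, not from the paper
        self.df = Counter()         # document frequencies, updated per story
        self.n_docs = 0
        self.history = []           # raw term counts of earlier stories

    def _weights(self, tf):
        # TF-IDF using the statistics seen so far (assumed smoothing).
        return {t: f * math.log((self.n_docs + 1) / (self.df[t] + 0.5))
                for t, f in tf.items()}

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def is_new(self, text):
        tf = Counter(text.lower().split())
        # Incremental step: fold this story into the model before scoring.
        self.n_docs += 1
        self.df.update(tf.keys())
        current = self._weights(tf)
        best = max((self._cosine(current, self._weights(old))
                    for old in self.history), default=0.0)
        self.history.append(tf)
        return best < self.threshold


detector = NewEventDetector()
stories = [
    "earthquake hits city coast",
    "earthquake hits coast city again",
    "parliament passes budget vote",
]
flags = [detector.is_new(s) for s in stories]
```

Re-weighting earlier stories with the current statistics keeps both vectors in a comparison on the same scale, at the cost of recomputing weights for every past story.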

Cross-lingual information retrieval

Probabilistic structured query methods BIBAFull-Text 338-344
  Kareem Darwish; Douglas W. Oard
Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute a weight for each query term. This paper reviews prior work on structured query techniques and introduces three new variants that leverage estimates of replacement probabilities. Statistically significant improvements in retrieval effectiveness are demonstrated for cross-language retrieval and for retrieval based on optical character recognition when replacement probabilities are used to estimate both term frequency and document frequency.
Fuzzy translation of cross-lingual spelling variants BIBAFull-Text 345-352
  Ari Pirkola; Jarmo Toivonen; Heikki Keskustalo; Kari Visala; Kalervo Jarvelin
We will present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first stage, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second stage, the intermediate forms obtained in the first stage are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The target word list contained 189 000 English words with the correct equivalents for the source words among them. The source words were translated using the two-step fuzzy translation technique, and the results were compared with those of plain fuzzy matching based translation. The combined technique performed better, sometimes considerably better, than fuzzy matching alone.
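The two stages described in this abstract can be sketched as follows. The transformation rule, the word list, and the function name `two_step_translate` are hypothetical, and the standard-library `difflib.get_close_matches` is used here only as a stand-in for the fuzzy matching step; the abstract does not say which matching method the authors used.

```python
from difflib import get_close_matches


def two_step_translate(source_word, rules, target_words, n=3, cutoff=0.6):
    # Stage 1: apply transformation rules to obtain an intermediate form.
    intermediate = source_word
    for old, new in rules:
        intermediate = intermediate.replace(old, new)
    # Stage 2: fuzzy-match the intermediate form against the target word list.
    return intermediate, get_close_matches(intermediate, target_words, n=n, cutoff=cutoff)


# Hypothetical hand-written rule; the paper generates rules automatically.
rules = [("cion", "tion")]
target_words = ["organization", "organism", "nation"]
intermediate, candidates = two_step_translate("organizacion", rules, target_words)
```

Plain fuzzy matching corresponds to calling the same function with an empty rule list, which gives a simple baseline for comparison.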
Automatic transliteration for Japanese-to-English text retrieval BIBAFull-Text 353-360
  Yan Qu; Gregory Grefenstette; David A. Evans
For cross-language information retrieval (CLIR) based on bilingual translation dictionaries, good performance depends upon lexical coverage in the dictionary. This is especially true for languages possessing few inter-language cognates, such as between Japanese and English. In this paper, we describe a method for automatically creating and validating candidate Japanese transliterated terms of English words. A phonetic English dictionary and a set of probabilistic mapping rules are used for automatically generating transliteration candidates. A monolingual Japanese corpus is then used for automatically validating the transliterated terms. We evaluate the usage of the extracted English-Japanese transliteration pairs with Japanese to English retrieval experiments over the CLEF bilingual test collections. The use of our automatically derived extension to a bilingual translation dictionary improves average precision, both before and after pseudo-relevance feedback, with gains ranging from 2.5% to 64.8%.

Posters

On the effectiveness of evaluating retrieval systems in the absence of relevance judgments BIBAFull-Text 361-362
  Javed A. Aslam; Robert Savell
Soboroff, Nicholas and Cahan recently proposed a method for evaluating the performance of retrieval systems without relevance judgments. They demonstrated that the system evaluations produced by their methodology are correlated with actual evaluations using relevance judgments in the TREC competition. In this work, we propose an explanation for this phenomenon. We devise a simple measure for quantifying the similarity of retrieval systems by assessing the similarity of their retrieved results. Then, given a collection of retrieval systems and their retrieved results, we use this measure to assess the average similarity of a system to the other systems in the collection. We demonstrate that evaluating retrieval systems according to average similarity yields results quite similar to the methodology proposed by Soboroff et al., and we further demonstrate that these two techniques are in fact highly correlated. Thus, the techniques are effectively evaluating and ranking retrieval systems by "popularity" as opposed to "performance."
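A minimal sketch of ranking systems by average similarity, assuming set overlap (Jaccard) of the retrieved documents as the similarity measure. The abstract does not specify the measure, so this choice, the function names, and the toy runs are illustrative only.

```python
def jaccard(a, b):
    """Overlap of two retrieved sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_average_similarity(runs):
    """Score each system by its mean similarity to every other system."""
    scores = {}
    for name, docs in runs.items():
        sims = [jaccard(docs, other_docs)
                for other, other_docs in runs.items() if other != name]
        scores[name] = sum(sims) / len(sims) if sims else 0.0
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical runs: A and B retrieve mostly the same documents, C does not.
runs = {"A": [1, 2, 3, 4], "B": [1, 2, 3, 5], "C": [7, 8, 9, 1]}
ranking, scores = rank_by_average_similarity(runs)
```

System C ranks last only because it agrees least with the others, whatever its true effectiveness, which is the "popularity" effect the abstract points to.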
Resource selection and data fusion in multimedia distributed digital libraries BIBFull-Text 363-364
  Jamie Callan; Fabio Crestani; Henrik Nottelmann; Pietro Pala; Xiao Mang Shou
Transliteration of proper names in cross-language applications BIBFull-Text 365-366
  Paola Virga; Sanjeev Khudanpur
Toward a unification of text and link analysis BIBAFull-Text 367-368
  Brian D. Davison
This paper presents a simple yet profound idea. By thinking about the relationships between and within terms and documents, we can generate a richer representation that encompasses aspects of Web link analysis as well as text analysis techniques from information retrieval. This paper shows one path to this unified representation, and demonstrates the use of eigenvector calculations from Web link analysis by stepping through a simple example.
Investigating the relationship between language model perplexity and IR precision-recall measures BIBAFull-Text 369-370
  Leif Azzopardi; Mark Girolami; Keith van Rijsbergen
An empirical study has been conducted investigating the relationship between the performance of an aspect based language model in terms of perplexity and the corresponding information retrieval performance obtained. It is observed, on the corpora considered, that the perplexity of the language model has a systematic relationship with the achievable precision recall performance though it is not statistically significant.
Topic distillation using hierarchy concept tree BIBAFull-Text 371-372
  Ikkyu Choi; Minkoo Kim
In this paper, we propose a new approach for topic distillation on the World Wide Web. Topic distillation is the task of finding quality documents related to the user query topic. Our approach is based on Bharat's topic distillation algorithm [1]. We present an analysis of hyperlink graph structure using a hierarchy concept tree to solve the mixed hubs problem, which also remains in Bharat's algorithm. To assign better weights to the hyperlinks in a document that point to relevant documents, we use content analysis to find the relationship between documents connected by hyperlinks and assign weights to hyperlinks based on that relationship. We evaluated this algorithm using 50 topics on the WT10g corpus and obtained improved results.
Using manually-built web directories for automatic evaluation of known-item retrieval BIBAFull-Text 373-374
  Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David Grossman; Ophir Frieder
Information retrieval system evaluation is complicated by the need for manually assessed relevance judgments. Large manually-built directories on the web open the door to new evaluation procedures. By assuming that web pages are the known relevant items for queries that exactly match their title, we use the ODP (Open Directory Project) and Looksmart directories for system evaluation. We test our approach with a sample from a log of ten million web queries and show that such an evaluation is unbiased in terms of the directory used, stable with respect to the query set selected, and correlated with a reasonably large manual evaluation.
Popular music retrieval by detecting mood BIBFull-Text 375-376
  Yazhong Feng; Yueting Zhuang; Yunhe Pan
Exploiting query history for document ranking in interactive information retrieval BIBAFull-Text 377-378
  Xuehua Shen; Cheng Xiang Zhai
In this poster, we incorporate user query history, as context information, to improve the retrieval performance in interactive retrieval. Experiments using the TREC data show that incorporating such context information indeed consistently improves the retrieval performance in both average precision and precision at 20 documents.
Automatic ranking of retrieval systems in imperfect environments BIBAFull-Text 379-380
  Rabia Nuray; Fazli Can
The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by human assessors for each query. Previous experiments show that differences in human relevance assessments do not affect the relative performance of retrieval systems. Based on this observation, we propose and evaluate a new approach to replace the human relevance judgments by an automatic method. Ranking of retrieval systems with our methodology correlates positively and significantly with that of human-based evaluations. In the experiments, we assume a Web-like imperfect environment: the indexing information for all documents is available for ranking, but some documents may not be available for retrieval. Such conditions can be due to document deletions or network problems. Our method of simulating imperfect environments can be used for Web search engine assessment and in estimating the effects of network conditions (e.g., network unreliability) on IR system performance.
An investigation of broad coverage automatic pronoun resolution for information retrieval BIBAFull-Text 381-382
  Richard J. Edens; Helen L. Gaylard; Gareth J. F. Jones; Adenike M. Lam-Adesina
Term weighting methods have been shown to give significant increases in information retrieval performance. The presence of pronominal references in documents reduces the term frequencies of associated words with a consequent effect on term weights and information retrieval behaviour. This investigation explores the impact on information retrieval performance of broad coverage automatic pronoun resolution. Results indicate that this approach has potential to improve both precision at fixed cutoff levels and average precision.
Syntactic features in question answering BIBAFull-Text 383-384
  Xiaoyan Li
Syntactic information potentially plays a much more important role in question answering than it does in information retrieval. Although many people have used syntactic evidence in Question Answering, there haven't been many detailed experiments reported in the literature. The aim of the experiment described in this paper is to study the impact of a particular approach for using syntactic information on question answering effectiveness. Our results indicate that a combination of syntactic information with heuristics for ranking potential answers can perform better than the ranking heuristics on their own.
Searchers' criteria for assessing web pages BIBAFull-Text 385-386
  Anastasios Tombros; Ian Ruthven; Joemon M. Jose
We investigate the criteria used by online searchers when assessing the relevance of web pages to information-seeking tasks. Twenty four searchers were given three tasks each, and indicated the features of web pages which they employed when deciding about the usefulness of the pages. These tasks were presented within the context of a simulated work-task situation. The results of this study provide a set of criteria used by searchers to decide about the utility of web pages. Such criteria have implications for the design of systems that use or recommend web pages, as well as to authors of web pages.
When query expansion fails BIBAFull-Text 387-388
  Bodo Billerbeck; Justin Zobel
The effectiveness of queries in information retrieval can be improved through query expansion. This technique automatically introduces additional query terms that are statistically likely to match documents on the intended topic. However, query expansion techniques rely on fixed parameters. Our investigation of the effect of varying these parameters shows that the strategy of using fixed values is questionable.
Music modeling with random fields BIBFull-Text 389-390
  Victor Lavrenko; Jeremy Pickens
Fractal summarization: summarization based on fractal theory BIBAFull-Text 391-392
  Christopher C. Yang; Fu Lee Wang
In this paper, we introduce the fractal summarization model based on the fractal theory. In fractal summarization, the important information is captured from the source text by exploring the hierarchical structure and salient features of the document. A condensed version of the document that is informatively close to the original is produced iteratively using the contractive transformation in the fractal theory. User evaluation has shown that fractal summarization outperforms traditional summarization.
A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm BIBAFull-Text 393-394
  Javed A. Aslam; Virgiliu Pavlu; Robert Savell
We present a unified framework for simultaneously solving both the pooling problem (the construction of efficient document pools for the evaluation of retrieval systems) and metasearch (the fusion of ranked lists returned by retrieval systems in order to increase performance). The implementation is based on the Hedge algorithm for online learning, which has the advantage of convergence to bounded error rates approaching the performance of the best linear combination of the underlying systems. The choice of a loss function closely related to the average precision measure of system performance ensures that the judged document set performs well, both in constructing a metasearch list and as a pool for the accurate evaluation of retrieval systems. Our experimental results on TREC data demonstrate excellent performance in all measures -- evaluation of systems, retrieval of relevant documents, and generation of metasearch lists.
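The multiplicative weight update at the core of the Hedge algorithm can be sketched as below. Each underlying system is treated as an expert whose weight shrinks by a factor of beta raised to its loss. The loss values, the value of beta, and the function name `hedge_update` are placeholders; the average-precision-related loss function mentioned in the abstract is not reproduced here.

```python
def hedge_update(weights, losses, beta=0.5):
    """One Hedge round: w_i <- w_i * beta ** loss_i, then renormalize."""
    updated = [w * beta ** loss for w, loss in zip(weights, losses)]
    total = sum(updated)
    return [w / total for w in updated]


# Three hypothetical retrieval systems with uniform initial weights.
weights = [1 / 3, 1 / 3, 1 / 3]
# Placeholder per-round losses in [0, 1], one value per system.
rounds = [[0.0, 1.0, 0.5], [0.0, 1.0, 0.5]]
for losses in rounds:
    weights = hedge_update(weights, losses)
```

After a few rounds the weight concentrates on the systems with the lowest cumulative loss, so the same weights can drive both the choice of documents to judge and the fused ranking.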
Statistical visual feature indexes in video retrieval BIBAFull-Text 395-396
  Xiangming Mu; Gary Marchionini
Four statistical visual feature indexes are proposed: SLM (Shot Length Mean), the average length of each shot in a video; SLD (Shot Length Deviation), the standard deviation of shot lengths for a video; ONM (Object Number Mean), the average number of objects per frame of the video; and OND (Object Number Deviation), the standard deviation of the number of objects per frame across the video. Each of these indexes provides a unique perspective on video content. A novel video retrieval interface has been developed as a platform to examine our assumption that the new indexes facilitate some video retrieval tasks. Initial feedback is promising and formal experiments are planned.
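The four indexes defined in this abstract are simple summary statistics and could be computed as in the following sketch. It assumes shot lengths and per-frame object counts are already available as lists, and it uses the population standard deviation; the abstract does not say whether the population or sample form is intended. The function name and sample data are illustrative.

```python
from statistics import mean, pstdev


def video_indexes(shot_lengths, objects_per_frame):
    """Return SLM, SLD, ONM and OND for one video."""
    return {
        "SLM": mean(shot_lengths),         # Shot Length Mean
        "SLD": pstdev(shot_lengths),       # Shot Length Deviation (population form assumed)
        "ONM": mean(objects_per_frame),    # Object Number Mean
        "OND": pstdev(objects_per_frame),  # Object Number Deviation (population form assumed)
    }


# Hypothetical measurements: shot lengths in seconds, object counts per frame.
indexes = video_indexes([2, 4, 4, 4, 5, 5, 7, 9], [1, 3])
```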
Enhancing cross-language information retrieval by an automatic acquisition of bilingual terminology from comparable corpora BIBAFull-Text 397-398
  Fatiha Sadat; Masatoshi Yoshikawa; Shunsuke Uemura
This paper presents an approach to bilingual lexicon extraction from comparable corpora and evaluations on Cross-Language Information Retrieval. We explore a bi-directional extraction of bilingual terminology primarily from comparable corpora. A combined statistics-based and linguistics-based model is proposed to select the best translation candidates for phrasal translation. Evaluations using a large test collection for Japanese-English revealed the proposed combination of bi-directional comparable corpora, bilingual dictionaries and transliteration, augmented with linguistics-based pruning, to be highly effective in Cross-Language Information Retrieval.
Document-self expansion for text categorization BIBAFull-Text 399-400
  Yuen-Hsien Tseng; Da-Wei Juang
This work proposes approaches that increase the number of training examples with the aim of improving classification effectiveness. The approaches were verified on two Chinese collections using two top-performing classifiers.
An architecture for peer-to-peer information retrieval BIBFull-Text 401-402
  Iraklis A. Klampanos; Joemon M. Jose
User-trainable video annotation using multimodal cues BIBAFull-Text 403-404
  C-Y. Lin; M. Naphade; A. Natsev; C. Neti; J. R. Smith; B. Tseng; H. J. Nock; W. Adams
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result shows speech-only models give performance comparable to our best video-only models for detecting visual concepts such as "outdoors", "face" and "cityscape".
Incorporating query term dependencies in language models for document retrieval BIBFull-Text 405-406
  Munirathnam Srikanth; Rohini Srihari
Error analysis of difficult TREC topics BIBAFull-Text 407-408
  Xiao Hu; Sindhura Bandhakavi; Chengxiang Zhai
Given the experimental nature of information retrieval, progress critically depends on analyzing the errors made by existing retrieval approaches and understanding their limitations. Our research explores various hypothesized reasons for hard topics in the TREC-8 ad hoc task, and shows that the poor performance is partially due to the existence of highly distracting sub-collections that can dominate the overall performance.
XML retrieval: what to retrieve? BIBAFull-Text 409-410
  Jaap Kamps; Maarten Marx; Maarten de Rijke; Borkur Sigurbjornsson
The fundamental difference between standard information retrieval and XML retrieval is the unit of retrieval. In traditional IR, the unit of retrieval is fixed: it is the complete document. In XML retrieval, every XML element in a document is a retrievable unit. This makes XML retrieval more difficult: besides being relevant, a retrieved unit should be neither too large nor too small. The research presented here, a comparative analysis of two approaches to XML retrieval, aims to shed light on which XML elements should be retrieved. The experimental evaluation uses data from the Initiative for the Evaluation of XML retrieval (INEX 2002).
Discovering and structuring information flow among bioinformatics resources BIBAFull-Text 411-412
  Joan C. Bartlett; Elaine G. Toms
In this poster, we present a model of the flow of information among bioinformatics resources in the context of a specific scientific problem. Combining task analysis with traditional, qualitative research, we determined the extent to which the bioinformatics analysis process could be automated. The model represents a semi-automated process, involving fourteen distinct data processing steps, and forms the framework for an interface to bioinformatics information.
eBizSearch: a niche search engine for e-business BIBAFull-Text 413-414
  C. Lee Giles; Yves Petinot; Pradeep B. Teregowda; Hui Han; Steve Lawrence; Arvind Rangaswamy; Nirmal Pal
Niche search engines offer an efficient alternative to traditional search engines when the results returned by general-purpose search engines do not provide a sufficient degree of relevance. By taking advantage of their domain of concentration, they achieve higher relevance and offer enhanced features. We discuss a new niche search engine, eBizSearch, based on the technology of CiteSeer and dedicated to e-business and e-business documents. We present the integration of CiteSeer in the framework of eBizSearch and the process necessary to tune the whole system towards the specific area of e-business. We also discuss how we use machine learning algorithms to generate metadata that makes eBizSearch Open Archives compliant. eBizSearch is a publicly available service and can be reached at [3].
Single n-gram stemming BIBAFull-Text 415-416
  James Mayfield; Paul McNamee
Stemming can improve retrieval accuracy, but stemmers are language-specific. Character n-gram tokenization achieves many of the benefits of stemming in a language independent way, but its use incurs a performance penalty. We demonstrate that selection of a single n-gram as a pseudo-stem for a word can be an effective and efficient language-neutral approach for some languages.
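The idea of reducing each word to a single n-gram can be sketched as below. The abstract does not say how the n-gram is chosen; this sketch assumes that the n-gram with the lowest document frequency is selected, and the function names and the choice of n are illustrative only.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of a word (the whole word if it is not longer than n)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_document_frequencies(documents, n):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        grams = set()
        for word in doc.lower().split():
            grams.update(char_ngrams(word, n))
        df.update(grams)
    return df


def pseudo_stem(word, df, n):
    """Assumed selection rule: pick the rarest n-gram; ties go to the leftmost one."""
    grams = char_ngrams(word.lower(), n)
    return min(grams, key=lambda g: df[g])
```

Each word is then indexed under one token instead of all of its n-grams, which is where the efficiency gain over full n-gram tokenization would come from.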
Average gain ratio: a simple retrieval performance measure for evaluation with multiple relevance levels BIBFull-Text 417-418
  Tetsuya Sakai
A comparison of various approaches for using probabilistic dependencies in language modeling BIBFull-Text 419-420
  Peter Bruza; Dawei Song
Topic hierarchy generation via linear discriminant projection BIBFull-Text 421-422
  Tao Li; Shenghuo Zhu; Mitsunori Ogihara
A personalised information retrieval tool BIBAFull-Text 423-424
  Innes Martin; Joemon M. Jose
Industry professionals and everyday users of the Internet have long accepted that due to both the size and growth of this ubiquitous repository, new tools are needed to assist with the finding and extraction of very specific resources relevant to a user's task. Previously, this definition of relevance has been based on the extremely generic matching between resources and query terms, but recently the emphasis is shifting towards a more personalised model based on the relevance of a particular resource for one specific user. We introduce a prototype, Fetch, which adopts this concept within an information-seeking environment specifically designed to provide a user with the means to better describe a problem (s)he doesn't understand.
Classification of source code archives BIBAFull-Text 425-426
  Robert Krovetz; Secil Ugurel; C. Lee Giles
The World Wide Web contains a number of source code archives. Programs are usually classified into various categories within the archive by hand. We report on experiments for automatic classification of source code into these categories. We examined a number of factors that affect classification accuracy. Weighting features by expected entropy loss makes a significant improvement in classification accuracy. We show a Support Vector Machine can be trained to classify source code with a high degree of accuracy. We feel these results show promise for software reuse.
Passage retrieval vs. document retrieval for factoid question answering BIBFull-Text 427-428
  Charles L. A. Clarke; Egidio L. Terra
Evaluating retrieval performance for Japanese question answering: what are best passages? BIBKFull-Text 429-430
  Tetsuya Sakai; Tomoharu Kokubu
Keywords: passage retrieval, question answering
Image classification using hybrid neural networks BIBAFull-Text 431-432
  Chih-Fong Tsai; Ken McGarry; John Tait
Use of semantic content is one of the major issues that needs to be addressed to improve image retrieval effectiveness. We present a new approach to classifying images based on a combination of image processing techniques and hybrid neural networks. Multiple keywords are assigned to an image to represent its main contents, i.e. its semantic content. Images are divided into a number of regions, and colour and texture features are extracted. The first classifier, a self-organising map (SOM), clusters similar images based on the extracted features. Regions of the representative images of these clusters are then labeled and used to train the second classifier, composed of several support vector machines (SVMs). Initial experiments on the accuracy of keyword assignment for a small vocabulary are reported.
On an equivalence between PLSI and LDA BIBAFull-Text 433-434
  Mark Girolami; Ata Kaban
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior; therefore, the perceived shortcomings of PLSI can be resolved and elucidated within the LDA framework.
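The relationship stated in this abstract can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: n(d,w) is the count of word w in document d, \beta_{kw} is the probability of word w under topic k, and \theta_{dk} is the weight of topic k in document d.

```latex
% MAP estimate of the per-document topic proportions under LDA
\hat{\theta}_d = \arg\max_{\theta_d}\;
  \Big[ \log p(\theta_d \mid \alpha)
      + \sum_{w} n(d,w)\, \log \sum_{k} \beta_{kw}\, \theta_{dk} \Big]

% Dirichlet prior: constant on the simplex when \alpha_k = 1 for all k
p(\theta_d \mid \alpha) \propto \prod_{k} \theta_{dk}^{\alpha_k - 1}

% With the uniform prior the log-prior term drops out, and the objective
% summed over documents is the PLSI log-likelihood, up to terms that do
% not depend on \beta and \theta
\sum_{d} \sum_{w} n(d,w)\, \log \sum_{k} p(w \mid z_k)\, p(z_k \mid d),
\qquad \beta_{kw} = p(w \mid z_k), \quad \theta_{dk} = p(z_k \mid d)
```

Under this reading, maximum a posteriori estimation with a flat prior coincides with the maximum likelihood estimation used to fit PLSI.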
Query word deletion prediction BIBAFull-Text 435-436
  Rosie Jones; Daniel C. Fain
Web search query logs contain traces of users' search modifications. One strategy users employ is deleting terms, presumably to obtain greater coverage. It is useful to model and automate term deletion when arbitrary searches are conjunctively matched against a small hand-constructed collection, such as a hand-built hierarchy or a collection of high-quality pages matched with key phrases. Queries with no matches can have words deleted until a match is obtained. We provide algorithms which perform substantially better than the baseline in predicting which word should be deleted from a reformulated query, for increasing query coverage in the context of web search on small high-quality collections.
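The delete-until-match loop described above can be sketched as below. This is a minimal illustration, not the authors' prediction algorithm: the collection layout (entry id mapped to a set of terms) and the `rarest_term` rule for choosing the word to delete are hypothetical stand-ins.

```python
def conjunctive_match(query_terms, collection):
    """Return the ids of entries that contain every query term."""
    wanted = set(query_terms)
    return [doc_id for doc_id, terms in collection.items() if wanted <= terms]


def relax_query(query_terms, collection, choose_term_to_delete):
    """Delete one term at a time until the query matches or no terms remain."""
    terms = list(query_terms)
    while terms:
        matches = conjunctive_match(terms, collection)
        if matches:
            return terms, matches
        terms.remove(choose_term_to_delete(terms, collection))
    return [], []


def rarest_term(terms, collection):
    """Hypothetical rule: drop the term found in the fewest entries."""
    return min(terms, key=lambda t: sum(t in entry for entry in collection.values()))
```

Passing the deletion rule as a function keeps the loop unchanged when a learned predictor replaces the simple rule.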
Assessing the effectiveness of pen-based input queries BIBAFull-Text 437-438
  Stephen Levin; Paul Clough; Mark Sanderson
In this poster, we describe an experiment exploring the effectiveness of a pen-based text input device for use in query construction. Standard TREC queries were written, recognised, and subsequently used for retrieval. Retrieval effectiveness based on the recognised writing was compared with a typed text baseline. On average, effectiveness was 75% of the baseline. Other statistics on the quality and nature of recognition are also reported.
A light weight PDA-friendly collection fusion technique BIBAFull-Text 439-440
  Jeffery Antoniuk; Mario A. Nascimento
This short paper presents a light weight technique to merge result lists obtained from querying different databases. The motivation for such a technique is a general purpose search engine for Palm-OS based PDAs.
Speech-based and video-supported indexing of multimedia broadcast news BIBAFull-Text 441-442
  Yoshihiko Hayashi; Katsutoshi Ohtsuki; Katsuji Bessho; Osamu Mizuno; Yoshihiro Matsuo; Shoichi Matsunaga; Minoru Hayashi; Takaaki Hasegawa; Naruhiro Ikeda
This paper describes an automatic content indexing system for news programs, with a special emphasis on its segmentation process. The process can successfully segment an entire news program into topic-centered news stories; the primary tool is a linguistic topic segmentation algorithm. Experiments show that the resulting speech-based segments are fairly accurate, and scene change points supplied by an external video processor can be of help in improving segmentation effectiveness.
Summary evaluation and text categorization BIBAFull-Text 443-444
  Khurshid Ahmad; Bogdan Vrusias; Paulo C. F. de Oliveira
In general terms, the evaluation of a summary depends on how close it is to the chief points in the source text. This raises the question of what the chief points in the source text are and how this information is itself used in identifying the source text. This is crucially important when we discuss automatic evaluation of summaries. So the question is what the main points of the source text are. Typically, these would be around a nucleus of keywords. However, the salience and frequency of these keywords, and the relationship of the text with other texts in the collection, are perhaps also important. Text categorisation using neural networks explicates these points well and also has a practical impact.
Rule-based word clustering for text classification BIBAFull-Text 445-446
  Hui Han; Eren Manavoglu; C. Lee Giles; Hongyuan Zha
This paper introduces a rule-based, context-dependent word clustering method, with the rules derived from various domain databases and the word text orthographic properties. Besides significant dimensionality reduction, our experiments show that such rule-based word clustering improves by 8 the overall accuracy of extracting bibliographic fields from references, and by 18.32 on average the class-specific performance on the line classification of document headers.
HAT: a hardware assisted TOP-DOC inverted index component BIBAFull-Text 447-448
  S. Kagan Agun; Ophir Frieder
A novel Hardware Assisted Top-Doc (HAT) component is disclosed. HAT is an optimized content indexing device based on a modified inverted index structure. HAT accommodates patterns of different lengths and supports a varied posting list versus term count feature sustaining high reusability and efficiency. The developed component can be used either as an internal slave component or as an external co-processor and is efficient in resource demands as the component controllers take only a minimal percentage of the target device space leaving the majority of the space to term and posting entries. A Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) is used to model the HAT system.
An information-theoretic measure for document similarity BIBAFull-Text 449-450
  Javed A. Aslam; Meredith Frost
Recent work has demonstrated that the assessment of pairwise object similarity can be approached in an axiomatic manner using information theory. We extend this concept specifically to document similarity and test the effectiveness of an information-theoretic measure for pairwise document similarity. We adapt query retrieval to rate the quality of document similarity measures and demonstrate that our proposed information-theoretic measure for document similarity yields statistically significant improvements over other popular measures of similarity.
Optimizing term vectors for efficient and robust filtering BIBAFull-Text 451-452
  David A. Evans; Jeffrey Bennett; David A. Hull
We describe an efficient, robust method for selecting and optimizing terms for a classification or filtering task. Terms are extracted from positive examples in training data based on several alternative term-selection algorithms, then combined additively after a simple term-score normalization step to produce a merged and ranked master term vector. The score threshold for the master vector is set via beta-gamma regulation over all the available training data. The process avoids parameter calibrations and protracted training. It also results in compact profiles for run-time evaluation of test (new) documents. Results on TREC-2002 filtering-task datasets demonstrate substantial improvements over TREC-median results and rival both idealized IR-based results and optimized (and expensive) SVM-based classifiers in general effectiveness.
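The additive merging step can be sketched as below. The abstract does not specify the normalization; this sketch assumes each selector's scores are divided by that selector's maximum score, and the function names are illustrative. Beta-gamma threshold regulation is not shown.

```python
def normalize(scores):
    """Scale one selector's scores so that its best term scores 1.0 (assumed normalization)."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {term: 0.0 for term in scores}
    return {term: score / top for term, score in scores.items()}


def merge_term_vectors(score_lists):
    """Add the normalized scores from every selector and rank terms by the total."""
    merged = {}
    for scores in score_lists:
        for term, score in normalize(scores).items():
            merged[term] = merged.get(term, 0.0) + score
    # Highest merged score first; ties are broken alphabetically for determinism.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

A term favored by several selectors accumulates score from each, so it rises in the master vector without any per-selector weights having to be tuned.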
The TREC-like evaluation of music IR systems BIBAFull-Text 453-454
  J. Stephen Downie
This poster reports upon the ongoing efforts being made to establish TREC-like and other comprehensive evaluation paradigms within the Music IR (MIR) and Music Digital Library (MDL) research communities. The proposed research tasks are based upon expert opinion garnered from members of the Information Retrieval (IR), MDL and MIR communities with regard to the construction and implementation of scientifically valid evaluation frameworks.
Stemming in the language modeling framework BIBFull-Text 455-456
  James Allan; Giridhar Kumaran
Generating hierarchical summaries for web searches BIBAFull-Text 457-458
  Dawn J. Lawrie; W. Bruce Croft
Hierarchies provide a means of organizing, summarizing and accessing information. We describe a method for automatically generating hierarchies from small collections of text, and then apply this technique to summarizing the documents retrieved by a search engine.
Analysis of anchor text for web search BIBFull-Text 459-460
  Nadav Eiron; Kevin S. McCurley

Demos

User-assisted query translation for interactive CLIR BIBFull-Text 461
  Daqing He; Jianqiang Wang; Douglas W. Oard; Michael Nossal
DefScriber: a hybrid system for definitional QA BIBFull-Text 462
  Sasha Blair-Goldensohn; Kathleen R. McKeown; Andrew Hazen Schlaikjer
Querying XML using structures and keywords in timber BIBAFull-Text 463
  Cong Yu; H. V. Jagadish; Dragomir R. Radev
This demonstration will describe how Timber, a native XML database system, has been extended with the capability to answer XML-style structured queries (e.g., XQuery) with embedded IR-style keyword-based non-boolean conditions. With the original structured query processing engine and the IR extensions built into the system, Timber is well suited for efficiently and effectively processing queries with both structural and textual content constraints.
SE-LEGO: creating metasearch engines on demand BIBFull-Text 464
  Zonghuan Wu; Vijay Raghavan; Chun Du; Komanduru Sai C; Weiyi Meng; Hai He; Clement Yu
MIND: resource selection and data fusion in multimedia distributed digital libraries BIBKFull-Text 465
  Stefano Berretti; Jamie Callan; Henrik Nottelmann; Xiao Mang Shou; Shengli Wu
Keywords: data fusion, networked retrieval, resource selection
Head/modifier pairs for everyone BIBFull-Text 466
  Cornelis H. A. Koster
Document retrieval from user-selected web sites BIBAFull-Text 467
  Ulrich Bohnacker; Ingrid Renz
We present a new tool for gathering textual information according to a query (texts) on arbitrary web sites specified by an information-seeking user. This tool is helpful in any knowledge-intensive area. Its technology is based on the vector space model with optimized feature definition.
eArchivarius: accessing collections of electronic mail BIBAFull-Text 468
  Anton Leuski; Douglas W. Oard; Rahul Bhagat
We present eArchivarius, an interactive system for accessing collections of electronic mail. The system combines search, clustering visualization, and time-based visualization of email messages and the people who sent or received them.