
Proceedings of ECIR'07, the 2007 European Conference on Information Retrieval

Fullname: ECIR 2007: Advances in Information Retrieval: 29th European Conference on IR Research
Editors: Giambattista Amati; Claudio Carpineto; Giovanni Romano
Location: Rome, Italy
Dates: 2007-Apr-02 to 2007-Apr-05
Publisher: Springer Berlin Heidelberg
Series: Lecture Notes in Computer Science 4425
Standard No: DOI: 10.1007/978-3-540-71496-5; hcibib: ECIR07; ISBN: 978-3-540-71494-1 (print), 978-3-540-71496-5 (online)
Links: Online Proceedings | Conference Home Page
  1. Keynote Talks
  2. Theory and Design
  3. Efficiency
  4. Peer-to-Peer Networks (In Memory of Henrik Nottelmann)
  5. Result Merging
  6. Queries
  7. Relevance Feedback
  8. Evaluation
  9. Classification and Clustering
  10. Filtering
  11. Topic Identification
  12. Expert Finding
  13. XML IR
  14. Web IR
  15. Multimedia IR
  16. Short Papers
  17. Posters

Keynote Talks

The Next Generation Web Search and the Demise of the Classic IR Model BIBAFull-Text 1
  Andrei Broder
The classic IR model assumes a human engaged in an activity that generates an "information need". This need is verbalized and then expressed as a query to a search engine over a defined corpus. In the past decade, Web search engines have evolved from a first generation based on classic IR algorithms scaled to web size, thus supporting only informational queries; to a second generation supporting navigational queries using web-specific information (primarily link analysis); to a third generation enabling transactional and other "semantic" queries based on a variety of technologies aimed at directly satisfying the unexpressed "user intent", thus moving further and further away from the classic model.
   What is coming next? In this talk, we identify two trends, both representing "short-circuits" of the model. The first is the trend towards context-driven Information Supply (IS): the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query. The information supply concept greatly precedes information retrieval; what is new in the web framework is the ability to supply relevant information specific to a given activity and a given user, while the activity is being performed. Thus the entire verbalization and query-formation phase is eliminated. The second trend is "social search", driven by the fact that the Web has evolved into being simultaneously a huge repository of knowledge and a vast social environment. As such, it is often more effective to ask the members of a given web milieu than to construct elaborate queries. This short-circuits only the query formulation, but allows information-finding activities, such as opinion elicitation and discovery of social norms, that are not expressible at all as queries against a fixed corpus.
The Last Half-Century: A Perspective on Experimentation in Information Retrieval BIBAFull-Text 2
  Stephen Robertson
The experimental evaluation of information retrieval systems has a venerable history. Long before the current notion of a search engine, in fact before search by computer was even feasible, people in the library and information science community were beginning to tackle the evaluation issue. Sometimes it feels as though evaluation methodology has become fixed (stable or frozen, according to your viewpoint). However, this is far from the case. Interest in methodological questions is as great now as it ever was, and new ideas are continuing to develop. This talk will be a personal take on the field.
Learning in Hyperlinked Environments BIBAFull-Text 3
  Marco Gori
A remarkable number of important problems in different domains (e.g. web mining, pattern recognition, biology) are naturally modeled by functions defined on graphical domains, rather than on traditional vector spaces. Following recent developments in statistical relational learning, in this talk I introduce Diffusion Learning Machines (DLM), whose computation is closely related to Web ranking schemes based on link analysis. Using arguments from function approximation theory, I argue that DLM can, in fact, compute any conceivable ranking function on the Web. The learning is based on a human supervision scheme that takes into account both the content and the links of the pages. I give very promising experimental results on artificial tasks and on the learning of functions used in link analysis, such as PageRank, HITS, and TrustRank. Interestingly, the proposed learning mechanism proves effective also when the rank depends jointly on the page content and on the links. Finally, I argue that the propagation of the relationships expressed by the links dramatically reduces the sample complexity with respect to traditional learning machines operating on vector spaces, thus making application to real-world Web problems, such as spam detection and page classification, feasible.
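PageRank, one of the link-analysis functions the talk reports learning, is a classic example of a ranking function defined on a graph rather than a vector space. As a rough illustration of such a graph-defined function (not of the Diffusion Learning Machine itself), a minimal power-iteration PageRank over a toy web graph might look like this:

```python
# Basic PageRank power iteration on a toy directed graph.
# This illustrates a ranking function defined on a graph domain;
# it is not the DLM model described in the talk.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping node -> list of out-neighbours."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:  # distribute v's rank over its out-links
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:     # dangling node: spread its mass uniformly
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(graph)
# "c" receives the most in-links, so it ends up with the highest rank
```

Total rank mass is conserved on each iteration, so the result is a probability distribution over pages.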

Theory and Design

A Parameterised Search System BIBAFull-Text 4-15
  Roberto Cornacchia; Arjen P. de Vries
This paper introduces the concept of a Parameterised Search System (PSS), which allows flexibility in user queries, and, more importantly, allows system engineers to easily define customised search strategies. Putting this idea into practice requires a carefully designed system architecture that supports a declarative abstraction language for the specification of search strategies. These specifications should stay as close as possible to the problem definition (i.e., the retrieval model to be used in the search application), abstracting away the details of the physical organisation of data and content. We show how extending an existing XML retrieval system with an abstraction mechanism based on array databases meets this requirement.
Similarity Measures for Short Segments of Text BIBAFull-Text 16-27
  Donald Metzler; Susan Dumais; Christopher Meek
Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
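To see why purely lexical measures struggle on short segments, consider two queries with identical intent but no shared terms. The sketch below is illustrative only: the toy expansion table stands in for the richer representations (stemming, web-based expansion, language models) that the paper actually evaluates.

```python
# Plain lexical overlap fails on short text: two queries can mean the
# same thing yet share no terms. The 'related' table below is a toy
# stand-in for richer representations such as stemming or web-based
# query expansion.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

q1 = "car rental".split()
q2 = "automobile hire".split()
print(jaccard(q1, q2))  # 0.0: no shared terms despite identical intent

# Toy expansion: map each term to a set of related terms first.
related = {
    "car": {"car", "automobile", "vehicle"},
    "automobile": {"car", "automobile", "vehicle"},
    "rental": {"rental", "hire", "rent"},
    "hire": {"rental", "hire", "rent"},
}

def expand(terms):
    out = set()
    for t in terms:
        out |= related.get(t, {t})
    return out

print(jaccard(expand(q1), expand(q2)))  # 1.0 after expansion
```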
Multinomial Randomness Models for Retrieval with Document Fields BIBAFull-Text 28-39
  Vassilis Plachouras; Iadh Ounis
Document fields, such as the title or the headings of a document, offer a way to consider the structure of documents for retrieval. Most of the proposed approaches in the literature employ either a linear combination of scores assigned to different fields, or a linear combination of frequencies in the term frequency normalisation component. In the context of the Divergence From Randomness framework, we have a sound opportunity to integrate document fields in the probabilistic randomness model. This paper introduces novel probabilistic models for incorporating fields in the retrieval process using a multinomial randomness model and its information theoretic approximation. The evaluation results from experiments conducted with a standard TREC Web test collection show that the proposed models perform as well as a state-of-the-art field-based weighting model, while at the same time, they are theoretically founded and more extensible than current field-based models.
On Score Distributions and Relevance BIBAFull-Text 40-51
  Stephen Robertson
We discuss the idea of modelling the statistical distributions of scores of documents, classified as relevant or non-relevant. Various specific combinations of standard statistical distributions have been used for this purpose. Some theoretical considerations indicate problems with some of the choices of pairs of distributions. Specifically, we revisit a generalisation of the well-known inverse relationship between recall and precision: some choices of pairs of distributions violate this generalised relationship. We identify the choices and the violations, and explore some of the consequences of this theoretical view.
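A pair of distributions frequently used in this line of work is a Gaussian for relevant-document scores and an exponential for non-relevant ones. Under that (assumed, illustrative) choice, recall and precision at a score threshold follow directly from the two survival functions; parameter values below are invented for the example.

```python
# Model relevant scores as Gaussian, non-relevant scores as exponential
# (one common choice among those discussed in this literature), and
# derive recall and precision at a threshold from the survival functions.

import math

def normal_sf(x, mu, sigma):      # P(score > x) for relevant docs
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

def expon_sf(x, lam):             # P(score > x) for non-relevant docs
    return math.exp(-lam * x)

def recall_precision(threshold, mu, sigma, lam, gen):
    """gen: generality, the prior probability that a document is relevant."""
    recall = normal_sf(threshold, mu, sigma)
    fallout = expon_sf(threshold, lam)
    retrieved = gen * recall + (1 - gen) * fallout
    precision = gen * recall / retrieved if retrieved else 0.0
    return recall, precision

# Sweeping the threshold traces out a recall-precision curve; the paper's
# point is that some distribution pairs make this curve behave anomalously.
r, p = recall_precision(2.0, mu=3.0, sigma=1.0, lam=1.0, gen=0.01)
```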
Modeling Term Associations for Ad-Hoc Retrieval Performance Within Language Modeling Framework BIBAKFull-Text 52-63
  Xing Wei; W. Bruce Croft
Previous research has shown that using term associations can improve the effectiveness of information retrieval (IR) systems. However, most existing approaches focus on query reformulation; document reformulation has only recently begun to be studied. In this paper, we study how to utilize term association measures for document modeling, and what types of measures are effective in document language models. We propose a probabilistic term association measure, compare it to traditional methods, such as the similarity coefficient and window-based methods, within the language modeling (LM) framework, and show that significant improvements over query likelihood (QL) retrieval can be obtained. We also compare the method with state-of-the-art document modeling techniques based on latent mixture models.
Keywords: Information Retrieval; Language Model; Term/Word Associations/Relationships; Term/Word Similarity; Document Model; Topic Model

Efficiency

Static Pruning of Terms in Inverted Files BIBAFull-Text 64-75
  Roi Blanco; Álvaro Barreiro
This paper addresses the problem of identifying collection dependent stop-words in order to reduce the size of inverted files. We present four methods to automatically recognise stop-words, analyse the tradeoff between efficiency and effectiveness, and compare them with a previous pruning approach. The experiments allow us to conclude that in some situations stop-words pruning is competitive with respect to other inverted file reduction techniques.
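As a rough illustration of the general idea (not one of the paper's four recognition methods), a collection-dependent stop-word can be approximated as a term whose document frequency exceeds some fraction of the collection; dropping its (long) posting list shrinks the inverted file:

```python
# Minimal sketch of collection-dependent stop-word pruning: terms that
# occur in a large fraction of documents carry little discriminative
# power, so their posting lists can be dropped from the inverted file.
# A simple document-frequency cutoff is used here purely for illustration.

from collections import defaultdict

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
    "a quantum retrieval model",
]

index = defaultdict(set)                 # term -> set of doc ids
for doc_id, text in enumerate(docs):
    for term in text.split():
        index[term].add(doc_id)

n_docs = len(docs)
pruned = {t: ps for t, ps in index.items()
          if len(ps) / n_docs <= 0.5}    # drop terms in >50% of docs

removed = set(index) - set(pruned)
# 'the' appears in 3 of the 4 documents, so it is pruned as a
# collection-dependent stop-word; rare terms like 'quantum' survive.
```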
Efficient Indexing of Versioned Document Sequences BIBAFull-Text 76-87
  Michael Herscovici; Ronny Lempel; Sivan Yogev
Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Often, it is desirable to enable free-text search over such repositories, i.e. to enable queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities.
Light Syntactically-Based Index Pruning for Information Retrieval BIBAFull-Text 88-100
  Christina Lioma; Iadh Ounis
Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of 'blocks of parts of speech' (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks which also contain 'non-content-bearing parts of speech', such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least comparably to other related work, even though it does not consider document-specific information, and as such it is more general.
Sorting Out the Document Identifier Assignment Problem BIBAFull-Text 101-112
  Fabrizio Silvestri
The compression of inverted file indexes in Web search engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we empirically show that, in the case of collections of Web documents, we can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a quite large document collection composed of about six million documents. The results are very encouraging, since we can improve the compression ratio by up to 40% using an algorithm that takes about ninety seconds to finish using only 100 MB of main memory.
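The intuition can be demonstrated with a toy posting list: documents from the same site share many terms, so URL-ordered identifiers place them adjacently and the d-gaps in each posting list become small, which costs fewer bytes under variable-byte encoding. The identifiers below are fabricated for illustration:

```python
# Why URL-ordered document identifiers compress better: small d-gaps
# need fewer bytes under variable-byte encoding.

def vbyte_len(n):
    """Bytes needed to variable-byte encode a non-negative integer."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length

def index_size(postings_lists):
    """Total bytes for all d-gap-encoded posting lists."""
    total = 0
    for postings in postings_lists:
        prev = 0
        for doc_id in sorted(postings):
            total += vbyte_len(doc_id - prev)  # encode the gap, not the id
            prev = doc_id
    return total

# The same five documents (e.g. pages of one site sharing a term) under
# two identifier assignments: scattered ids from a random assignment
# versus contiguous ids once documents are sorted by URL.
random_ids = [[7, 120, 4031, 9000, 15260]]
sorted_ids = [[200, 201, 202, 203, 204]]
assert index_size(sorted_ids) < index_size(random_ids)
```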
Efficient Construction of FM-index Using Overlapping Block Processing for Large Scale Texts BIBAKFull-Text 113-123
  Di Zhang; Yunquan Zhang; Jing Chen
In previous implementations of the FM-index, the construction algorithms usually need several times more memory than the text size. This memory requirement can prevent the FM-index from being employed for large-scale texts. In this paper, we design an approach to constructing the FM-index based on overlapping block processing. It can build the FM-index in linear time and constant temporary memory space, making it especially suitable for large-scale texts. Instead of loading and indexing the text as a whole, the new approach splits the text into blocks of fixed size, and then indexes them separately. To ensure the correctness and effectiveness of query operations, before indexing we append a certain number of succeeding characters to the end of each block. The experimental results show that, with a slight loss in compression ratio and query performance, our implementation provides a faster and more flexible solution to the problem of construction efficiency.
Keywords: FM-index; Self-index; Block processing
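The overlapping-block principle can be sketched independently of the FM-index itself: split the text into fixed-size blocks, append pattern-length-minus-one succeeding characters to each, and no occurrence straddling a block boundary is lost. Plain substring search stands in here for the per-block FM-index of the paper:

```python
# Overlapping-block search: each block is extended with the next m-1
# characters so matches crossing a block boundary are still found.
# Plain substring search stands in for a per-block FM-index.

def block_search(text, pattern, block_size):
    m = len(pattern)
    hits = set()                     # a set deduplicates overlap matches
    for start in range(0, len(text), block_size):
        # overlap of m-1 characters guarantees boundary-crossing matches
        block = text[start:start + block_size + m - 1]
        pos = block.find(pattern)
        while pos != -1:
            hits.add(start + pos)
            pos = block.find(pattern, pos + 1)
    return sorted(hits)

text = "abracadabra" * 3
assert block_search(text, "abra", 5) == [i for i in range(len(text))
                                         if text.startswith("abra", i)]
```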

Peer-to-Peer Networks (In Memory of Henrik Nottelmann)

Performance Comparison of Clustered and Replicated Information Retrieval Systems BIBAKFull-Text 124-135
  Fidel Cacheda; Victor Carneiro; Vassilis Plachouras; Iadh Ounis
The amount of information available over the Internet is increasing daily, as are the importance and magnitude of Web search engines. Systems based on a single centralised index present several problems (such as lack of scalability), which lead to the use of distributed information retrieval systems to effectively search for and locate the required information. A distributed retrieval system can be clustered and/or replicated. In this paper, using simulations, we present a detailed performance analysis, in terms of both throughput and response time, of a clustered system compared to a replicated system. In addition, we consider the effect of changes in the query topics over time. We show that the performance obtained by a clustered system does not improve on that obtained by the best replicated system. Indeed, the main advantage of a clustered system is the reduction of network traffic. However, the use of a switched network eliminates the bottleneck in the network, markedly improving the performance of the replicated systems. Moreover, we illustrate the negative performance effect of changes in the query topics over time when a distributed clustered system is used. In contrast, the performance of a distributed replicated system is query independent.
Keywords: distributed information retrieval; performance; simulation
A Study of a Weighting Scheme for Information Retrieval in Hierarchical Peer-to-Peer Networks BIBAFull-Text 136-147
  Massimo Melucci; Alberto Poggiani
The experimental results show that the proposed simple weighting scheme helps retrieve a significant proportion of relevant data after traversing only a small portion of a hierarchical peer-to-peer network in a depth-first manner. A real, large, highly heterogeneous test collection searched with very short, ambiguous queries was used to support the results. The efficiency and effectiveness suggest implementation in, for instance, audio-video information retrieval systems, digital libraries or personal archives.
A Decision-Theoretic Model for Decentralised Query Routing in Hierarchical Peer-to-Peer Networks BIBAFull-Text 148-159
  Henrik Nottelmann; Norbert Fuhr
Efficient and effective routing of content-based queries is an emerging problem in peer-to-peer networks, and can be seen as an extension of the traditional "resource selection" problem. The decision-theoretic framework for resource selection aims, in contrast to other approaches, at minimising overall costs including e.g. monetary costs, time and retrieval quality. A variant of this framework has been successfully applied to hierarchical peer-to-peer networks (where peers are partitioned into DL peers and hubs), but that approach considers retrieval quality only. This paper proposes a new model which is capable of considering also the time costs of hubs (i.e., the number of hops in subsequent steps). The evaluation on a large test-bed shows that this approach dramatically reduces the overall retrieval costs.
Central-Rank-Based Collection Selection in Uncooperative Distributed Information Retrieval BIBAFull-Text 160-172
  Milad Shokouhi
Collection selection is one of the key problems in distributed information retrieval. Due to resource constraints, it is not usually feasible to search all collections in response to a query. Therefore, the central component (broker) selects a limited number of collections to be searched for the submitted queries. During the past decade, several collection selection algorithms have been introduced, but their performance varies on different testbeds. We propose a new collection-selection method based on the ranking of downloaded sample documents. We test our method on six testbeds and show that our technique can significantly outperform other state-of-the-art algorithms in most cases. We also introduce a new testbed based on the TREC GOV2 documents.

Result Merging

Results Merging Algorithm Using Multiple Regression Models BIBAKFull-Text 173-184
  George Paltoglou; Michail Salampasis; Maria Satratzemi
This paper describes a new algorithm for merging the results of remote collections in a distributed information retrieval environment. The algorithm makes use only of the ranks of the returned documents, thus making it very efficient in environments where the remote collections provide the minimum of cooperation. Assuming that the correlation between the ranks and the relevancy scores can be expressed through a logistic function and using sampled documents from the remote collections the algorithm assigns local scores to the returned ranked documents. Subsequently, using a centralized sample collection and through linear regression, it assigns global scores, thus producing a final merged document list for the user. The algorithm's effectiveness is measured against two state-of-the-art results merging algorithms and its performance is found to be superior to them in environments where the remote collections do not provide relevancy scores.
Keywords: Distributed Information Retrieval; Results Merging; Algorithms
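A rough sketch of the two-stage mapping described above: ranks are first converted to local scores by a logistic function (coefficients invented here), then calibrated to globally comparable scores by linear regression over documents that also occur in a centralized sample index. The overlap data is fabricated for illustration:

```python
# Rank-based result merging in the spirit of the algorithm: logistic
# rank-to-score mapping, then linear calibration against a central sample.

import math

def local_score(rank, a=0.25, b=-1.0):
    # logistic mapping: rank 1 scores highest, later ranks decay smoothly
    return 1.0 / (1.0 + math.exp(a * rank + b))

def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Documents found in both the remote result list and the central sample:
# (rank in remote list, score from the central sample index)
overlap = [(1, 0.92), (3, 0.70), (6, 0.41), (9, 0.25)]
slope, intercept = fit_linear([local_score(r) for r, _ in overlap],
                              [s for _, s in overlap])

def global_score(rank):
    return slope * local_score(rank) + intercept

merged = sorted(range(1, 11), key=global_score, reverse=True)
# the mapping is monotone in rank, so the merged ordering preserves
# each collection's internal rank order
```

With several collections, each gets its own calibration, and documents are merged by their global scores.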
Segmentation of Search Engine Results for Effective Data-Fusion BIBAFull-Text 185-197
  Milad Shokouhi
Metasearch and data-fusion techniques combine the rank lists of multiple document retrieval systems with the aim of improving search coverage and precision.
   We propose a new fusion method that partitions the rank lists of document retrieval systems into chunks. The size of chunks grows exponentially in the rank list. Using a small number of training queries, the probabilities of relevance of documents in different chunks are approximated for each search system. The estimated probabilities and normalized document scores are used to compute the final document ranks in the merged list. We show that our proposed method produces higher average precision values than previous systems across a range of testbeds.

Queries

Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions BIBAFull-Text 198-209
  Javed A. Aslam; Virgil Pavlu
We consider the issue of query performance, and we propose a novel method for automatically predicting the difficulty of a query. Unlike a number of existing techniques which are based on examining the ranked lists returned in response to perturbed versions of the query with respect to the given collection or perturbed versions of the collection with respect to the given query, our technique is based on examining the ranked lists returned by multiple scoring functions (retrieval engines) with respect to the given query and collection. In essence, we propose that the results returned by multiple retrieval engines will be relatively similar for "easy" queries but more diverse for "difficult" queries. By appropriately employing Jensen-Shannon divergence to measure the "diversity" of the returned results, we demonstrate a methodology for predicting query difficulty whose performance exceeds existing state-of-the-art techniques on TREC collections, often remarkably so.
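The core idea can be sketched as follows, with fabricated rank lists: each engine's list becomes a probability distribution over documents (1/rank weights, normalized here; the paper's exact construction may differ), and the average pairwise Jensen-Shannon divergence serves as the hardness estimate:

```python
# Query hardness as disagreement among retrieval engines, measured by
# average pairwise Jensen-Shannon divergence of rank-list distributions.

import math

def rank_list_to_dist(ranked_docs, universe):
    weights = {d: 0.0 for d in universe}
    for i, d in enumerate(ranked_docs, start=1):
        weights[d] = 1.0 / i          # higher ranks get more mass
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}

def js_divergence(p, q):
    def kl(a, b):
        return sum(a[d] * math.log2(a[d] / b[d]) for d in a if a[d] > 0)
    m = {d: 0.5 * (p[d] + q[d]) for d in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hardness(rank_lists):
    universe = {d for lst in rank_lists for d in lst}
    dists = [rank_list_to_dist(lst, universe) for lst in rank_lists]
    pairs = [(i, j) for i in range(len(dists))
             for j in range(i + 1, len(dists))]
    return sum(js_divergence(dists[i], dists[j]) for i, j in pairs) / len(pairs)

easy = hardness([["d1", "d2", "d3"], ["d1", "d2", "d3"]])   # engines agree
hard = hardness([["d1", "d2", "d3"], ["d4", "d5", "d6"]])   # engines disagree
assert easy < hard
```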
Query Reformulation and Refinement Using NLP-Based Sentence Clustering BIBAFull-Text 210-221
  Frédéric Roulland; Aaron Kaplan; Stefania Castellani; Claude Roux; Antonietta Grasso; Karin Pettersson; Jacki O'Neill
We have developed an interactive query refinement tool that helps users search a knowledge base for solutions to problems with electronic equipment. The system is targeted towards non-technical users, who are often unable to formulate precise problem descriptions on their own. Two distinct but interrelated functionalities support the refinement of a vague, non-technical initial query into a more precise problem description: a synonymy mechanism that allows the system to match non-technical words in the query with corresponding technical terms in the knowledge base, and a novel refinement mechanism that helps the user build up successively longer and more precise problem descriptions starting from the seed of the initial query. A natural language parser is used both in the application of context-sensitive synonymy rules and the construction of the refinement tree.
Automatic Morphological Query Expansion Using Analogy-Based Machine Learning BIBAKFull-Text 222-233
  Fabienne Moreau; Vincent Claveau; Pascale Sébillot
Information retrieval systems (IRSs) usually suffer from a low ability to recognize the same idea expressed in different forms. One way of improving these systems is to take morphological variants into account. We propose here a simple yet effective method to recognize these variants, which are then used to enrich queries. In comparison with previously published methods, our system does not need any external resources or a priori knowledge and thus supports many languages. This new approach is evaluated on several collections in six different languages and is compared to existing tools such as a stemmer and a lemmatizer. The reported results show a significant and systematic improvement in overall IRS effectiveness, in terms of both precision and recall, for every language.
Keywords: Morphological variation; query expansion; analogy-based machine learning; unsupervised machine learning
Advanced Structural Representations for Question Classification and Answer Re-ranking BIBAFull-Text 234-245
  Silvia Quarteroni; Alessandro Moschitti; Suresh Manandhar; Roberto Basili
In this paper, we study novel structures to represent information in three vital question answering tasks: question classification, answer classification and answer re-ranking. We define a new tree structure called PAS to represent predicate-argument relations, as well as a new kernel function to exploit its representative power. Our experiments with Support Vector Machines and several tree kernel functions suggest that syntactic information helps specific tasks such as question classification, whereas, when data sparseness is higher, as in answer classification, coarser semantic information such as PAS is a promising research direction.

Relevance Feedback

Incorporating Diversity and Density in Active Learning for Relevance Feedback BIBAFull-Text 246-257
  Zuobing Xu; Ram Akella; Yi Zhang
Relevance feedback, which uses the terms in relevant documents to enrich the user's initial query, is an effective method for improving retrieval performance. An associated key research problem is the following: which documents should be presented to the user so that the user's feedback on them can significantly improve relevance feedback performance? This paper views this as an active learning problem and proposes a new algorithm which can efficiently maximize the learning benefits of relevance feedback. The algorithm chooses a set of feedback documents based on relevancy, document diversity and document density. Experimental results show a statistically significant and appreciable improvement in the performance of our new approach over existing active feedback methods.
Relevance Feedback Using Weight Propagation Compared with Information-Theoretic Query Expansion BIBAKFull-Text 258-270
  Fadi Yamout; Michael Oakes; John Tait
A new Relevance Feedback (RF) technique called Weight Propagation has been developed which provides greater retrieval effectiveness and computational efficiency than previously described techniques. Documents judged relevant by the user propagate positive weights to documents close by in vector-similarity space, while documents judged not relevant propagate negative weights to such neighbouring documents. Retrieval effectiveness is improved because the documents are treated as independent vectors, rather than being merged into a single vector as in traditional vector-model RF techniques, or having their relevancy determined in part by the lengths of all the documents as in traditional probabilistic RF techniques. Improving the computational efficiency of relevance feedback by considering only documents in a given neighbourhood means that the Weight Propagation technique can be used with large collections.
Keywords: Relevance Feedback; Rocchio; Ide; Deviation From Randomness
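A minimal sketch of the propagation idea, with toy term vectors and cosine similarity standing in for the paper's exact scheme: judged documents send signed weights to unjudged neighbours above a similarity threshold. All data and the threshold are invented for illustration:

```python
# Judged documents propagate signed weights to nearby unjudged documents
# in vector-similarity space; cosine similarity over toy term vectors
# stands in for the paper's exact propagation scheme.

import math

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "d1": {"jaguar": 2, "car": 3},
    "d2": {"jaguar": 1, "speed": 2, "car": 1},
    "d3": {"jaguar": 2, "cat": 3, "jungle": 1},
}
judgements = {"d1": +1.0, "d3": -1.0}   # user feedback on two documents

def propagated_weight(doc_id, threshold=0.3):
    """Sum of signed weights from judged docs in doc_id's neighbourhood."""
    total = 0.0
    for judged, weight in judgements.items():
        sim = cosine(docs[judged], docs[doc_id])
        if judged != doc_id and sim >= threshold:
            total += weight * sim
    return total

# d2 (about cars) is pulled up by the relevant d1 and left untouched by
# the non-relevant d3 (about the animal), whose similarity falls below
# the neighbourhood threshold.
score_d2 = propagated_weight("d2")
```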

Evaluation

A Retrieval Evaluation Methodology for Incomplete Relevance Assessments BIBAFull-Text 271-282
  Mark Baillie; Leif Azzopardi; Ian Ruthven
In this paper we propose an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. This new protocol aims to identify potential uncertainty during system comparison that may result from incompleteness. We demonstrate how this methodology can lead towards a finer-grained analysis of systems. This is advantageous, because the detection of uncertainty during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections.
Evaluating Query-Independent Object Features for Relevancy Prediction BIBAFull-Text 283-294
  Andres R. Masegosa; Hideo Joho; Joemon M. Jose
This paper presents a series of experiments investigating the effectiveness of query-independent features extracted from retrieved objects for predicting relevancy. Features were grouped into a set of conceptual categories and individually evaluated based on click-through data collected in a laboratory-setting user study. The results showed that while textual and visual features were useful for relevancy prediction in a topic-independent condition, a wider range of features was effective when topic knowledge was available. We also revisited the original study from the perspective of the significant features identified by our experiments.

Classification and Clustering

The Utility of Information Extraction in the Classification of Books BIBAKFull-Text 295-306
  Tom Betts; Maria Milosavljevic; Jon Oberlander
We describe work on automatically assigning classification labels to books using the Library of Congress Classification scheme. This task is non-trivial due to the volume and variety of books that exist. We explore the utility of Information Extraction (IE) techniques within this text categorisation (TC) task, automatically extracting structured information from the full text of books. Experimental evaluation of performance involves a corpus of books from Project Gutenberg. Results indicate that a classifier which combines methods and tools from IE and TC significantly improves over a state-of-the-art text classifier, achieving a classification performance of Fβ=1=0.8099.
Keywords: Information Extraction; Named Entity Recognition; Book Categorisation; Project Gutenberg; Ontologies; Digital Libraries
Combined Syntactic and Semantic Kernels for Text Classification BIBAFull-Text 307-318
  Stephan Bloehdorn; Alessandro Moschitti
The exploitation of syntactic structures and semantic background knowledge has always been an appealing subject in the context of text retrieval and information management. The usefulness of this kind of information has been shown most prominently in highly specialized tasks, such as classification in Question Answering (QA) scenarios. So far, however, additional syntactic or semantic information has been used only individually. In this paper, we propose a principled approach for jointly exploiting both types of information. We propose a new type of kernel, the Semantic Syntactic Tree Kernel (SSTK), which incorporates linguistic structures, e.g. syntactic dependencies, and semantic background knowledge, e.g. term similarity based on WordNet, to automatically learn question categories in QA. We show the power of this approach in a series of experiments with a well known Question Classification dataset.
Fast Large-Scale Spectral Clustering by Sequential Shrinkage Optimization BIBAFull-Text 319-330
  Tie-Yan Liu; Huai-Yuan Yang; Xin Zheng; Tao Qin; Wei-Ying Ma
In many applications, we need to cluster large-scale data objects. However, some recently proposed clustering algorithms such as spectral clustering can hardly handle large-scale applications due to the complexity issue, although their effectiveness has been demonstrated in previous work. In this paper, we propose a fast solver for spectral clustering. In contrast to traditional spectral clustering algorithms that first solve an eigenvalue decomposition problem, and then employ a clustering heuristic to obtain labels for the data points, our new approach sequentially decides the labels of relatively well-separated data points. Because the scale of the problem shrinks quickly during this process, it can be much faster than the traditional methods. Experiments on both synthetic data and a large collection of product records show that our algorithm can achieve significant improvement in speed as compared to traditional spectral clustering algorithms.
A Probabilistic Model for Clustering Text Documents with Multiple Fields BIBAFull-Text 331-342
  Shanfeng Zhu; Ichigaku Takigawa; Shuqin Zhang; Hiroshi Mamitsuka
We address the problem of clustering documents with multiple fields, such as scientific literature with the distinct fields: title, abstract, keywords, main text and references. By taking into consideration the distinct word distributions of each field, we propose a new probabilistic model, Field Independent Clustering Model (FICM), for clustering documents with multiple fields. The benefits of FICM come not only from integrating the discrimination abilities of each field but also from the power of selecting the most suitable component probabilistic model for each field. We examined the performance of FICM on the problem of clustering biomedical documents with three fields (title, abstract and MeSH). From the genomics track data of TREC 2004 and TREC 2005, we randomly generated 60 datasets where the number of classes in each dataset ranged from 3 to 12. By applying the appropriate configuration of generative models for each field, FICM outperformed a classical multinomial model in 59 out of the total 60 datasets, of which 47 were statistically significant at the 95% level, and FICM also outperformed a multivariate Bernoulli model in 52 out of the total 60 datasets, of which 36 were statistically significant at the 95% level.

Filtering

Personalized Communities in a Distributed Recommender System BIBAFull-Text 343-355
  Sylvain Castagnos; Anne Boyer
The amount of data in information systems increases exponentially, and it becomes more and more difficult to extract the most relevant information within a very short time. Among others, collaborative filtering processes help users to find interesting items by modeling their preferences and by comparing them with users having the same tastes. Nevertheless, there are a lot of aspects to consider when implementing such a recommender system, among them the number of potential users and the confidential nature of some data. This paper introduces a new distributed recommender system based on a user-based filtering algorithm. Our model has been transposed to Peer-to-Peer architectures. It has been especially designed to deal with problems of scalability and privacy. Moreover, it adapts its prediction computations to the density of the user neighborhood.
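As background, the centralized user-based filtering step that the paper distributes over P2P can be sketched as follows; the Pearson-weighted prediction rule is the standard formulation, and the ratings data is a toy example.

```python
# User-based collaborative filtering: predict a user's rating for an
# item as a correlation-weighted average of neighbour deviations.
from math import sqrt

def pearson(a, b):
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    da = sqrt(sum((a[i] - ma) ** 2 for i in common))
    db = sqrt(sum((b[i] - mb) ** 2 for i in common))
    return num / (da * db) if da and db else 0.0

def predict(ratings, user, item):
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        w = pearson(ratings[user], r)
        if w <= 0:
            continue  # ignore dissimilar or uncorrelated users
        mean_o = sum(r.values()) / len(r)
        num += w * (r[item] - mean_o)
        den += w
    if den == 0.0:
        return mean_u
    return mean_u + num / den

ratings = {
    "alice": {"a": 5, "b": 4, "c": 1},
    "bob":   {"a": 5, "b": 5, "c": 1, "d": 4},
    "carol": {"a": 1, "b": 1, "c": 5, "d": 1},
}
score = predict(ratings, "alice", "d")
```

Here only "bob" (who agrees with "alice") contributes to the prediction for item "d"; "carol" is negatively correlated and is ignored.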
Information Recovery and Discovery in Collaborative Web Search BIBAFull-Text 356-367
  Maurice Coyle; Barry Smyth
When we search for information we are usually either trying to recover something that we have found in the past or trying to discover some new information. In this paper we will evaluate how the collaborative Web search technique, which personalizes search results for communities of like-minded users, can help in recovery- and discovery-type search tasks in a corporate search scenario.
Collaborative Filtering Based on Transitive Correlations Between Items BIBAFull-Text 368-380
  Alexandros Nanopoulos
With existing collaborative filtering algorithms, a user has to rate a sufficient number of items, before receiving reliable recommendations. To overcome this limitation, we provide the insight that correlations between items can form a network, in which we examine transitive correlations between items. The emergence of power laws in such networks signifies the existence of items with substantially more transitive correlations. The proposed algorithm finds highly correlative items and provides effective recommendations by adapting to user preferences. We also develop pruning criteria that reduce computation time. Detailed experimental results illustrate the superiority of the proposed method.
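A two-hop version of the idea can be sketched as follows; scoring a transitive path by the product of its edge weights, and the toy correlation graph, are assumptions for illustration rather than the paper's exact network model.

```python
# Item-item correlation graph: a "transitive" correlation between items
# i and k is the best two-hop path i -> j -> k, scored by the product of
# edge weights. Items with no direct correlation can still be reached.

def transitive_scores(corr, item):
    direct = corr.get(item, {})
    scores = dict(direct)
    for j, w1 in direct.items():
        for k, w2 in corr.get(j, {}).items():
            if k == item:
                continue
            cand = w1 * w2
            if cand > scores.get(k, 0.0):
                scores[k] = cand
    return scores

corr = {
    "a": {"b": 0.9},
    "b": {"a": 0.9, "c": 0.8},
    "c": {"b": 0.8},
}
s = transitive_scores(corr, "a")
```

Item "c" is not directly correlated with "a", yet it receives a recommendation score through "b"; this is the mechanism that helps users who have rated only a few items.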
Entropy-Based Authorship Search in Large Document Collections BIBAFull-Text 381-392
  Ying Zhao; Justin Zobel
The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries based on topic, and are not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory. We propose relative entropy of style markers for ranking, inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found from within large collections. Although effectiveness does degrade as collection size is increased, with even 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and is much more scalable than state-of-the-art approaches in terms of the collection size and the number of candidate authors.
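A minimal sketch of relative-entropy ranking over style markers, assuming smoothed maximum-likelihood distributions and a toy function-word vocabulary (the actual style markers and smoothing in the paper differ); lower divergence from the author profile means a better match.

```python
# Rank documents by KL divergence between a smoothed distribution of
# style markers for a known author and the same distribution estimated
# from each candidate document.
from math import log

def kl(p_counts, q_counts, vocab, eps=1e-3):
    def smooth(c):  # additively smoothed probability distribution
        total = sum(c.get(w, 0) for w in vocab) + eps * len(vocab)
        return {w: (c.get(w, 0) + eps) / total for w in vocab}
    p, q = smooth(p_counts), smooth(q_counts)
    return sum(p[w] * log(p[w] / q[w]) for w in vocab)

vocab = ["the", "of", "and", "which", "whilst"]
author = {"the": 30, "of": 20, "and": 10, "whilst": 5}
doc_same = {"the": 6, "of": 4, "and": 2, "whilst": 1}   # same proportions
doc_other = {"the": 2, "of": 1, "and": 9, "which": 8}   # different style
```

The document with the same style-marker proportions as the author profile gets near-zero divergence and is ranked first.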

Topic Identification

Use of Topicality and Information Measures to Improve Document Representation for Story Link Detection BIBAFull-Text 393-404
  Chirag Shah; Koji Eguchi
Several information organization, access, and filtering systems can benefit from document representations different from those used in traditional Information Retrieval (IR). Topic Detection and Tracking (TDT) is an example of such a domain. In this paper we demonstrate that traditional methods for term weighting do not capture topical information, which leads to inadequate representation of documents for TDT applications. We present various hypotheses regarding the factors that can help in improving the document representation for Story Link Detection (SLD) -- a core task of TDT. These hypotheses are tested using various TDT corpora. From our experiments and analysis we found that in order to obtain a faithful representation of documents in the TDT domain, we not only need to capture a term's importance in the traditional IR sense, but also evaluate its topical behavior. Along with defining this behavior, we propose a novel measure that captures a term's importance at the corpus level as well as its discriminating power for topics. This new measure leads to a much better document representation, as reflected by the significant improvements in the results.
Ad Hoc Retrieval of Documents with Topical Opinion BIBAFull-Text 405-417
  Jason Skomorowski; Olga Vechtomova
With a growing amount of subjective content distributed across the Web, there is a need for a domain-independent information retrieval system that would support ad hoc retrieval of documents expressing opinions on a specific topic of the user's query. In this paper we present a lightweight method for ad hoc retrieval of documents which contain subjective content on the topic of the query. Documents are ranked by the likelihood each document expresses an opinion on a query term, approximated as the likelihood any occurrence of the query term is modified by a subjective adjective. Domain-independent user-based evaluation of the proposed method was conducted, and shows statistically significant gains over the baseline system.
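The core estimate can be sketched as the fraction of query-term occurrences immediately preceded by a subjective adjective. The tiny lexicon and the simple adjacency test below are simplifications of the paper's method (which uses a full subjective-adjective list and syntactic modification).

```python
# Estimate the probability that a query-term occurrence is opinionated
# as the fraction of its occurrences modified by a subjective adjective.

SUBJECTIVE = {"great", "terrible", "lovely", "awful", "boring"}  # toy lexicon

def opinion_score(doc_tokens, query_term):
    hits = modified = 0
    for i, tok in enumerate(doc_tokens):
        if tok == query_term:
            hits += 1
            if i > 0 and doc_tokens[i - 1] in SUBJECTIVE:
                modified += 1
    return modified / hits if hits else 0.0

d1 = "this is a great camera a truly lovely camera".split()
d2 = "the camera ships with a camera strap".split()
```

Document d1 mentions the query term only in opinionated contexts and would be ranked above d2, which mentions it factually.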

Expert Finding

Probabilistic Models for Expert Finding BIBAFull-Text 418-430
  Hui Fang; ChengXiang Zhai
A common task in many applications is to find persons who are knowledgeable about a given topic (i.e., expert finding). In this paper, we propose and develop a general probabilistic framework for studying expert finding problem and derive two families of generative models (candidate generation models and topic generation models) from the framework. These models subsume most existing language models proposed for expert finding. We further propose several techniques to improve the estimation of the proposed models, including incorporating topic expansion, using a mixture model to model candidate mentions in the supporting documents, and defining an email count-based prior in the topic generation model. Our experiments show that the proposed estimation strategies are all effective to improve retrieval accuracy.
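A document-centric scorer in the candidate-generation style can be sketched as follows; the sum-over-documents form is one simple instance of the family of models described, and the relevance and association values below are toy numbers.

```python
# Document-centric expert scoring: approximate p(candidate | query) by
# summing, over supporting documents, the document's relevance to the
# query times the candidate's association with that document.

def expert_scores(doc_rel, doc_cand):
    scores = {}
    for doc, rel in doc_rel.items():
        for cand, assoc in doc_cand.get(doc, {}).items():
            scores[cand] = scores.get(cand, 0.0) + rel * assoc
    return scores

doc_rel = {"d1": 0.9, "d2": 0.5, "d3": 0.1}   # p(query | document)
doc_cand = {                                   # candidate-document association
    "d1": {"ann": 1.0},
    "d2": {"ann": 0.5, "bob": 1.0},
    "d3": {"bob": 1.0},
}
scores = expert_scores(doc_rel, doc_cand)
```

"ann" is mentioned in the most relevant documents and outranks "bob", who is associated mainly with weakly relevant ones.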
Using Relevance Feedback in Expert Search BIBAFull-Text 431-443
  Craig Macdonald; Iadh Ounis
In Enterprise settings, expert search is considered an important task. In this search task, the user has a need for expertise -- for instance, they require assistance from someone about a topic of interest. An expert search system assists users with their "expertise need" by suggesting people with relevant expertise to the topic of interest. In this work, we apply an expert search approach that does not explicitly rank candidates in response to a query, but instead implicitly ranks candidates by taking into account a ranking of documents with respect to the query topic. Pseudo-relevance feedback, aka query expansion, has been shown to improve retrieval performance in ad hoc search tasks. In this work, we investigate to what extent query expansion can be applied in an expert search task to improve the accuracy of the generated ranking of candidates. We define two approaches for query expansion: one based on the initial ranking of documents for the query topic, the other based on the final ranking of candidates. The aims of this paper are two-fold: firstly, to determine if query expansion can be successfully applied in the expert search task, and secondly, to ascertain if either of the two forms of query expansion can provide robust, improved retrieval performance. We perform a thorough evaluation contrasting the two query expansion approaches in the context of the TREC 2005 and 2006 Enterprise tracks.
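The document-side feedback step can be sketched as follows; taking raw term frequencies from the top-k documents is a simplification (real systems use weighted expansion models such as Bo1 or Rocchio), and the documents below are toy data.

```python
# Pseudo-relevance feedback: assume the top-k ranked documents are
# relevant, and add their most frequent non-query terms to the query.
from collections import Counter

def expand(query, ranked_docs, k=2, n_terms=2):
    counts = Counter()
    for doc in ranked_docs[:k]:
        counts.update(t for t in doc if t not in query)
    return list(query) + [t for t, _ in counts.most_common(n_terms)]

query = ["expert", "search"]
ranked = [
    "expert search enterprise intranet staff".split(),
    "enterprise expertise finding intranet".split(),
    "cooking recipes".split(),
]
new_query = expand(query, ranked)
```

Terms frequent in the top-ranked documents ("enterprise", "intranet") are appended, while terms from low-ranked documents are ignored.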

XML IR

Using Topic Shifts for Focussed Access to XML Repositories BIBAFull-Text 444-455
  Elham Ashoori; Mounia Lalmas
In focussed XML retrieval, a retrieval unit is an XML element that not only contains information relevant to a user query, but also is specific to the query. INEX defines a relevant element to be at the right level of granularity if it is exhaustive and specific to the user's request -- i.e., it discusses fully the topic requested in the user's query and no other topics. The exhaustivity and specificity dimensions are both expressed in terms of the "quantity" of topics discussed within each element. We therefore propose to use the number of topic shifts in an XML element to express the "quantity" of topics discussed in an element, as a means to capture specificity. We experimented with a number of element-specific smoothing methods within the language modelling framework. These methods enable us to adjust the amount of smoothing required for each XML element depending on its number of topic shifts, to capture specificity. Using the number of topic shifts combined with element length improves retrieval effectiveness, thus indicating that the number of topic shifts is useful evidence in focussed XML retrieval.
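One way to make smoothing element-specific can be sketched with a Dirichlet-style language model whose smoothing parameter grows with the number of topic shifts; the proportional rule below is an invented illustration, not the paper's actual setting.

```python
# Element-specific Dirichlet smoothing: the parameter mu for each XML
# element grows with its number of topic shifts, so elements that
# discuss many topics lean more heavily on the collection model.
from math import log

def element_score(query, elem_tf, elem_len, coll_prob, topic_shifts,
                  mu0=100.0):
    mu = mu0 * topic_shifts  # illustrative: mu proportional to shifts
    s = 0.0
    for t in query:
        p = (elem_tf.get(t, 0) + mu * coll_prob[t]) / (elem_len + mu)
        s += log(p)
    return s
```

With identical term statistics, an element with fewer topic shifts (more specific) is smoothed less and scores higher for a term it actually contains.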
Feature- and Query-Based Table of Contents Generation for XML Documents BIBAFull-Text 456-467
  Zoltán Szlávik; Anastasios Tombros; Mounia Lalmas
The availability of a document's logical structure in XML retrieval allows retrieval systems to return document portions (elements) instead of whole documents. This helps searchers focus their attention on the relevant content within a document. However, other elements, e.g. siblings or parents of retrieved elements, may also be important as they provide context to the retrieved elements. A table of contents (TOC) offers an overview of a document and shows the most important elements and their relations to each other. In this paper, we investigate what searchers think is important in automatic TOC generation. We ask searchers to indicate their preferences for element features (depth, length, relevance) in order to generate TOCs that help them complete information seeking tasks. We investigate what these preferences are, and what characteristics the TOCs generated by searchers' settings have. The results have implications for the design of intelligent TOC generation approaches for XML retrieval.

Web IR

Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task BIBAFull-Text 468-480
  Ben He; Iadh Ounis
Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained.
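The per-field normalisation being tuned can be sketched as follows, assuming the familiar logarithmic length-normalisation form with a separate hyper-parameter c per field; the field weights and c values below are illustrative, not tuned settings.

```python
# Per-field length normalisation: term frequency in each field is
# normalised by the field's length relative to the average field
# length, controlled by a per-field hyper-parameter c, then weighted
# and summed across fields.
from math import log2

def per_field_tfn(tf, field_len, avg_len, c):
    return tf * log2(1.0 + c * avg_len / field_len)

def field_score(field_stats, weights, c_params):
    # field_stats: field -> (tf, field_len, avg_field_len)
    return sum(weights[f] * per_field_tfn(tf, fl, al, c_params[f])
               for f, (tf, fl, al) in field_stats.items())

stats = {"body": (3, 400, 500), "title": (1, 6, 8), "anchor": (2, 20, 15)}
weights = {"body": 1.0, "title": 5.0, "anchor": 3.0}
c = {"body": 1.0, "title": 10.0, "anchor": 5.0}
score = field_score(stats, weights, c)
```

The paper's contribution is choosing each field's c automatically, from the correlation between field length and term frequency, instead of tuning on relevance assessments.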
Combining Evidence for Relevance Criteria: A Framework and Experiments in Web Retrieval BIBAKFull-Text 481-493
  Theodora Tsikrika; Mounia Lalmas
We present a framework that assesses relevance with respect to several relevance criteria, by combining the query-dependent and query-independent evidence indicating these criteria. This combination of evidence is modelled in a uniform way, irrespective of whether the evidence is associated with a single document or related documents. The framework is formally expressed within Dempster-Shafer theory. It is evaluated for web retrieval in the context of TREC's Topic Distillation task. Our results indicate that aggregating content-based evidence from the linked pages of a page is beneficial, and that the additional incorporation of their homepage evidence further improves the effectiveness.
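Dempster's rule of combination, which underlies the framework, can be sketched directly; the two mass functions below (content evidence and linked-page evidence over a two-element frame) are toy inputs, not the paper's estimated values.

```python
# Dempster's rule of combination over a frame of discernment; focal
# elements are frozensets, and mass assigned to conflicting (empty)
# intersections is redistributed by normalisation.

def combine(m1, m2):
    out, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                out[inter] = out.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    norm = 1.0 - conflict
    return {s: v / norm for s, v in out.items()}

R = frozenset({"rel"})
theta = frozenset({"rel", "nonrel"})      # total ignorance
m_content = {R: 0.6, theta: 0.4}          # evidence from page content
m_links = {R: 0.5, theta: 0.5}            # evidence from linked pages
m = combine(m_content, m_links)
```

Combining the two sources raises the belief committed to relevance (0.8) above either source alone, which is exactly the aggregation effect the paper exploits.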
Keywords: Dempster-Shafer theory; topic distillation; best entry point

Multimedia IR

Classifier Fusion for SVM-Based Multimedia Semantic Indexing BIBAFull-Text 494-504
  Stéphane Ayache; Georges Quénot; Jérôme Gensel
Concept indexing in multimedia libraries is very useful for users searching and browsing but it is a very challenging research problem as well. Combining several modalities, features or concepts is one of the key issues for bridging the gap between signal and semantics. In this paper, we present three fusion schemes inspired from the classical early and late fusion schemes. First, we present a kernel-based fusion scheme which takes advantage of the kernel basis of classifiers such as SVMs. Second, we integrate a new normalization process into the early fusion scheme. Third, we present a contextual late fusion scheme to merge classification scores of several concepts. We conducted experiments in the framework of the official TRECVID'06 evaluation campaign and we obtained significant improvements with the proposed fusion schemes relatively to usual fusion schemes.
Search of Spoken Documents Retrieves Well Recognized Transcripts BIBAFull-Text 505-516
  Mark Sanderson; Xiao Mang Shou
This paper presents a series of analyses and experiments on spoken document retrieval systems: search engines that retrieve transcripts produced by speech recognizers. Results show that transcripts that match queries well tend to be recognized more accurately than transcripts that match a query less well. This result was described in past literature; however, no study or explanation of the effect has been provided until now. This paper provides such an analysis showing a relationship between word error rate and query length. The paper expands on past research by increasing the number of recognition systems that are tested as well as showing the effect in an operational speech retrieval system. Potential future lines of enquiry are also described.

Short Papers

Natural Language Processing for Usage Based Indexing of Web Resources BIBAFull-Text 517-524
  Anne Boyer; Armelle Brun
The identification of reliable and interesting items on the Internet becomes more and more difficult and time consuming. This position paper describes our intended work in the framework of multimedia information retrieval by browsing techniques within web navigation. It relies on a usage-based indexing of resources: we ignore the nature, the content and the structure of resources. We describe a new approach taking advantage of the similarity between statistical modeling of language and document retrieval systems. A syntax of usage is computed that defines a Statistical Grammar of Usage (SGU). An SGU enables resource classification in support of a personalized navigation assistant tool. It relies both on collaborative filtering to compute virtual communities of users and on classical statistical language models. The resulting SGU is a community-dependent SGU.
Harnessing Trust in Social Search BIBAFull-Text 525-532
  Peter Briggs; Barry Smyth
The social Web emphasises the increased role of millions of users in the creation of a new type of online content, often expressed in the form of opinions or judgements. This has led to some novel approaches to information access that take advantage of user opinions and activities as a way to guide users as they browse or search for information. We describe a social search technique that harnesses the experiences of a network of searchers to generate result recommendations that can complement the search results that are returned by some standard Web search engine.
How to Compare Bilingual to Monolingual Cross-Language Information Retrieval BIBAFull-Text 533-540
  Franco Crivellari; Giorgio Maria Di Nunzio; Nicola Ferro
The study of cross-lingual Information Retrieval Systems (IRSs) and a deep analysis of system performances should provide guidelines, hints, and directions to drive the design and development of the next generation MultiLingual Information Access (MLIA) systems. In addition, effective tools for interpreting and comparing the experimental results should be made easily available to the research community. To this end, we propose a twofold methodology for the evaluation of Cross Language Information Retrieval (CLIR) systems: statistical analyses to provide MLIA researchers with quantitative and more sophisticated analysis techniques; and graphical tools to allow for a more qualitative comparison and an easier presentation of the results. We provide concrete examples about how the proposed methodology can be applied by studying the monolingual and bilingual tasks of the Cross-Language Evaluation Forum (CLEF) 2005 and 2006 campaigns.
Multilingual Text Classification Using Ontologies BIBAFull-Text 541-548
  Gerard de Melo; Stefan Siersdorfer
In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.
Using Visual-Textual Mutual Information and Entropy for Inter-modal Document Indexing BIBAKFull-Text 549-556
  Jean Martinet; Shin'ichi Satoh
This paper presents a contribution in the domain of automatic visual document indexing based on inter-modal analysis, in the form of a statistical indexing model. The approach is based on inter-modal document analysis, which consists in modeling and learning some relationships between several modalities from a data set of annotated documents in order to extract semantics. When one of the modalities is textual, the learned associations can be used to predict a textual index for visual data from a new document (image or video). More specifically, the presented approach relies on a learning process in which associations between visual and textual information are characterized by the mutual information of the modalities. Besides, the model uses the information entropy of the distribution of the visual modality against the textual modality as a second source to select relevant indexing terms. We have implemented the proposed information theoretic model, and the results of experiments assessing its performance on two collections (image and video) show that information theory is an interesting framework to automatically annotate documents.
Keywords: Indexing model; mutual information; entropy; inter-modal analysis
A Study of Global Inference Algorithms in Multi-document Summarization BIBAFull-Text 557-564
  Ryan McDonald
In this work we study the theoretical and empirical properties of various global inference algorithms for multi-document summarization. We start by defining a general framework for inference in summarization. We then present three algorithms: The first is a greedy approximate method, the second a dynamic programming approach based on solutions to the knapsack problem, and the third is an exact algorithm that uses an Integer Linear Programming formulation of the problem. We empirically evaluate all three algorithms and show that, relative to the exact solution, the dynamic programming algorithm provides near optimal results with preferable scaling properties.
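The knapsack-based dynamic program can be sketched as follows, assuming each sentence carries a precomputed relevance score and length, and ignoring the redundancy terms of the full inference framework.

```python
# 0/1-knapsack sentence selection: maximise total relevance subject to
# a summary length budget, via dynamic programming.

def knapsack_summary(sentences, budget):
    # sentences: list of (relevance, length)
    n = len(sentences)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (rel, length) in enumerate(sentences, 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # skip sentence i-1
            if length <= b:
                cand = best[i - 1][b - length] + rel
                if cand > best[i][b]:
                    best[i][b] = cand            # take sentence i-1
    # recover the selected sentence indices
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= sentences[i - 1][1]
    return sorted(chosen), best[n][budget]

sents = [(5.0, 10), (4.0, 6), (3.0, 5), (2.0, 3)]
picked, value = knapsack_summary(sents, 12)
```

With a budget of 12, the DP prefers the two medium sentences (total relevance 7.0) over the single highest-scoring one (5.0), which a greedy pick by relevance would take first.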
Document Representation Using Global Association Distance Model BIBAFull-Text 565-572
  José E. Medina-Pagola; Ansel Y. Rodríguez; Abdel Hechavarría; José Hernández Palancar
Text information processing depends critically on the proper representation of documents. Traditional models, like the vector space model, have significant limitations because they do not consider semantic relations amongst terms. In this paper we analyze a document representation using the association graph scheme and present a new approach called Global Association Distance Model (GADM). Finally, we compare GADM using a K-NN classifier with the classical vector space model and the association graph model.
Sentence Level Sentiment Analysis in the Presence of Conjuncts Using Linguistic Analysis BIBAKFull-Text 573-580
  Arun Meena; T. V. Prabhakar
In this paper we present an approach to extract sentiments associated with a phrase or sentence. Sentiment analysis has been attempted mostly for documents typically a review or a news item. Conjunctions have a substantial impact on the overall sentiment of a sentence, so here we present how atomic sentiments of individual phrases combine together in the presence of conjuncts to decide the overall sentiment of a sentence. We used word dependencies and dependency trees to analyze the sentence constructs and were able to get results close to 80%. We have also analyzed the effect of WordNet on the accuracy of the results over General Inquirer.
Keywords: Sentiment analysis; favorability analysis; text mining; information extraction; semantic orientation; text classification
PageRank: When Order Changes BIBAFull-Text 581-588
  Massimo Melucci; Luca Pretto
As PageRank is a ranking algorithm, it is of prime interest to study the order induced by its values on webpages. In this paper a thorough mathematical analysis of PageRank-induced order changes when the damping factor varies is provided. Conditions that do not allow variations in the order are studied, and the mechanisms that make the order change are mathematically investigated. Moreover the influence on the order of a truncation in the actual computation of PageRank through a power series is analysed. Experiments carried out on a large Web digraph to integrate the mathematical analysis show that PageRank -- while working on a real digraph -- tends to hinder variations in the order of large rankings, presenting a high stability in its induced order both in the face of large variations of the damping factor value and in the face of truncations in its computation.
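For experimenting with order changes, a basic power-iteration PageRank suffices; on the toy digraph below the induced order happens to coincide at two damping factors, illustrating the stability the paper analyses (the graph and factor values are invented).

```python
# Power-iteration PageRank; compare the induced ranking at two damping
# factors on a small digraph.

def pagerank(links, n, d=0.85, iters=100):
    pr = [1.0 / n] * n
    out = [len(links.get(i, [])) for i in range(n)]
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for i in range(n):
            if out[i]:
                share = d * pr[i] / out[i]
                for j in links[i]:
                    nxt[j] += share
            else:  # dangling node: spread its mass uniformly
                for j in range(n):
                    nxt[j] += d * pr[i] / n
        pr = nxt
    return pr

def order(pr):  # node indices sorted by decreasing PageRank
    return sorted(range(len(pr)), key=pr.__getitem__, reverse=True)

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
o85 = order(pagerank(links, 4, 0.85))
o50 = order(pagerank(links, 4, 0.50))
```

Although the individual PageRank values change substantially between d = 0.85 and d = 0.5, the induced order on this graph does not, which is the kind of behaviour the paper characterises mathematically.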
Model Tree Learning for Query Term Weighting in Question Answering BIBAFull-Text 589-596
  Christof Monz
Question answering systems rely on retrieval components to identify documents that contain an answer to a user's question. The formulation of queries that are used for retrieving those documents has a strong impact on the effectiveness of the retrieval component. Here, we focus on predicting the importance of terms from the original question. We use model tree machine learning techniques in order to assign weights to query terms according to their usefulness for identifying documents that contain an answer. Incorporating the learned weights into a state-of-the-art retrieval system results in statistically significant improvements.
Examining Repetition in User Search Behavior BIBAFull-Text 597-604
  Mark Sanderson; Susan Dumais
This paper describes analyses of the repeated use of search engines. It is shown that users commonly re-issue queries, either to examine search results deeply or simply to query again, often days or weeks later. Hourly and weekly periodicities in behavior are observed for both queries and clicks. Navigational queries were found to be repeated differently from others.
Popularity Weighted Ranking for Academic Digital Libraries BIBAKFull-Text 605-612
  Yang Sun; C. Lee Giles
We propose a popularity weighted ranking algorithm for academic digital libraries that uses the popularity factor of a publication venue, overcoming the limitations of impact factors. We compare our method with the naive PageRank, citation counts and HITS algorithm, three popular measures currently used to rank papers beyond lexical similarity. The ranking results are evaluated by the discounted cumulative gain (DCG) method using four human evaluators. We show that our proposed ranking algorithm improves the DCG performance by 8.5% on average compared to naive PageRank, 16.3% compared to citation count and 23.2% compared to HITS. The algorithm is also evaluated with click-through data from the CiteSeer usage log.
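The DCG evaluation measure used here can be computed as follows, using the rel_i / log2(i + 1) discount convention (papers vary in log base and in whether the first position is discounted).

```python
# Discounted cumulative gain: graded relevance judgements are
# discounted by the log of the rank position.
from math import log2

def dcg(gains):
    return sum(g / log2(i + 1) for i, g in enumerate(gains, 1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

A ranking that places high-relevance items earlier gets a higher DCG; nDCG divides by the ideal ordering's DCG to normalise into [0, 1].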
Keywords: weighted ranking; citation analysis; digital library
Naming Functions for the Vector Space Model BIBAFull-Text 613-620
  Yannis Tzitzikas; Yannis Theoharis
The Vector Space Model (VSM) is probably the most widely used model for retrieving information from text collections (and recently from other kinds of corpora). Assuming this model, we study the problem of finding the best query that "names" (or describes) a given (unordered or ordered) set of objects. We formulate several variations of this problem and we provide methods and algorithms for solving them.
Effective Use of Semantic Structure in XML Retrieval BIBAFull-Text 621-628
  Roelof van Zwol; Tim van Loosbroek
The objective of XML retrieval is to return relevant XML document fragments that answer a given user information need, by exploiting the document structure. The focus in this article is on automatically deriving and using semantic XML structure to enhance the retrieval performance of XML retrieval systems. Based on a naive approach for named entity detection, we discuss how the structure of an XML document can be enriched using the Reuters-21578 news collection.
   Based on a retrieval performance experiment, we study the effect of the additional semantic structure on the retrieval performance of our XSee search engine for XML documents. The experiment provides some initial evidence that an XML retrieval system significantly benefits from having meaningful XML structure.
Searching Documents Based on Relevance and Type BIBAFull-Text 629-636
  Jun Xu; Yunbo Cao; Hang Li; Nick Craswell; Yalou Huang
This paper extends previous work on document retrieval and document type classification, addressing the problem of 'typed search'. Specifically, given a query and a designated document type, the search system retrieves and ranks documents not only based on the relevance to the query, but also based on the likelihood of being the designated document type. The paper formalizes the problem in a general framework consisting of 'relevance model' and 'type model'. The relevance model indicates whether or not a document is relevant to a query. The type model indicates whether or not a document belongs to the designated document type. We consider three methods for combining the models: linear combination of scores, thresholding on the type score, and a hybrid of the previous two methods. We take course page search and instruction document search as examples and have conducted a series of experiments. Experimental results show our proposed approaches can significantly outperform the baseline methods.
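The three combination methods can be sketched directly; the interpolation weight alpha, the threshold tau, and the document scores below are illustrative values, not the paper's settings.

```python
# Combining a relevance score and a type score: linear interpolation,
# hard thresholding on the type score, and a hybrid of the two.

def linear(rel, typ, alpha=0.7):
    return alpha * rel + (1 - alpha) * typ

def threshold(rel, typ, tau=0.5):
    return rel if typ >= tau else 0.0

def hybrid(rel, typ, alpha=0.7, tau=0.5):
    return linear(rel, typ, alpha) if typ >= tau else 0.0

# doc -> (relevance score, type score) for the designated type
docs = {"course_page": (0.6, 0.9), "blog_post": (0.8, 0.2)}
ranked = sorted(docs, key=lambda d: hybrid(*docs[d]), reverse=True)
```

Under the hybrid method, the more relevant blog post is filtered out because it is unlikely to be of the designated type, so the course page ranks first.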
Investigation of the Effectiveness of Cross-Media Indexing BIBAFull-Text 637-644
  Murat Yakici; Fabio Crestani
Cross-media analysis and indexing leverage the individual potential of each indexing information provided by different modalities, such as speech, text and image, to improve the effectiveness of information retrieval and filtering in later stages. The process does not only constitute generating a merged representation of the digital content, such as MPEG-7, but also enriching it in order to help remedy the imprecision and noise introduced during the low-level analysis phases. It has been hypothesized that a system that combines different media descriptions of the same multi-modal audio-visual segment in a semantic space will perform better at retrieval and filtering time. In order to validate this hypothesis, we have developed a cross-media indexing system which utilises the Multiple Evidence approach by establishing links among the modality specific textual descriptions in order to depict topical similarity.
Improve Ranking by Using Image Information BIBAKFull-Text 645-652
  Qing Yu; Shuming Shi; Zhiwei Li; Ji-Rong Wen; Wei-Ying Ma
This paper explores the feasibility of including image information embedded in Web pages in relevance computation to improve search performance. In determining the ranking of Web pages against a given query, most (if not all) modern Web search engines consider two kinds of factors: text information (including title, URL, body text, anchor text, etc) and static ranking (e.g. PageRank [1]). Although images have been widely used to help represent Web pages and carry valuable information, little work has been done to take advantage of them in computing the relevance score of a Web page given a query. We propose, in this paper, a framework to contain image information in ranking functions. Preliminary experimental results show that, when image information is used properly, ranking results can be improved.
Keywords: Web search; image information; image importance; relevance
N-Step PageRank for Web Search BIBAFull-Text 653-660
  Li Zhang; Tao Qin; Tie-Yan Liu; Ying Bao; Hang Li
PageRank has been widely used to measure the importance of web pages based on their interconnections in the web graph. Mathematically speaking, PageRank can be explained using a Markov random walk model, in which only the direct outlinks of a page contribute to its transition probability. In this paper, we propose improving the PageRank algorithm by looking N-step ahead when constructing the transition probability matrix. The motivation comes from the similar "looking N-step ahead" strategy that is successfully used in computer chess. Specifically, we assume that if the random surfer knows the N-step outlinks of each web page, he/she can make a better decision on choosing which page to navigate for the next time. It is clear that the classical PageRank algorithm is a special case of our proposed N-step PageRank method. Experimental results on the dataset of TREC Web track show that our proposed algorithm can boost the search accuracy of classical PageRank by more than 15% in terms of mean average precision.
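A direct sketch of the idea: raise the row-stochastic one-step transition matrix to the N-th power, then run ordinary PageRank on the result; N = 1 recovers the classical algorithm. The 3-page graph is a toy example, not data from the paper.

```python
# N-step PageRank: replace the one-step transition matrix P with P^N
# before the PageRank iteration, so a page's score reflects where a
# random surfer could be after N clicks.

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step_pagerank(P, n_steps, d=0.85, iters=100):
    n = len(P)
    PN = P
    for _ in range(n_steps - 1):
        PN = mat_mul(PN, P)
    pr = [1.0 / n] * n
    for _ in range(iters):
        pr = [(1 - d) / n + d * sum(pr[i] * PN[i][j] for i in range(n))
              for j in range(n)]
    return pr

# Row-stochastic one-step transition matrix for a 3-page graph.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
pr1 = n_step_pagerank(P, 1)   # classical PageRank
pr2 = n_step_pagerank(P, 2)   # two-step lookahead
```

Even on this tiny graph the two-step scores differ from the classical ones, showing how the lookahead changes the importance estimates.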
Authorship Attribution Via Combination of Evidence BIBAFull-Text 661-669
  Ying Zhao; Phil Vines
Authorship attribution is the process of determining who wrote a particular document. We have found that different systems work well for particular sets of authors but not for others. In this paper, we propose three authorship attribution systems based on different ways of combining existing methodologies. All three systems are more effective than state-of-the-art methods.
Cross-Document Entity Tracking BIBAFull-Text 670-673
  Roxana Angheluta; Marie-Francine Moens
The main focus of the current work is to analyze features useful for linking and disambiguating person entities across documents. The more general problem of linking and disambiguating any kind of entity is known as entity detection and tracking (EDT) or noun phrase coreference resolution. EDT has applications in many important areas of information retrieval: clustering search engine results when looking for a particular person; answering questions such as "Who was Woodward's source in the Plame scandal?" with "senior administration official" or "Richard Armitage"; and fusing information from multiple documents. In the current work, person entities are limited to names and nominal entities. We emphasize the linguistic aspect of cross-document EDT, testing novel features such as the syntactic and semantic characteristics of the entities. The most important class of new features is contextual features, at varying levels of detail: events, related named entities, and local context. The validity of the features is evaluated on a corpus annotated for cross-document coreference resolution of person names and nominals, and also on a corpus annotated only for names.
Enterprise People and Skill Discovery Using Tolerant Retrieval and Visualization BIBAFull-Text 674-677
  Jan Brunnert; Omar Alonso; Dirk Riehle
Understanding an enterprise's workforce and skill set can be seen as key to understanding an organization's capabilities. In today's large organizations it has become increasingly difficult to find people who have specific skills or expertise, or to explore and understand the overall picture of an organization's portfolio of topic expertise. This article presents a case study of analyzing and visualizing such expertise with the goal of enabling human users to assess and quickly find people with a desired skill set. Our approach uses techniques such as n-grams, clustering, and visualization to improve the search experience for people and skills.
Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection BIBAKFull-Text 678-681
  Marta Capdevila Dalmau; Oscar W. Márquez Flórez
Distributional Clustering has been shown to be an effective and powerful approach to supervised term extraction aimed at reducing the dimensionality of the original indexing space for Automatic Text Categorization [2]. In a recent paper [1] we introduced a new Signal Processing approach to Distributional Clustering that reached categorization results on the 20 Newsgroups dataset similar to those obtained by other information-theoretic approaches [3][4][5]. Here we re-validate our method by showing that the 90-category Reuters-21578 benchmark collection can be indexed with only 50 clusters at a minimal loss of categorization accuracy (around 2% with a Naïve Bayes categorizer).
Keywords: Automatic text categorization; Distributional clustering; Signal processing; Variance; Correlation coefficient
Overall Comparison at the Standard Levels of Recall of Multiple Retrieval Methods with the Friedman Test BIBAFull-Text 682-685
  José M. Casanova; Manuel A. Presedo Quindimil; Álvaro Barreiro
We propose a new application of the Friedman statistical significance test to compare multiple retrieval methods. After measuring average precision at the eleven standard levels of recall, our application of the Friedman test provides a global comparison of the methods. In some experiments, this test provides additional useful information for deciding whether the methods differ.
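The comparison described in this abstract can be illustrated with SciPy's implementation of the Friedman test. The three methods and their precision values below are invented for the example; in the paper's setup, each of the eleven standard recall levels acts as a block and each retrieval method as a treatment.

```python
from scipy.stats import friedmanchisquare

# Hypothetical average precision of three retrieval methods, measured at the
# eleven standard recall levels (0.0, 0.1, ..., 1.0); values are illustrative.
method_a = [0.82, 0.75, 0.69, 0.63, 0.58, 0.52, 0.46, 0.40, 0.33, 0.25, 0.18]
method_b = [0.80, 0.72, 0.66, 0.60, 0.55, 0.49, 0.43, 0.37, 0.30, 0.22, 0.15]
method_c = [0.85, 0.79, 0.73, 0.66, 0.60, 0.54, 0.48, 0.41, 0.34, 0.27, 0.20]

# The test ranks the methods within each recall level, then asks whether the
# rank sums differ more than chance would allow.
stat, p_value = friedmanchisquare(method_a, method_b, method_c)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value indicates a global difference among the methods; pairwise post-hoc tests would then be needed to say which methods differ.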
Building a Desktop Search Test-Bed BIBAFull-Text 686-690
  Sergey Chernov; Pavel Serdyukov; Paul-Alexandru Chirita; Gianluca Demartini; Wolfgang Nejdl
In recent years, several top-quality papers have utilized temporary Desktop data and/or browsing activity logs for experimental evaluation. Building a common testbed for the Personal Information Management community has thus become an indispensable task. In this paper we present a possible dataset design and discuss the means to create it.
Hierarchical Browsing of Video Key Frames BIBAFull-Text 691-694
  Gianluigi Ciocca; Raimondo Schettini
We propose an innovative, general-purpose method for the selection and hierarchical representation of key frames of a video sequence for video summarization. It creates a hierarchical storyboard that the user can easily browse. The method consists of three steps. The first removes meaningless key frames, using supervised classification performed by a neural network on the basis of pictorial features and a visual attention model. The second groups the key frames into clusters, using both low-level and high-level features, to allow a multilevel summary. The third identifies the default summary level shown to the users: starting from this set of key frames, users can then browse the video content at different levels of detail.
Active Learning with History-Based Query Selection for Text Categorisation BIBAFull-Text 695-698
  Michael Davy; Saturnino Luz
Automated text categorisation systems learn a generalised hypothesis from large numbers of labelled examples. However, in many domains labelled data is scarce and expensive to obtain. Active learning is a technique that has been shown to reduce the amount of training data required to produce an accurate hypothesis. This paper proposes a novel method of incorporating predictions made in previous iterations of active learning into the selection of informative unlabelled examples. We show empirically how this method can lead to increased classification accuracy compared to alternative techniques.
Fighting Link Spam with a Two-Stage Ranking Strategy BIBAFull-Text 699-702
  Guang-Gang Geng; Chun-Heng Wang; Qiu-Dan Li; Yuan-Ping Zhu
Most existing techniques for combating web spam focus on spam detection itself, separate from the ranking process. In this paper, we propose a two-stage ranking strategy that makes good use of hyperlink information among Websites and of each Website's internal structure. The proposed method incorporates web spam detection into the ranking process and penalizes the ranking scores of potential spam pages instead of removing them outright. Preliminary experimental results show that our method is feasible and effective.
Improving Naive Bayes Text Classifier Using Smoothing Methods BIBAFull-Text 703-707
  Feng He; Xiaoqing Ding
The performance of the naive Bayes text classifier is greatly influenced by parameter estimation, while the large vocabulary and the scarcity of labeled training data make parameter estimation difficult. In this paper, several smoothing methods are introduced to estimate the parameters of the naive Bayes text classifier. The proposed approaches achieve better and more stable performance than Laplace smoothing.
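The abstract does not name the smoothing methods it introduces; as an illustration of the general idea, the sketch below contrasts the Laplace baseline with Jelinek-Mercer interpolation (a common language-modeling smoothing technique, assumed here purely for the example) when estimating P(word | class) in a multinomial naive Bayes classifier.

```python
import math
from collections import Counter

def smoothed_log_prob(word, class_counts, collection_counts, vocab_size,
                      method="laplace", lam=0.7):
    """Smoothed log P(word | class) for a multinomial naive Bayes classifier.

    class_counts: word-frequency Counter over the class's training documents
    collection_counts: word-frequency Counter over the whole collection
    """
    class_total = sum(class_counts.values())
    if method == "laplace":
        # Add-one smoothing: every vocabulary word gets one pseudo-count.
        p = (class_counts[word] + 1) / (class_total + vocab_size)
    elif method == "jelinek-mercer":
        # Interpolate the class model with the collection model, so unseen
        # words borrow probability mass from the whole collection.
        coll_total = sum(collection_counts.values())
        p_class = class_counts[word] / class_total if class_total else 0.0
        p = lam * p_class + (1 - lam) * (collection_counts[word] / coll_total)
    else:
        raise ValueError(f"unknown smoothing method: {method}")
    return math.log(p)

# Tiny illustrative counts.
class_counts = Counter({"good": 3, "movie": 2})
collection_counts = Counter({"good": 5, "movie": 4, "bad": 6})
lap = smoothed_log_prob("bad", class_counts, collection_counts, vocab_size=3)
jm = smoothed_log_prob("bad", class_counts, collection_counts, vocab_size=3,
                       method="jelinek-mercer")
```

Both methods assign non-zero probability to the unseen word "bad", but they distribute the reserved mass differently, which is exactly the kind of difference the paper's experiments measure.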
Term Selection and Query Operations for Video Retrieval BIBAFull-Text 708-711
  Bouke Huurnink; Maarten de Rijke
We investigate the influence of term selection and query operations on the text retrieval component of video search. Our main finding is that the greatest gain is to be found in the combination of character n-grams, stemmed text, and proximity terms.
An Effective Threshold-Based Neighbor Selection in Collaborative Filtering BIBAFull-Text 712-715
  Taek-Hun Kim; Sung-Bong Yang
In this paper we present a recommender system using an effective threshold-based neighbor selection method in collaborative filtering. The proposed method uses substitute neighbors for test customers who may have unusual preferences or who are first raters. The experimental results show that recommender systems using the proposed method find the proper neighbors and deliver good prediction quality.
Combining Multiple Sources of Evidence in XML Multimedia Documents: An Inference Network Incorporating Element Language Models BIBAFull-Text 716-719
  Zhigang Kong; Mounia Lalmas
This work makes use of the semantic and logical structure of XML documents, and their combination, to represent and retrieve XML multimedia content. We develop a Bayesian network incorporating element language models for the retrieval of mixed text and image content. In addition, an element-based collection language model is used to smooth the element language models. The proposed approach was successfully evaluated on the INEX 2005 multimedia data set.
Language Model Based Query Classification BIBAFull-Text 720-723
  Andreas Merkel; Dietrich Klakow
In this paper we propose a new way of using language models in query classification for question answering systems. We used a Bayes classifier as classification paradigm. Experimental results show that our approach outperforms current classification methods like Naive Bayes and SVM.
Integration of Text and Audio Features for Genre Classification in Music Information Retrieval BIBAFull-Text 724-727
  Robert Neumayer; Andreas Rauber
Multimedia content can be described in versatile ways, as its essence is not limited to one view. For music data, these multiple views could be a song's audio features as well as its lyrics. Both modalities have their advantages: text may be easier to search and may cover more of the 'content semantics' of a song, while omitting other types of semantic categorisation. (Psycho)acoustic feature sets, on the other hand, provide the means to identify tracks that 'sound similar' while offering less support for other kinds of semantic categorisation. These discerning characteristics of different feature sets meet users' differing information needs. We explain the nature of text and audio feature sets that describe the same audio tracks, propose the use of textual data on top of low-level audio features for music genre classification, and show the impact of different combinations of audio features and textual features based on content words.
Retrieval Method for Video Content in Different Format Based on Spatiotemporal Features BIBAFull-Text 728-731
  Xuefeng Pan; Jintao Li; Yongdong Zhang; Sheng Tang; Juan Cao
In this paper a robust video content retrieval method based on spatiotemporal features is proposed. To date, most video retrieval methods use characteristics of video key frames. Such frame-based methods are not robust enough across different video formats. With our method, the temporal variation of visual information is represented using a spatiotemporal slice, and the DCT is then used to extract features from the slice. With these features, a robust video content retrieval algorithm is developed. The experimental results show that the proposed feature is robust across video formats.
Combination of Document Priors in Web Information Retrieval BIBAFull-Text 732-736
  Jie Peng; Iadh Ounis
Query independent features (also called document priors), such as the number of incoming links to a document, its PageRank, or the length of its associated URL, have been explored to boost the retrieval effectiveness of Web Information Retrieval (IR) systems. The combination of such query independent features could further enhance retrieval performance. However, most current combination approaches are based on heuristics, which ignore the possible dependence between the document priors. In this paper, we present a novel and robust method for combining document priors in a principled way. We use a conditional probability rule derived from Kolmogorov's axioms. In particular, we investigate the retrieval performance attainable by our prior-combination method, in comparison to the use of single priors and a heuristic prior-combination method. Furthermore, we examine when and how document priors should be combined.
Enhancing Expert Search Through Query Modeling BIBAFull-Text 737-740
  Pavel Serdyukov; Sergey Chernov; Wolfgang Nejdl
Expert finding is a very common task among enterprise search activities, yet its typical retrieval performance falls far short of Web search quality. Query modeling helps to improve traditional document retrieval, so we propose to apply it in this new setting. We adopt a general language modeling framework for expert finding and show how expert language models can be used for advanced query modeling. A preliminary experimental evaluation on the TREC Enterprise Track 2006 collection shows that our method improves retrieval precision on the expert finding task.
A Hierarchical Consensus Architecture for Robust Document Clustering BIBAFull-Text 741-744
  Xavier Sevillano; Germán Cobo; Francesc Alías; Joan Claudi Socoró
A major problem encountered by text clustering practitioners is the difficulty of determining a priori which is the optimal text representation and clustering technique for a given clustering problem. As a step towards building robust document partitioning systems, we present a strategy based on a hierarchical consensus clustering architecture that operates on a wide diversity of document representations and partitions. The conducted experiments show that the proposed method is capable of yielding a consensus clustering that is comparable to the best individual clustering available even in the presence of a large number of poor individual labelings, outperforming classic non-hierarchical consensus approaches in terms of performance and computational cost.
Summarisation and Novelty: An Experimental Investigation BIBAFull-Text 745-748
  Simon Sweeney; Fabio Crestani; David E. Losada
The continued development of mobile device technologies, their supporting infrastructures and associated services is important to meet the anytime, anywhere information access demands of today's users. The growing need to deliver information on request, in a form that can be readily and easily digested on the move, continues to be a challenge.
A Layered Approach to Context-Dependent User Modelling BIBAKFull-Text 749-752
  Elena Vildjiounaite; Sanna Kallio
This work presents a method for the explicit acquisition of context-dependent user preferences (preferences that change depending on the user's situation, e.g., a higher interest in outdoor activities when it is sunny than when it is raining) for a Smart Home: an intelligent environment that recognises the contexts of its inhabitants (such as the presence of people, activities, events, weather, etc.) via home and mobile devices and provides personalized proactive support to its users. Since the set of personally important situations that affect user preferences is user-dependent, and since many situations can be described only in fuzzy terms, we provide users with an easy way to develop a personal context ontology and to map it fuzzily onto the common ontology via a GUI. Backward mapping, by estimating the probability of occurrence of a user-defined situation, allows retrieval of preferences from all components of the user model.
Keywords: User Model; Context Awareness; Smart Home
A Bayesian Approach for Learning Document Type Relevance BIBAFull-Text 753-756
  Peter C. K. Yeung; Stefan Büttcher; Charles L. A. Clarke; Maheedhar Kolla
Retrieval accuracy can be improved by considering which document types should be filtered out and which should be ranked higher in the result list. Hence, document type can be used as a key factor in building a re-ranking retrieval model. We take a simple approach to considering document type in the retrieval process: we adapt the BM25 scoring function to weight term frequency based on document type, and take a Bayesian approach to estimate the appropriate weight for each type. Experimental results show that our approach improves search precision by as much as 19%.
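The BM25 adaptation described in this abstract can be sketched as follows. The function below is a standard BM25 term score with the term frequency scaled by a type-dependent weight; the parameter values and the simple multiplicative scaling are illustrative assumptions, and the paper's Bayesian estimation of each type's weight is not reproduced here.

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_len,
                    type_weight=1.0, k1=1.2, b=0.75):
    """BM25 contribution of one query term, with term frequency scaled by a
    document-type weight before the usual saturation and length normalization."""
    tf = type_weight * tf  # weight raw term frequency by the document's type
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + doc_len / avg_len))

# Same term and document statistics, two hypothetical document types:
# a favored type (weight 1.5) vs. a penalized type (weight 0.5).
favored = bm25_term_score(tf=3, df=10, num_docs=1000, doc_len=120,
                          avg_len=100, type_weight=1.5)
penalized = bm25_term_score(tf=3, df=10, num_docs=1000, doc_len=120,
                            avg_len=100, type_weight=0.5)
```

Because the BM25 saturation function is monotone in term frequency, a larger type weight always raises the score (for a positive-IDF term), which gives the re-ranking effect the abstract describes.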