HCI Bibliography Home | HCI Journals | About TOIS | Journal Info | TOIS Journal Volumes | Detailed Records | RefWorks | EndNote | Hide Abstracts
TOIS Tables of Contents: 1415161718192021222324252627282930313233

ACM Transactions on Information Systems 24

Editors:Gary Marchionini
Standard No:ISSN 1046-8188; HF S548.125 A33
Links:Table of Contents
  1. TOIS 2006 Volume 24 Issue 1
  2. TOIS 2006 Volume 24 Issue 2
  3. TOIS 2006 Volume 24 Issue 3
  4. TOIS 2006 Volume 24 Issue 4

TOIS 2006 Volume 24 Issue 1

Detection of video sequences using compact signatures BIBAFull-Text 1-50
  Justin Zobel; Timothy C. Hoad
Digital representations are widely used for audiovisual content, enabling the creation of large online repositories of video, allowing access such as video on demand. However, the ease of copying and distribution of digital video makes piracy a growing concern for content owners. We investigate methods for identifying coderivative video content -- that is, video clips that are derived from the same original source. By using dynamic programming to identify regions of similarity in video signatures, it is possible to efficiently and accurately identify coderivatives, even when these regions constitute only a small section of the clip being searched. We propose four new methods for producing compact video signatures, based on the way in which the video changes over time. The intuition is that such properties are likely to be preserved even when the video is badly degraded. We demonstrate that these signatures are insensitive to dramatic changes in video bitrate and resolution, two parameters that are often altered when reencoding. In the presence of mild degradations, our methods can accurately identify copies of clips that are as short as 5 s within a dataset 140 min long. These methods are much faster than previously proposed techniques; using a more compact signature, this query can be completed in a few milliseconds.
Boosting the performance of Web search engines: Caching and prefetching query results by exploiting historical usage data BIBAFull-Text 51-78
  Tiziano Fagni; Raffaele Perego; Fabrizio Silvestri; Salvatore Orlando
This article discusses efficiency and effectiveness issues in caching the results of queries submitted to a Web search engine (WSE). We propose SDC (Static Dynamic Cache), a new caching strategy aimed to efficiently exploit the temporal and spatial locality present in the stream of processed queries. SDC extracts from historical usage data the results of the most frequently submitted queries and stores them in a static, read-only portion of the cache. The remaining entries of the cache are dynamically managed according to a given replacement policy and are used for those queries that cannot be satisfied by the static portion. Moreover, we improve the hit ratio of SDC by using an adaptive prefetching strategy, which anticipates future requests by introducing a limited overhead over the back-end WSE. We experimentally demonstrate the superiority of SDC over purely static and dynamic policies by measuring the hit ratio achieved on three large query logs by varying the cache parameters and the replacement policy used for managing the dynamic part of the cache. Finally, we deploy and measure the throughput achieved by a concurrent version of our caching system. Our tests show how the SDC cache can be efficiently exploited by many threads that concurrently serve the queries of different users.
A space-partitioning-based indexing method for multidimensional non-ordered discrete data spaces BIBAFull-Text 79-110
  Gang Qian; Qiang Zhu; Qiang Xue; Sakti Pramanik
There is an increasing demand for similarity searches in a multidimensional non-ordered discrete data space (NDDS) from application areas such as bioinformatics and data mining. The non-ordered and discrete nature of an NDDS raises new challenges for developing efficient indexing methods for similarity searches. In this article, we propose a new indexing technique, called the NSP-tree, to support efficient similarity searches in an NDDS. As we know, overlap causes a performance degradation for indexing methods (e.g., the R-tree) for a continuous data space. In an NDDS, this problem is even worse due to the limited number of elements available on each dimension of an NDDS. The key idea of the NSP-tree is to use a novel discrete space-partitioning (SP) scheme to ensure no overlap at each level in the tree. A number of heuristics and strategies are incorporated into the tree construction algorithms to deal with the challenges for developing an SP-based index tree for an NDDS. Our experiments demonstrate that the NSP-tree is quite promising in supporting efficient similarity searches in NDDSs. We have compared the NSP-tree with the ND-tree, a data-partitioning-based indexing technique for NDDSs that was proposed recently, and the linear scan using different NDDSs. It was found that the search performance of the NSP-tree was better than those of both methods.
Summary in context: Searching versus browsing BIBAFull-Text 111-141
  Daniel M. McDonald; Hsinchun Chen
The use of text summaries in information-seeking research has focused on query-based summaries. Extracting content that resembles the query alone, however, ignores the greater context of the document. Such context may be central to the purpose and meaning of the document. We developed a generic, a query-based, and a hybrid summarizer, each with differing amounts of document context. The generic summarizer used a blend of discourse information and information obtained through traditional surface-level analysis. The query-based summarizer used only query-term information, and the hybrid summarizer used some discourse information along with query-term information. The validity of the generic summarizer was shown through an intrinsic evaluation using a well-established corpus of human-generated summaries. All three summarizers were then compared in an information-seeking experiment involving 297 subjects. Results from the information-seeking experiment showed that the generic summaries outperformed all others in the browse tasks, while the query-based and hybrid summaries outperformed the generic summary in the search tasks. Thus, the document context of generic summaries helped users browse, while such context was not helpful in search tasks. Such results are interesting given that generic summaries have not been studied in search tasks and the that majority of Internet search engines rely solely on query-based summaries.

TOIS 2006 Volume 24 Issue 2

User evaluation of Fischlar-News: An automatic broadcast news delivery system BIBAFull-Text 145-189
  Hyowon Lee; Alan F. Smeaton; Noel E. O'Connor; Barry Smyth
Technological developments in content-based analysis of digital video information are undergoing much progress, with ideas for fully automatic systems now being proposed and demonstrated. Yet because we do not yet have robust operational video retrieval systems that can be deployed and used, the usual HCI practise of conducting a usage study and an informed iterative system design is thus not possible. Fischlar-News is one of the first automatic, content-based broadcast news analysis and archival systems that process broadcast news video so that users can search, browse, and play it in an easy-to-use manner with a conventional web browser. The system incorporates a number of state-of-the-art research components, some of which are not yet considered mature technology, yet it has been built to be robust enough to be deployed to users who are interested in access to daily news throughout a university campus. In this article we report and discuss a user-evaluation study conducted with 16 users, each of whom utilized the system freely for a one month period. Results from a detailed qualitative analysis are presented, looking at collected questionnaires, incident diaries, and interaction-log data. The findings suggest that our users employed the system in conjunction with their other news update methods, such as watching TV news at home and browsing online news websites at their workplace, their major concerns being up-to-dateness and coverage of the news content. They tried to accommodate the system to fit their established web browsing habits, and they found local news content and the ability to play self-contained news stories on their desktop as major values of the system. Our study also resulted in a detailed wishlist of new features which will help in the further development of both our and others' systems.
A maximal figure-of-merit (MFoM)-learning approach to robust classifier design for text categorization BIBAFull-Text 190-218
  Sheng Gao; Wen Wu; Chin-Hui Lee; Tat-Seng Chua
We propose a maximal figure-of-merit (MFoM)-learning approach for robust classifier design, which directly optimizes performance metrics of interest for different target classifiers. The proposed approach, embedding the decision functions of classifiers and performance metrics into an overall training objective, learns the parameters of classifiers in a decision-feedback manner to effectively take into account both positive and negative training samples, thereby reducing the required size of positive training data. It has three desirable properties: (a) it is a performance metric, oriented learning; (b) the optimized metric is consistent in both training and evaluation sets; and (c) it is more robust and less sensitive to data variation, and can handle insufficient training data scenarios. We evaluate it on a text categorization task using the Reuters-21578 dataset. Training an F1-based binary tree classifier using MFoM, we observed significantly improved performance and enhanced robustness compared to the baseline and SVM, especially for categories with insufficient training samples. The generality for designing other metrics-based classifiers is also demonstrated by comparing precision, recall, and F1-based classifiers. The results clearly show consistency of performance between the training and evaluation stages for each classifier, and MFoM optimizes the chosen metric.
Enhancing relevance feedback in image retrieval using unlabeled data BIBAFull-Text 219-244
  Zhi-Hua Zhou; Ke-Jia Chen; Hong-Bin Dai
Relevance feedback is an effective scheme bridging the gap between high-level semantics and low-level features in content-based image retrieval (CBIR). In contrast to previous methods which rely on labeled images provided by the user, this article attempts to enhance the performance of relevance feedback by exploiting unlabeled images existing in the database. Concretely, this article integrates the merits of semisupervised learning and active learning into the relevance feedback process. In detail, in each round of relevance feedback two simple learners are trained from the labeled data, that is, images from user query and user feedback. Each learner then labels some unlabeled images in the database for the other learner. After retraining with the additional labeled data, the learners reclassify the images in the database and then their classifications are merged. Images judged to be positive with high confidence are returned as the retrieval result, while those judged with low confidence are put into the pool which is used in the next round of relevance feedback. Experiments show that using semisupervised learning and active learning simultaneously in CBIR is beneficial, and the proposed method achieves better performance than some existing methods.
iVIBRATE: Interactive visualization-based framework for clustering large datasets BIBAFull-Text 245-294
  Keke Chen; Ling Liu
With continued advances in communication network technology and sensing technology, there is astounding growth in the amount of data produced and made available through cyberspace. Efficient and high-quality clustering of large datasets continues to be one of the most important problems in large-scale data analysis. A commonly used methodology for cluster analysis on large datasets is the three-phase framework of sampling/summarization, iterative cluster analysis, and disk-labeling. There are three known problems with this framework which demand effective solutions. The first problem is how to effectively define and validate irregularly shaped clusters, especially in large datasets. Automated algorithms and statistical methods are typically not effective in handling these particular clusters. The second problem is how to effectively label the entire data on disk (disk-labeling) without introducing additional errors, including the solutions for dealing with outliers, irregular clusters, and cluster boundary extension. The third obstacle is the lack of research about issues related to effectively integrating the three phases. In this article, we describe iVIBRATE -- an interactive visualization-based three-phase framework for clustering large datasets. The two main components of iVIBRATE are its VISTA visual cluster-rendering subsystem which invites human interplay into the large-scale iterative clustering process through interactive visualization, and its adaptive ClusterMap labeling subsystem which offers visualization-guided disk-labeling solutions that are effective in dealing with outliers, irregular clusters, and cluster boundary extension. Another important contribution of iVIBRATE development is the identification of the special issues presented in integrating the two components and the sampling approach into a coherent framework, as well as the solutions for improving the reliability of the framework and for minimizing the amount of errors generated within the cluster analysis process. We study the effectiveness of the iVIBRATE framework through a walkthrough example dataset of a million records and we experimentally evaluate the iVIBRATE approach using both real-life and synthetic datasets. Our results show that iVIBRATE can efficiently involve the user in the clustering process and generate high-quality clustering results for large datasets.

TOIS 2006 Volume 24 Issue 3

Extraction of coherent relevant passages using hidden Markov models BIBAFull-Text 295-319
  Jing Jiang; Chengxiang Zhai
In information retrieval, retrieving relevant passages, as opposed to whole documents, not only directly benefits the end user by filtering out the irrelevant information within a long relevant document, but also improves retrieval accuracy in general. A critical problem in passage retrieval is to extract coherent relevant passages accurately from a document, which we refer to as passage extraction. While much work has been done on passage retrieval, the passage extraction problem has not been seriously studied. Most existing work tends to rely on presegmenting documents into fixed-length passages which are unlikely optimal because the length of a relevant passage is presumably highly sensitive to both the query and document.
   In this article, we present a new method for accurately detecting coherent relevant passages of variable lengths using hidden Markov models (HMMs). The HMM-based method naturally captures the topical boundaries between passages relevant and nonrelevant to the query. Pseudo-feedback mechanisms can be naturally incorporated into such an HMM-based framework to improve parameter estimation. We show that with appropriate parameter estimation, the HMM method outperforms a number of strong baseline methods on two datasets. We further show how the HMM method can be applied on top of any basic passage extraction method to improve passage boundaries.
Query enrichment for web-query classification BIBAFull-Text 320-352
  Dou Shen; Rong Pan; Jian-Tao Sun; Jeffrey Junfeng Pan; Kangheng Wu; Jie Yin; Qiang Yang
Web-search queries are typically short and ambiguous. To classify these queries into certain target categories is a difficult but important problem. In this article, we present a new technique called query enrichment, which takes a short query and maps it to intermediate objects. Based on the collected intermediate objects, the query is then mapped to target categories. To build the necessary mapping functions, we use an ensemble of search engines to produce an enrichment of the queries. Our technique was applied to the ACM Knowledge Discovery and Data Mining competition (ACM KDDCUP) in 2005, where we won the championship on all three evaluation metrics (precision, F1 measure, which combines precision and recall, and creativity, which is judged by the organizers) among a total of 33 teams worldwide. In this article, we show that, despite the difficulty of an abundance of ambiguous queries and lack of training data, our query-enrichment technique can solve the problem satisfactorily through a two-phase classification framework. We present a detailed description of our algorithm and experimental evaluation. Our best result for F1 and precision is 42.4% and 44.4%, respectively, which is 9.6% and 24.3% higher than those from the runner-ups, respectively.
CLAIRE: A modular support vector image indexing and classification system BIBAFull-Text 353-379
  Chih-Fong Tsai; Ken McGarry; John Tait
Many users of image retrieval systems would prefer to express initial queries using keywords. However, manual keyword indexing is very time-consuming. Therefore, a content-based image retrieval system which can automatically assign keywords to images would be very attractive. Unfortunately, it has proved very challenging to build such systems, except where either the image domain is restricted or the keywords relate only to low-level concepts such as color. This article presents a novel image indexing and classification system, called CLAIRE (CLAssifying Images for REtrieval), composed of one image processing module and three modules of support vector machines for color, texture, and high-level concept classification for keyword assignment. The experimental prototype system described here assigns up to five keywords selected from a controlled vocabulary of 60 terms to each image. The system is trained offline by 1639 examples from the Corel stock photo library. For evaluation, five judges reviewed a sample of 800 unknown images to identify which automatically assigned keywords were actually relevant to the image. The system proved to have an 80% probability to assign at least one relevant keyword to an image.
A large scale, corpus-based approach for automatically disambiguating biomedical abbreviations BIBAFull-Text 380-404
  Hong Yu; Won Kim; Vasileios Hatzivassiloglou; John Wilbur
Abbreviations and acronyms are widely used in the biomedical literature and many of them represent important biomedical concepts. Because many abbreviations are ambiguous (e.g., CAT denotes both chloramphenicol acetyl transferase and computed axial tomography, depending on the context), recognizing the full form associated with each abbreviation is in most cases equivalent to identifying the meaning of the abbreviation. This, in turn, allows us to perform more accurate natural language processing, information extraction, and retrieval. In this study, we have developed supervised approaches to identifying the full forms of ambiguous abbreviations within the context they appear. We first automatically assigned multiple possible full forms for each abbreviation; we then treated the in-context full-form prediction for each specific abbreviation occurrence as a case of word-sense disambiguation. We generated automatically a dictionary of all possible full forms for each abbreviation. We applied supervised machine-learning algorithms for disambiguation. Because some of the links between abbreviations and their corresponding full forms are explicitly given in the text and can be recovered automatically, we can use these explicit links to automatically provide training data for disambiguating the abbreviations that are not linked to a full form within a text. We evaluated our methods on over 150 thousand abstracts and obtain for coverage and precision results of 82% and 92%, respectively, when performed as tenfold cross-validation, and 79% and 80%, respectively, when evaluated against an external set of abstracts in which the abbreviations are not defined.

TOIS 2006 Volume 24 Issue 4

Introduction to the special issue on XML retrieval BIBFull-Text 405-406
  Ricardo Baeza-Yates; Norbert Fuhr; Yoelle Maarek
Articulating information needs in XML query languages BIBAFull-Text 407-436
  Jaap Kamps; Maarten Marx; Maarten de Rijke; Borkur Sigurbjornson
Document-centric XML is a mixture of text and structure. With the increased availability of document-centric XML documents comes a need for query facilities in which both structural constraints and constraints on the content of the documents can be expressed. How does the expressiveness of languages for querying XML documents help users to express their information needs? We address this question from both an experimental and a theoretical point of view. Our experimental analysis compares a structure-ignorant with a structure-aware retrieval approach using the test suite of the INEX XML Retrieval Evaluation Initiative. Theoretically, we create two mathematical models of users' knowledge of a set of documents and define query languages which exactly fit these models. One of these languages corresponds to an XML version of fielded search, the other to the INEX query language.
   Our main experimental findings are: First, while structure is used in varying degrees of complexity, two-thirds of the queries can be expressed in a fielded-search-like format which does not use the hierarchical structure of the documents. Second, three-quarters of the queries use constraints on the context of the elements to be returned; these contextual constraints cannot be captured by ordinary keyword queries. Third, structure is used as a search hint, and not as a strict requirement, when judged against the underlying information need. Fourth, the use of structure in queries functions as a precision enhancing device.
Dynamic element retrieval in a structured environment BIBAFull-Text 437-454
  Carolyn J. Crouch
This research examines the feasibility of dynamic element retrieval in a structured environment. Structured documents and queries are represented in extended vector form, based on a modification of the basic vector space model suggested by Fox [1983]. A method for the dynamic retrieval of XML elements, which requires only a single indexing of the documents at the level of the basic indexing node, is presented. This method, which we refer to as flexible retrieval, produces a rank ordered list of retrieved elements that is equivalent to the result produced by the same retrieval against an all-element index of the collection. Flexible retrieval obviates the need for storing either an all-element index or multiple indices of the collection.
Preparing heterogeneous XML for full-text search BIBAFull-Text 455-474
  Miro Lehtonen
XML retrieval is facing new challenges when applied to heterogeneous XML documents, where next to nothing about the document structure can be taken for granted. We have developed solutions where some of the heterogeneity issues are addressed. Our fragment selection algorithm selectively divides a heterogeneous document collection into equi-sized fragments with full-text content. If the content is considered too data-oriented, it is not accepted. The algorithm needs no information about element names. In addition, three techniques for fragment expansion are presented, all of which yield a 13-17%; average improvement in average precision. These techniques and algorithms are among the first steps in developing document-type-independent indexing methods for the full text in heterogeneous XML collections.
A system for the static analysis of XPath BIBAFull-Text 475-502
  Pierre Geneves; Nabil Layaida
XPath is the standard language for navigating XML documents and returning a set of matching nodes. We present a sound and complete decision procedure for containment of XPath queries, as well as other related XPath decision problems such as satisfiability, equivalence, overlap, and coverage. The considered XPath fragment covers most of the language features used in practice. Specifically, we propose a unifying logic for XML, namely, the alternation-free modal μ-calculus with converse. We show how to translate major XML concepts such as XPath and regular XML types (including DTDs) into this logic. Based on these embeddings, we show how XPath decision problems, in the presence or absence of XML types, can be solved using a decision procedure for μ-calculus satisfiability. We provide a complexity analysis of our system together with practical experiments to illustrate the efficiency of the approach for realistic scenarios.
eXtended cumulated gain measures for the evaluation of content-oriented XML retrieval BIBAFull-Text 503-542
  Gabriella Kazai; Mounia Lalmas
We propose and evaluate a family of measures, the eXtended Cumulated Gain (XCG) measures, for the evaluation of content-oriented XML retrieval approaches. Our aim is to provide an evaluation framework that allows the consideration of dependency among XML document components. In particular, two aspects of dependency are considered: (1) near-misses, which are document components that are structurally related to relevant components, such as a neighboring paragraph or container section, and (2) overlap, which regards the situation wherein the same text fragment is referenced multiple times, for example, when a paragraph and its container section are both retrieved. A further consideration is that the measures should be flexible enough so that different models of user behavior may be instantiated within. Both system- and user-oriented aspects are investigated and both recall and precision-like qualities are measured. We evaluate the reliability of the proposed measures based on the INEX 2004 test collection. For example, the effects of assessment variation and topic set size on evaluation stability are investigated, and the upper and lower bounds of expected error rates are established. The evaluation demonstrates that the XCG measures are stable and reliable, and in particular, that the novel measures of effort-precision and gain-recall (ep/gr) show comparable behavior to established IR measures like precision and recall.