
Proceedings of ADCS'12, Australasian Document Computing Symposium

Fullname: Proceedings of the Seventeenth Australasian Document Computing Symposium
Editors: Andrew Trotman; Sally Jo Cunningham; Laurianne Sitbon
Location: Dunedin, New Zealand
Dates: 2012-Dec-05 to 2012-Dec-06
Standard No: ISBN 978-1-4503-1411-4; hcibib: ADCS12
Effects of spam removal on search engine efficiency and effectiveness (pp. 1-8)
  Matt Crane; Andrew Trotman
Spam has long been identified as a problem that web search engines are required to deal with. Large collection sizes are also an increasing issue for institutions that do not have the necessary resources to process them in their entirety. In this paper we investigate the effect that withholding documents identified as spam has on the resources required to process large collections. We also investigate the resulting search effectiveness and efficiency when different amounts of spam are withheld. We find that by removing spam at indexing time we are able to decrease the index size without affecting the indexing throughput, and are able to improve search precision for some thresholds.
Efficient indexing algorithms for approximate pattern matching in text (pp. 9-16)
  Matthias Petri; J. Shane Culpepper
Approximate pattern matching is an important computational problem with a wide variety of applications in Information Retrieval. Efficient solutions to approximate pattern matching can be applied to natural language keyword queries with spelling mistakes, OCR scanned text incorporated into indexes, language model ranking algorithms based on term proximity, or DNA databases containing sequencing errors. In this paper, we present a novel approach to constructing text indexes capable of efficiently supporting approximate search queries. Our approach relies on a new variant of the Context Bound Burrows-Wheeler Transform (k-bwt), referred to as the Variable Depth Burrows-Wheeler Transform (v-bwt). First, we describe our new algorithm, and show that it is reversible. Next, we show how to use the transform to support efficient text indexing and approximate pattern matching. Lastly, we empirically evaluate the use of the v-bwt for DNA and English text collections, and show a significant improvement in approximate search efficiency over more traditional q-gram based approximate pattern matching algorithms.
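For readers unfamiliar with the transforms this abstract builds on, the following is a minimal sketch of the standard Burrows-Wheeler Transform and of a context-bound variant as it is commonly described (rotations sorted on their first k symbols only, with ties kept in text order). It is not the v-bwt proposed in the paper, whose construction the abstract does not give. The function names, the naive sort over rotations, and the `$` end-of-text sentinel are assumptions made for illustration.

```python
def rotations_sorted(text, depth=None):
    """Return the start positions of the rotations of `text` in sorted order.

    depth=None compares whole rotations (standard BWT). An integer depth
    compares only the first `depth` symbols; Python's sort is stable, so
    tied rotations stay in their original text order.
    """
    n = len(text)
    doubled = text + text  # lets us slice any rotation without wrapping logic

    def key(i):
        rotation = doubled[i:i + n]
        return rotation if depth is None else rotation[:depth]

    return sorted(range(n), key=key)


def bwt(text, depth=None):
    """Last column of the (possibly depth-bounded) sorted rotation matrix."""
    order = rotations_sorted(text, depth)
    # The last symbol of the rotation starting at i is text[i - 1]
    # (for i == 0 this wraps around to the final symbol).
    return "".join(text[i - 1] for i in order)


if __name__ == "__main__":
    print(bwt("banana$"))            # full sort
    print(bwt("banana$", depth=1))   # context bounded to one symbol
```

This naive version sorts whole rotations and is only suitable for short strings; practical indexes would use suffix-array construction instead.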
Reordering an index to speed query processing without loss of effectiveness (pp. 17-24)
  David Hawking; Timothy Jones
Following Long and Suel, we empirically investigate the importance of document order in search engines which rank documents using a combination of dynamic (query-dependent) and static (query-independent) scores, and use document-at-a-time (DAAT) processing. When inverted file postings are in collection order, assigning document numbers in order of descending static score supports lossless early termination while maintaining good compression.
   Since static scores may not be available until all documents have been gathered and indexed, we build a tool for reordering an existing index and show that it operates in less than 20% of the original indexing time. We note that this additional cost is easily recouped by savings at query processing time. We compare best early-termination points for several different index orders on three enterprise search collections (a whole-of-government index with two very different query sets, and a collection from a UK university). We also present results for the same orders for ClueWeb09-CatB. Our evaluation focuses on finding results likely to be clicked on by users of Web or website search engines -- Nav and Key results in the TREC 2011 Web Track judging scheme.
   The orderings tested are Original, Reverse, Random, and QIE (descending order of static score). For three enterprise search test sets we find that QIE order can achieve close-to-maximal search effectiveness with much lower computational cost than for other orderings. Additionally, reordering has negligible impact on compressed index size for indexes that contain position information. Our results for an artificial query set against the TREC ClueWeb09 Category B collection are much more equivocal and we canvass possible explanations for future investigation.
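   The idea of lossless early termination under descending static-score order can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the final score is the sum of a static and a dynamic component, that an upper bound `max_dynamic` on the dynamic component is known, and that document numbers have already been assigned in descending static-score order. All names are hypothetical.

```python
import heapq


def top_k_early_termination(postings, static, dynamic, max_dynamic, k):
    """Document-at-a-time scan that stops once no later document can enter the top k.

    postings    : candidate doc ids in ascending order; by assumption this is
                  also descending static-score order
    static      : static[doc] is the query-independent score
    dynamic     : dynamic[doc] is the query-dependent score (precomputed here
                  for simplicity)
    max_dynamic : upper bound on any dynamic score for this query
    Returns (top-k list of (score, doc), number of documents fully scored).
    """
    heap = []    # min-heap holding the best k (score, doc) pairs seen so far
    scored = 0
    for doc in postings:
        # Every remaining doc has static score <= static[doc], so its total
        # cannot exceed static[doc] + max_dynamic. If that bound does not beat
        # the current k-th best score, the top-k scores can no longer change.
        if len(heap) == k and static[doc] + max_dynamic <= heap[0][0]:
            break
        scored += 1
        score = static[doc] + dynamic[doc]
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), scored


if __name__ == "__main__":
    static = [90, 80, 50, 30, 10, 5]    # descending, as the ordering requires
    dynamic = [10, 30, 20, 0, 10, 20]
    print(top_k_early_termination(range(6), static, dynamic, max_dynamic=30, k=2))
```

   With a random or reversed document order the bound in the `if` test is no longer valid, which is why such orders cannot stop early without risking a change in the results.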
Comparing scanning behaviour in web search on small and large screens (pp. 25-30)
  Jaewon Kim; Paul Thomas; Ramesh Sankaranarayana; Tom Gedeon
Although web search on mobile devices is common, little is known about how users read search result lists on a small screen. We used eye tracking to compare users' scanning behaviour of web search engine result pages on a small screen (hand-held devices) and a large screen (desktops or laptops). The objective was to determine whether search result pages should be designed differently for mobile devices. To compare scanning behaviour, we considered only the fixation time and scanning strategy using our new method called 'Trackback'. The results showed that on a small screen, users spend relatively more time to conduct a search than they do on a large screen, despite tending to look less far ahead beyond the link that they eventually select. They also show a stronger tendency to seek information within the top three results on a small screen than on a large screen. The reason for this tendency may be difficulties in reading and the relative location of page folds. The results clearly indicated that scanning behaviour during web search on a small screen is different from that on a large screen. Thus, research efforts should be invested in improving the presentation of search engine result pages on small screens, taking scanning behaviour into account. This will help provide a better search experience in terms of search time, accuracy of finding correct links, and user satisfaction.
Explaining difficulty navigating a website using page view data (pp. 31-38)
  Paul Thomas
A user's behaviour on a web site can tell us something about that user's experience. In particular, we believe there are simple signals -- including circling back to previous pages, and swapping out to a search engine -- that indicate difficulty navigating a site.
   Simple page view patterns from web server logs correlate with these signals and may explain them. Extracting these patterns can help web authors understand where, and why, their sites are confusing or hard to navigate.
   We illustrate these ideas with data from almost a million sessions on a government website. In this case a small number of page view patterns are present in almost a third of difficult sessions, suggesting possible improvements to website language or design. We also introduce a tool for web authors, which makes this analysis available in the context of the site itself.
Relationship between the nature of the search task types and query reformulation behaviour (pp. 39-46)
  Khamsum Kinley; Dian Tjondronegoro; Helen Partridge; Sylvia Edwards
Success of query reformulation and relevant information retrieval depends on many factors, such as users' prior knowledge, age, gender, and cognitive styles. One important factor affecting a user's query reformulation behaviour is the nature of the search task. Few studies have examined the impact of search task types on query reformulation behaviour during Web searches. This paper examines how the nature of the search tasks affects users' query reformulation behaviour during information searching. The paper reports empirical results from a user study in which 50 participants performed a set of three Web search tasks -- exploratory, factual and abstract. Users' interactions with search engines were logged by using a monitoring program. 872 unique search queries were classified into five query types -- New, Add, Remove, Replace and Repeat. Users submitted fewer queries for the factual task, which accounted for 26% of the total queries. They submitted a higher number of queries (40% of the total queries) while carrying out the exploratory task. A one-way MANOVA test indicated a significant effect of search task types on users' query reformulation behaviour. In particular, the search task types influenced the manner in which users reformulated the New and Repeat queries.
Models and metrics: IR evaluation as a user process (pp. 47-54)
  Alistair Moffat; Falk Scholer; Paul Thomas
Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgments. The former has the benefit of directly assessing the actual goal of the system, namely the user's ability to complete a search task; whereas the latter approach has the benefit of being quantitative and repeatable. Each given effectiveness metric is an attempt to bridge the gap between these two evaluation approaches, since the implicit belief supporting the use of any particular metric is that user task performance should be correlated with the numeric score provided by the metric. In this work we explore that linkage, considering a range of effectiveness metrics, and the user search behavior that each of them implies. We then examine more complex user models, as a guide to the development of new effectiveness metrics. We conclude by summarizing an experiment that we believe will help establish the strength of the linkage between models and metrics.
Sentence length bias in TREC novelty track judgements (pp. 55-61)
  Lorena Leal Bando; Falk Scholer; Andrew Turpin
The Cranfield methodology for comparing document ranking systems has also been applied recently to comparing sentence ranking methods, which are used as pre-processors for summary generation methods. In particular, the TREC Novelty track data has been used to assess whether one sentence ranking system is better than another. This paper demonstrates that there is a strong bias in the Novelty track data for relevant sentences to also be longer sentences. Thus, systems that simply choose the longest sentences will often appear to perform better in terms of identifying "relevant" sentences than systems that use other methods. We demonstrate, by example, how this can lead to misleading conclusions about the comparative effectiveness of sentence ranking systems. We then demonstrate that if the Novelty track data is split into subcollections based on sentence length, comparing systems on each of the subcollections leads to conclusions that avoid the bias.
Multi-aspect group formation using facility location analysis (pp. 62-71)
  Mahmood Neshati; Hamid Beigy; Djoerd Hiemstra
In this paper, we propose an optimization framework to retrieve an optimal group of experts to perform a given multi-aspect task/project. Each task needs a diverse set of skills and the group of assigned experts should be able to collectively cover all required aspects of the task. We consider three types of multi-aspect team formation problems and propose a unified framework to solve these problems accurately and efficiently. Our proposed framework is based on Facility Location Analysis (FLA), which is a well-known branch of Operations Research (OR). Our experiments on a real dataset show significant improvement in comparison with the state-of-the-art approaches for the team formation problem.
An ontology derived from heterogeneous sustainability indicator set documents (pp. 72-79)
  Lida Ghahremanloo; James A. Thom; Liam Magee
We present an ontology to represent the key concepts of sustainability indicators that are increasingly being used to measure the economic, environmental and social properties of complex systems. There have been few efforts to represent multiple indicators formally, in spite of the fact that comparison of indicators and measurements across reporting contexts is a critical task. In this paper, we apply the METHONTOLOGY approach to guide the construction of two design candidates we term Generic and Specific. Of the two, the generic design is more abstract, with fewer classes and properties. Documents describing two indicator systems -- the Global Reporting Initiative and the Organisation for Economic Co-operation and Development -- are used in the design of both candidate ontologies. We then evaluate both ontology designs using the ROMEO approach, to calculate their level of coverage against the seen indicators, as well as against an unseen third indicator set (the United Nations Statistics Division). We also show that use of existing structured approaches like METHONTOLOGY and ROMEO can reduce ambiguity in ontology design and evaluation for domain-level ontologies. It is concluded that where an ontology needs to be designed for both seen and unseen indicator systems, a generic and reusable design is preferable.
Graph-based concept weighting for medical information retrieval (pp. 80-87)
  Bevan Koopman; Guido Zuccon; Peter Bruza; Laurianne Sitbon; Michael Lawley
This paper presents a graph-based method to weight medical concepts in documents for the purposes of information retrieval. Medical concepts are extracted from free-text documents using a state-of-the-art technique that maps n-grams to concepts from the SNOMED CT medical ontology. In our graph-based concept representation, concepts are vertices in a graph built from a document, and edges represent associations between concepts. This representation naturally captures dependencies between concepts, an important requirement for interpreting medical text, and a feature lacking in bag-of-words representations.
   We apply existing graph-based term weighting methods to weight medical concepts. Using concepts rather than terms addresses vocabulary mismatch and encapsulates terms belonging to a single medical entity into a single concept. In addition, we extend previous graph-based approaches by injecting domain knowledge that estimates the importance of a concept within the global medical domain.
   Retrieval experiments on the TREC Medical Records collection show our method outperforms both term and concept baselines. More generally, this work provides a means of integrating background knowledge contained in medical ontologies into data-driven information retrieval approaches.
A study in language identification (pp. 88-95)
  Rachel Mary Milne; Richard A. O'Keefe; Andrew Trotman
Language identification is the task of automatically determining the language that a previously unseen document was written in. We compared several prior methods on samples from the Wikipedia and the EuroParl collections. Most of these methods work well. However, we find that these (and presumably other document) collections are heterogeneous in size, and that short documents are systematically different from long ones: the techniques that work well on long documents are different from those that work well on short ones. We believe that improvement in algorithms will be seen if length is taken into account.
An attempt to measure the quality of questions in question time of the Australian Federal Parliament (pp. 96-103)
  Andrew Turpin
This paper uses standard information retrieval techniques to measure the quality of information exchange during Question Time in the Australian Federal Parliament's House of Representatives from 1998 to 2012. A search engine is used to index all answers to questions and then to run each question as a query, recording the rank of the actual answer in the returned list of documents. Using this rank as a measure of quality, Question Time has deteriorated over the last decade. The main deterioration has been in information exchange in "Dorothy Dixer" questions. The corpus used for this study is available from the author's web page for further investigations.
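The measurement procedure described in this abstract (index the answers, run each question as a query, record the rank of the true answer) can be sketched as below. The abstract does not say which search engine or ranking function was used, so the whitespace tokenizer and the simple TF-IDF scoring here are assumptions chosen only to keep the example self-contained; the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; no stemming or stopping)."""
    return text.lower().split()


def answer_ranks(questions, answers):
    """For each question i, return the 1-based rank of answers[i] when the
    question is run as a query against all answers.

    A rank of 1 means the true answer was the best match for its question;
    larger ranks suggest the answer shares little vocabulary with the question.
    """
    docs = [Counter(tokenize(a)) for a in answers]
    n = len(docs)
    # Document frequency: number of answers containing each term.
    df = Counter(term for d in docs for term in d)
    ranks = []
    for qi, question in enumerate(questions):
        scored = []
        for di, d in enumerate(docs):
            # Simple TF-IDF sum over query terms present in the answer.
            s = sum(d[t] * math.log(1 + n / df[t])
                    for t in tokenize(question) if t in d)
            scored.append((-s, di))   # negative so the best score sorts first
        scored.sort()                 # ties are broken by answer index
        order = [di for _, di in scored]
        ranks.append(order.index(qi) + 1)
    return ranks


if __name__ == "__main__":
    questions = [
        "what is the budget deficit",
        "how many hospitals were built",
        "when will the hospitals budget be released",
    ]
    answers = [
        "the budget deficit is falling",
        "we built ten hospitals",
        "the minister thanks the member",
    ]
    print(answer_ranks(questions, answers))
```

In this toy data the third answer does not address its question, so it is ranked below another answer; averaging such ranks per year would give one possible quality trend of the kind the abstract reports.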
An English-translated parallel corpus for the CJK Wikipedia collections (pp. 104-110)
  Ling-Xiang Tang; Shlomo Geva; Andrew Trotman
In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named the CJK2E Wikipedia XML corpus. The corpus could be used in many ways by the information retrieval research community and for knowledge sharing in Wikipedia; for example, it could be used for experiments in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia.
Exploiting medical hierarchies for concept-based information retrieval (pp. 111-114)
  Guido Zuccon; Bevan Koopman; Anthony Nguyen; Deanne Vickers; Luke Butt
Search technologies are critical to enable clinical staff to rapidly and effectively access patient information contained in free-text medical records. Medical search is challenging as terms in the query are often general but those in relevant documents are very specific, leading to granularity mismatch.
   In this paper we propose to tackle granularity mismatch by exploiting subsumption relationships defined in formal medical domain knowledge resources. In symbolic reasoning, a subsumption (or 'is-a') relationship is a parent-child relationship where one concept is a subset of another concept. Subsumed concepts are included in the retrieval function. In addition, we investigate a number of initial methods for combining weights of query concepts and those of subsumed concepts. Subsumption relationships were found to provide strong indication of relevant information; their inclusion in retrieval functions yields performance improvements. This result motivates the development of formal models of relationships between medical concepts for retrieval purposes.
Finding additional semantic entity information for search engines (pp. 115-122)
  Jun Hou; Richi Nayak; Jinglan Zhang
Entity-oriented search has become an essential component of modern search engines. It focuses on retrieving a list of entities or information about specific entities instead of documents. In this paper, we study the problem of finding entity-related information, referred to as attribute-value pairs, that plays a significant role in searching target entities. We propose a novel decomposition framework combining reduced relations and the discriminative model, Conditional Random Field (CRF), for automatically finding entity-related attribute-value pairs from free text documents. This decomposition framework allows us to locate potential text fragments and identify the hidden semantics, in the form of attribute-value pairs, for user queries. Empirical analysis shows that the decomposition framework outperforms pattern-based approaches due to its capability of effective integration of syntactic and semantic features.
Is the unigram relevance model term independent?: classifying term dependencies in query expansion (pp. 123-127)
  Mike Symonds; Peter Bruza; Guido Zuccon; Laurianne Sitbon; Ian Turner
This paper develops a framework for classifying term dependencies in query expansion with respect to the role terms play in structural linguistic associations. The framework is used to classify and compare the query expansion terms produced by the unigram and positional relevance models. As the unigram relevance model does not explicitly model term dependencies in its estimation process it is often thought to ignore dependencies that exist between words in natural language.
   The framework presented in this paper is underpinned by two types of linguistic association, namely syntagmatic and paradigmatic associations. It was found that syntagmatic associations were a more prevalent form of linguistic association used in query expansion. Paradoxically, it was the unigram model that exhibited this association more than the positional relevance model. This surprising finding has two potential implications for information retrieval models: (1) if linguistic associations underpin query expansion, then a probabilistic term dependence assumption based on position is inadequate for capturing them; (2) the unigram relevance model captures more term dependency information than its underlying theoretical model suggests, so its normative position as a baseline that ignores term dependencies should perhaps be reviewed.
Pairwise similarity of TopSig document signatures (pp. 128-134)
  Christopher M. De Vries; Shlomo Geva
This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances is compared to that of purely random signatures. The analysis explains why TopSig is only competitive with state-of-the-art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.
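The comparison against random signatures can be illustrated with a small sketch. It assumes signatures are fixed-width binary strings compared by Hamming distance, stored here as Python integers; the signature width, the helper names, and the use of a seeded generator are assumptions for illustration, and no TopSig signatures are produced. For purely random signatures of width n, pairwise distances concentrate around n/2, which is the baseline a real collection's distribution would be compared against.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of bit positions in which two integer-encoded signatures differ."""
    return bin(a ^ b).count("1")


def pairwise_distances(signatures):
    """Hamming distance for every unordered pair of signatures."""
    return [hamming(a, b) for a, b in combinations(signatures, 2)]


def random_signatures(count, bits, seed=0):
    """Purely random signatures: each bit is set independently with probability 1/2."""
    rng = random.Random(seed)
    return [rng.getrandbits(bits) for _ in range(count)]


if __name__ == "__main__":
    bits = 1024
    distances = pairwise_distances(random_signatures(200, bits))
    mean = sum(distances) / len(distances)
    print(f"pairs: {len(distances)}, mean distance: {mean:.1f}, expected about {bits / 2}")
```

Signature pairs from a real collection that fall well below this random baseline are the ones that carry similarity information, consistent with the abstract's point that only the local neighbourhood is interpretable.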
Putting the public into public health information dissemination: social media and health-related web pages (pp. 135-138)
  Robert Steele; Dan Dumbrell
Public health information dissemination represents an interesting combination of broadcasting, sharing, and retrieving relevant health information. Social media-based public health information dissemination offers some particularly interesting characteristics, as individual users or members of the public actually carry out the actions that constitute the dissemination. These actions may also inherently provide novel evaluative information from a document computing perspective, providing information in relation to both documents and indeed the social media users or health consumers themselves. This paper discusses the novel aspects of social media-based public health information dissemination, including a comparison of its characteristics with search engine-based Web document retrieval. A preliminary analysis of a sample of public health advice tweets taken from a larger sample of over 4700 tweets sent by Australian health-related organizations in February 2012 is described. Various preliminary measures are analyzed from this data to suggest possible characteristics of public health information dissemination and document evaluation in micro-blog-based systems.