
CLEF 2014: International Conference of the Cross-Language Evaluation Forum

Fullname: CLEF 2014: 5th International Conference of the CLEF Initiative. Information Access Evaluation -- Multilinguality, Multimodality, and Visualization
Editors: Evangelos Kanoulas; Mihai Lupu; Paul Clough; Mark Sanderson; Mark Hall; Allan Hanbury; Elaine Toms
Location: Sheffield, United Kingdom
Dates: 2014-Sep-15 to 2014-Sep-18
Publisher: Springer International Publishing
Series: Lecture Notes in Computer Science 8685
Standard No: DOI: 10.1007/978-3-319-11382-1; hcibib: CLEF14; ISBN: 978-3-319-11381-4 (print), 978-3-319-11382-1 (online)
Links: Online Proceedings | Conference Website
  1. Evaluation
  2. Domain-Specific Approaches
  3. Alternative Search Tasks
  4. CLEF Lab Overviews

Evaluation

Making Test Corpora for Question Answering More Representative BIBAFull-Text 1-6
  Andrew Walker; Andrew Starkey; Jeff Z. Pan; Advaith Siddharthan
Despite two high-profile series of challenges devoted to question answering technologies, there remains no formal study of how representative question corpora are of real end-user inputs. We examine the corpora used presently and historically in the TREC and QALD challenges in juxtaposition with two more drawn from natural sources and identify a degree of disjointedness between the two. We analyse these differences in depth before discussing a candidate approach to question corpus generation and assessing its own representativeness. We conclude that these artificial corpora have good overall coverage of grammatical structures, but that the distribution is skewed, meaning performance measures may be inaccurate.
Towards Automatic Evaluation of Health-Related CQA Data BIBAFull-Text 7-18
  Alexander Beloborodov; Pavel Braslavski; Marina Driker
The paper reports on the evaluation of Russian community question answering (CQA) data in the health domain. About 1,500 question-answer pairs were manually evaluated by medical professionals; in addition, automatic evaluation based on reference disease-medicine pairs was performed. Although the results of the manual and automatic evaluation do not fully match, we find the method promising and propose several improvements. Automatic processing can be used to dynamically monitor the quality of CQA content and to compare different data sources. Moreover, the approach can be useful for symptomatic surveillance and health education campaigns.
Rethinking How to Extend Average Precision to Graded Relevance BIBAFull-Text 19-30
  Marco Ferrante; Nicola Ferro; Maria Maistro
We present two new measures of retrieval effectiveness inspired by Graded Average Precision (GAP), which extends Average Precision (AP) to graded relevance judgements. Starting from the random choice of a user, we define Extended Graded Average Precision (xGAP) and Expected Graded Average Precision (eGAP), which are more accurate than GAP when there are few highly relevant documents with a high probability of being considered relevant by users. The proposed measures are then evaluated on the TREC 10, TREC 14, and TREC 21 collections, showing that they grasp a different angle from GAP and that they are robust to incomplete judgments and shallow pools.
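For readers unfamiliar with the baseline measure being extended, a minimal sketch of standard binary Average Precision follows. This is an illustrative implementation only, not the authors' xGAP/eGAP formulation:

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Binary Average Precision over a ranked list of 0/1 relevance labels."""
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / total_relevant
```

The graded variants replace the binary labels with user-dependent probabilities of perceiving a document as relevant, which is where xGAP and eGAP depart from this baseline.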
CLEF 15th Birthday: What Can We Learn From Ad Hoc Retrieval? BIBAFull-Text 31-43
  Nicola Ferro; Gianmaria Silvello
This paper reports the outcomes of a longitudinal study of the CLEF Ad Hoc track, assessing its impact on the effectiveness of monolingual, bilingual, and multilingual information access and retrieval systems. Monolingual retrieval shows a positive trend, even if the performance increase is not always steady from year to year; bilingual retrieval has demonstrated higher improvements in recent years, probably due to the better linguistic resources now available; and multilingual retrieval exhibits constant improvement and performance comparable to bilingual (and sometimes even monolingual) retrieval.
An Information Retrieval Ontology for Information Retrieval Nanopublications BIBAFull-Text 44-49
  Aldo Lipani; Florina Piroi; Linda Andersson; Allan Hanbury
Retrieval experiments produce plenty of data, such as experiment settings and experimental results, that are usually not all included in the published articles. Even when they are mentioned, they are not easily machine-readable. We propose the use of IR nanopublications to describe such information in a formal language. Furthermore, to support the unambiguous description of IR domain aspects, we present a preliminary IR ontology. The use of IR nanopublications will facilitate the assessment and comparison of IR systems and enhance the reproducibility and reliability of IR research progress.
Supporting More-Like-This Information Needs: Finding Similar Web Content in Different Scenarios BIBAFull-Text 50-61
  Matthias Hagen; Christiane Glimm
We examine more-like-this information needs in different scenarios. A more-like-this information need occurs when a user sees one interesting document and wants to access other, similar documents. One of our foci is comparing different strategies for identifying related web content. We compare following links (i.e., crawling), automatically generating keyqueries for the seen document (i.e., queries that return the document among their top-ranked results), and search engine operators that automatically display related results. Our experimental study shows that in different scenarios, different strategies yield the most promising related results.
   One of our use cases is automatically supporting people who monitor right-wing content on the web. In this scenario, it turns out that crawling from a given set of seed documents is the best strategy for finding related pages with similar content; querying and the related operator yield far fewer good results. For news portals, however, crawling is a bad idea, since hardly any news portal links to other news portals; instead, a search engine's related operator or querying are better strategies. Finally, for identifying related scientific publications for a given paper, all three strategies yield good results.

Domain-Specific Approaches

SCAN: A Swedish Clinical Abbreviation Normalizer BIBAFull-Text 62-73
  Maria Kvist; Sumithra Velupillai
Abbreviations pose a challenge for information extraction systems. In clinical text, abbreviations are abundant, as this type of documentation is written under time pressure. We report work on characterizing abbreviations in Swedish clinical text and on the development of SCAN, a Swedish Clinical Abbreviation Normalizer built to improve information access systems in the clinical domain. The clinical domain includes several subdomains with differing vocabularies, depending on the nature of the specialist work, and adaptation of NLP tools may consequently be necessary. We extend and adapt SCAN and evaluate it on two clinical subdomains: the emergency department (ED) and radiology (X-ray). Overall final results are 85% (ED) and 83% (X-ray) F1-measure on the task of abbreviation identification. We also evaluate the coverage of abbreviation expansion candidates in existing lexical resources, and create two new, freely available lexicons with abbreviations and their possible expansions for the two clinical subdomains.
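A heavily simplified sketch of lexicon-based abbreviation expansion with subdomain-specific lexicons, the kind of lookup such a normalizer builds on. All names and lexicon entries here are hypothetical illustrations, not SCAN's actual resources:

```python
import re

# Hypothetical subdomain lexicons: abbreviation -> candidate expansions,
# ordered by assumed frequency in that subdomain (illustrative data only).
LEXICONS = {
    "ED":    {"pat": ["patient"], "bp": ["blood pressure"]},
    "X-ray": {"pat": ["patient"], "fx": ["fracture"]},
}

def expand_abbreviations(text, subdomain, lexicons=LEXICONS):
    """Replace known abbreviations with their most likely expansion
    for the given clinical subdomain; unknown tokens pass through."""
    lexicon = lexicons.get(subdomain, {})
    def replace(match):
        token = match.group(0)
        candidates = lexicon.get(token.lower())
        return candidates[0] if candidates else token
    return re.sub(r"[A-Za-z]+", replace, text)
```

Keeping one lexicon per subdomain mirrors the paper's observation that vocabularies differ between, e.g., emergency and radiology notes.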
A Study of Personalised Medical Literature Search BIBAFull-Text 74-85
  Richard McCreadie; Craig Macdonald; Iadh Ounis; Jon Brassey
Medical search engines are used every day by both medical practitioners and the public to find the latest medical literature and guidance regarding conditions and treatments. Importantly, the information needs that drive medical search can vary between users for the same query, as clinicians search for content specific to their own area of expertise, while the public search about topics of interest to them. However, prior research into personalised search has so far focused on the Web search domain, and it is not clear whether personalised approaches will prove similarly effective in a medical environment. Hence, in this paper, we investigate to what extent personalisation can enhance medical search effectiveness. In particular, we first adapt three classical approaches for the task of personalisation in the medical domain, which leverage the user's clicks, clicks by similar users, and explicit/implicit user profiles, respectively. Second, we perform a comparative user study with users from the TRIPDatabase.com medical article search engine to determine whether these approaches outperform an effective baseline production system. Our results show that search result personalisation in the medical domain can be effective, with users stating a preference for personalised rankings for 68% of the queries assessed. Furthermore, we show that for the queries tested, users mainly preferred personalised rankings that promote recent content clicked by similar users, highlighting time as a key dimension of medical article search.
A Hybrid Approach for Multi-faceted IR in Multimodal Domain BIBAKFull-Text 86-97
  Serwah Sabetghadam; Ralf Bierig; Andreas Rauber
We present a model for multimodal information retrieval that leverages different information sources to improve the effectiveness of a retrieval system. The method takes into account multifaceted IR in addition to the semantic relations present in data objects, which can be used to answer complex queries combining similarity and semantic search. By providing a graph data structure and utilizing hybrid search in addition to structured search techniques, we take advantage of relations in the data to improve retrieval. We tested the model on the ImageCLEF 2011 Wikipedia collection, a multimodal benchmark data collection, for an image retrieval task.
Keywords: Multimodal; Information Retrieval; Graph; Hybrid Search; Facet; Spreading Activation
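The spreading activation named in the keywords is a standard way to rank related objects in such a graph: activation starts at query-matched nodes and flows along edges with attenuation. A toy sketch under assumed conventions (graph encoding, decay factor, and update rule are illustrative, not the paper's exact model):

```python
def spread_activation(graph, seeds, decay=0.5, iterations=2):
    """Spreading activation over a weighted directed graph.

    graph: node -> {neighbour: edge_weight}; seeds: node -> initial activation
    (e.g. similarity scores of query hits). Each iteration propagates activation
    to neighbours, attenuated by `decay`, and accumulates it.
    """
    activation = dict(seeds)
    for _ in range(iterations):
        incoming = {}
        for node, value in activation.items():
            for neighbour, weight in graph.get(node, {}).items():
                incoming[neighbour] = incoming.get(neighbour, 0.0) + value * weight * decay
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
    return activation
```

Nodes reachable only via several hops (e.g. an image linked to a document linked to a query term) receive progressively smaller activation, which is what lets semantic relations contribute to the ranking without dominating direct matches.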
Discovering Similar Passages within Large Text Documents BIBAKFull-Text 98-109
  Demetrios Glinos
We present a novel general method for discovering similar passages within large text documents based on adapting and extending the well-known Smith-Waterman dynamic programming local sequence alignment algorithm. We extend that algorithm for large document analysis by defining: (a) a recursive procedure for discovering multiple non-overlapping aligned passages within a given document pair; (b) a matrix splicing method for processing long texts; (c) a chaining method for combining sequence strands; and (d) an inexact similarity measure for determining token matches. We show that an implementation of this method is computationally efficient and produces very high precision with good recall for several types of order-based plagiarism and that it achieves higher overall performance than the best reported methods against the PAN 2013 text alignment test corpus.
Keywords: passage retrieval; text alignment; plagiarism detection
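The base algorithm being adapted above is Smith-Waterman local alignment. A minimal token-level sketch of that base algorithm, without the paper's extensions (recursive multi-passage discovery, matrix splicing, strand chaining, inexact matching):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Token-level Smith-Waterman: returns the best local alignment score
    and the (end_i, end_j) positions of the aligned passages in a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at zero so alignments restart.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos
```

Tracing back from `best_pos` until the score drops to zero recovers the aligned passage pair; the paper's contribution is making this tractable and repeatable for document-length inputs.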
Improving Transcript-Based Video Retrieval Using Unsupervised Language Model Adaptation BIBAKFull-Text 110-115
  Thomas Wilhelm-Stein; Robert Herms; Marc Ritter; Maximilian Eibl
One challenge in automated speech recognition is recognising domain-specific vocabulary, such as names, brands, and technical terms, with generic language models; in broadcast news especially, new names occur frequently. We present an unsupervised method for language model adaptation, used in automated speech recognition with a two-pass decoding strategy to improve spoken document retrieval on broadcast news. After keywords are extracted from each utterance, a web resource is queried to collect utterance-specific adaptation data. This data is used to augment the phonetic dictionary and adapt the basic language model. We evaluated this strategy on a data set of summarized German broadcast news using a basic retrieval setup.
Keywords: language modeling; out-of-vocabulary; spoken document retrieval; unsupervised adaptation
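The adaptation step of such a scheme is commonly a linear interpolation of the base model with a model estimated from the collected web data. A unigram sketch under assumed conventions (the function and the mixing weight `lam` are illustrative, not the authors' exact formulation):

```python
from collections import Counter

def adapt_unigram_lm(base_probs, adaptation_text, lam=0.7):
    """Interpolate a base unigram LM with one estimated from adaptation data:
    P(w) = lam * P_base(w) + (1 - lam) * P_adapt(w).

    Words absent from the base model (e.g. new names in broadcast news)
    receive probability mass from the adaptation side, shrinking the
    out-of-vocabulary problem."""
    counts = Counter(adaptation_text.lower().split())
    total = sum(counts.values())
    vocab = set(base_probs) | set(counts)
    return {
        w: lam * base_probs.get(w, 0.0) + (1 - lam) * counts[w] / total
        for w in vocab
    }
```

Because both mixed distributions sum to one, the interpolated model remains a proper distribution for any `lam` in [0, 1].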

Alternative Search Tasks

Self-supervised Relation Extraction Using UMLS BIBAFull-Text 116-127
  Roland Roller; Mark Stevenson
Self-supervised relation extraction uses a knowledge base to automatically annotate a training corpus, which is then used to train a classifier. This approach has been successfully applied to different domains using a range of knowledge bases. This paper applies the approach to the biomedical domain using UMLS, a large biomedical knowledge base containing millions of concepts and the relations among them. The approach is evaluated using two different techniques. The presented results are promising and indicate that UMLS is a useful resource for semi-supervised relation extraction.
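The core annotation idea can be sketched as follows. This is a toy illustration of the general distant-supervision recipe; the actual system matches UMLS concepts rather than plain substrings:

```python
def distant_label(sentences, kb_pairs):
    """Self-supervised annotation: a sentence mentioning both entities of a
    known knowledge-base relation pair is labelled as a positive training
    example for that relation.

    kb_pairs: iterable of (entity1, entity2, relation) triples.
    Returns (sentence, entity1, entity2, relation) training examples."""
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for e1, e2, relation in kb_pairs:
            if e1.lower() in lowered and e2.lower() in lowered:
                examples.append((sentence, e1, e2, relation))
    return examples
```

The resulting examples are noisy (co-occurrence does not guarantee the relation is expressed), which is why the trained classifier, rather than the matching itself, does the final extraction.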
Authorship Identification Using Dynamic Selection of Features from Probabilistic Feature Set BIBAKFull-Text 128-140
  Hamed Zamani; Hossein Nasr Esfahani; Pariya Babaie; Samira Abnar; Mostafa Dehghani; Azadeh Shakery
Authorship identification is an important problem in the law and journalism fields, and one of the major techniques in plagiarism detection. In this paper, to tackle the authorship verification problem, we propose a probabilistic distribution model that represents each document as a feature set, increasing the interpretability of the results and features. We also introduce a distance measure to compute the distance between two feature sets. Finally, we exploit a KNN-based approach and a dynamic feature selection method to detect the features that discriminate an author's writing style.
   The experimental results on the PAN at CLEF 2013 dataset show the effectiveness of the proposed method. We also show that feature selection is necessary to achieve outstanding performance. In addition, a comprehensive analysis of our dynamic feature selection method shows that the discriminative features differ from author to author.
Keywords: authorship identification; dynamic feature selection; k-nearest neighbors; probabilistic feature set
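A toy sketch of the general idea: represent documents as probabilistic feature sets, compare them with a distance measure, and attribute by nearest neighbour. The feature type (character trigrams), L1 distance, and 1-NN choice are illustrative assumptions, not the paper's exact method:

```python
from collections import Counter

def char_ngram_profile(text, n=3):
    """Represent a document as a normalised character n-gram distribution."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_distance(p, q):
    """L1 distance between two probabilistic feature sets."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def nearest_author(unknown_text, known_docs):
    """1-NN attribution: the author whose document profile is closest."""
    u = char_ngram_profile(unknown_text)
    return min(known_docs,
               key=lambda a: profile_distance(u, char_ngram_profile(known_docs[a])))
```

The paper's dynamic feature selection would additionally restrict the compared keys to features that discriminate the candidate author, rather than using the full profile as here.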
A Real-World Framework for Translator as Expert Retrieval BIBAFull-Text 141-152
  Navid Rekabsaz; Mihai Lupu
This article describes a method and tool for identifying expert translators in an on-demand translation service. We start from existing efforts in expert retrieval and factor in additional parameters based on the real-world scenario of the task. The system first identifies topical expertise using an aggregation function over the relevance scores of documents previously translated by each translator, and then applies a learning-to-rank method to factor in non-topical relevance factors that are part of the user's decision-making process, such as the price and duration of a translation. We test the system on a manually created test collection and show that the method can effectively support the user in selecting the best translator.
Comparing Algorithms for Microblog Summarisation BIBAFull-Text 153-159
  Stuart Mackie; Richard McCreadie; Craig Macdonald; Iadh Ounis
Event detection and tracking using social media and user-generated content has received much attention from the research community in recent years, since such sources can purportedly provide up-to-date information about events as they evolve, e.g. earthquakes. Concisely reporting (summarising) events for users and emergency services using information obtained from social media sources like Twitter is not a solved problem. Current systems either directly apply, or build upon, classical summarisation approaches previously shown to be effective in the newswire domain. However, to date, research into how well these approaches generalise from the newswire to the microblog domain is limited. Hence, in this paper, we compare the performance of eleven summarisation approaches on four microblog summarisation datasets, with the aim of determining which are the most effective and should therefore be used as baselines in future research. Our results indicate that the SumBasic algorithm and centroid-based summarisation with redundancy reduction are the most effective approaches across the four datasets and five automatic summarisation evaluation measures tested.
The Effect of Dimensionality Reduction on Large Scale Hierarchical Classification BIBAKFull-Text 160-171
  Aris Kosmopoulos; Georgios Paliouras; Ion Androutsopoulos
Many classification problems involve a hierarchy of classes that can be exploited to perform hierarchical classification of test objects. The most basic form of hierarchical classification is cascade classification, which greedily traverses the hierarchy from the root to the predicted leaf; to perform cascade classification, a classifier must be trained for each node of the hierarchy. In large-scale problems, the number of features can be prohibitively large for the classifiers at the upper levels of the hierarchy, so it is desirable to reduce the dimensionality of the feature space at these levels. In this paper we examine the computational feasibility of the most common dimensionality reduction method, Principal Component Analysis (PCA), for this problem, as well as the computational benefits it provides for cascade classification and its effect on classification accuracy. Our experiments on two benchmark datasets with large hierarchies show that a certain version of PCA can be performed efficiently in such large hierarchies, with a slight decrease in classifier accuracy. Furthermore, we show that PCA can be applied selectively at the top levels of the hierarchy to reduce the loss in accuracy. Finally, the reduced feature space provided by PCA facilitates the use of more costly and possibly more accurate classifiers, such as non-linear SVMs.
Keywords: Hierarchical Classification; Dimensionality Reduction; Principal Component Analysis
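The cascade procedure described in the abstract can be sketched as follows; the hierarchy and the per-node classifiers here are illustrative stand-ins for trained models:

```python
def cascade_classify(x, hierarchy, classifiers, root="root"):
    """Greedy top-down (cascade) classification: at each internal node, the
    local classifier for that node picks which child to descend into, until
    a leaf (a node with no children) is reached and returned as the label.

    hierarchy: node -> list of child nodes (empty list for leaves).
    classifiers: internal node -> callable mapping an input to a child node.
    """
    node = root
    while hierarchy.get(node):        # node still has children
        node = classifiers[node](x)   # local decision among the children
    return node
```

An error at an upper level cannot be recovered lower down, which is exactly why the abstract's concern about feature dimensionality (and thus classifier quality) at the top levels matters.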

CLEF Lab Overviews

Overview of the ShARe/CLEF eHealth Evaluation Lab 2014 BIBAKFull-Text 172-191
  Liadh Kelly; Lorraine Goeuriot; Hanna Suominen; Tobias Schreck; Gondy Leroy; Danielle L. Mowery; Sumithra Velupillai; Wendy W. Chapman; David Martinez; Guido Zuccon; João Palotti
This paper reports on the 2nd ShARe/CLEFeHealth evaluation lab, which continues our evaluation resource building activities for the medical domain. In this lab we focus on patients' information needs, as opposed to the more common campaign focus on the specialised information needs of physicians and other healthcare workers. The usage scenario of the lab is to ease patients' and next-of-kins' understanding of eHealth information, in particular clinical reports. The 1st ShARe/CLEFeHealth evaluation lab, held in 2013, consisted of three tasks: Task 1 focused on named entity recognition and normalization of disorders; Task 2 on normalization of acronyms/abbreviations; and Task 3 on information retrieval to address questions patients may have when reading clinical reports. This year's lab introduces a new challenge in Task 1 on visual-interactive search and exploration of eHealth data, aiming to help patients (or their next-of-kin) with readability issues related to their hospital discharge documents and with related information search on the Internet. Task 2 continues the information extraction work of the 2013 lab, specifically focusing on disorder attribute identification and normalization in clinical text. Finally, this year's Task 3 further extends the 2013 information retrieval task by cleaning the 2013 document collection and introducing a new query generation method and multilingual queries. The de-identified clinical reports used by the three tasks were from US intensive care and originated from the MIMIC II database. Other text documents for Tasks 1 and 3 were from the Internet and originated from the Khresmoi project. Task 2 annotations originated from the ShARe annotations. For Tasks 1 and 3, new annotations, queries, and relevance assessments were created. 50, 79, and 91 people registered their interest in Tasks 1, 2, and 3, respectively. 24 unique teams participated, with 1, 10, and 14 teams in Tasks 1, 2, and 3, respectively.
The teams were from Africa, Asia, Canada, Europe, and North America. The Task 1 submission, reviewed by 5 expert peers, related to the task evaluation category of effective use of interaction and targeted the needs of both expert and novice users. The best system achieved an accuracy of 0.868 in Task 2a, an F1-score of 0.576 in Task 2b, and a precision at 10 (P@10) of 0.756 in Task 3. The results demonstrate the substantial community interest in, and the capabilities of, these systems in making clinical reports easier for patients to understand. The organisers have made the data and tools available for future research and development.
Keywords: Information Retrieval; Information Extraction; Information Visualisation; Evaluation; Medical Informatics; Test-set Generation; Text Classification; Text Segmentation
ImageCLEF 2014: Overview and Analysis of the Results BIBAFull-Text 192-211
  Barbara Caputo; Henning Müller; Jesus Martinez-Gomez; Mauricio Villegas; Burak Acar; Novi Patricia; Neda Marvasti; Suzan Üsküdarli; Roberto Paredes; Miguel Cazorla; Ismael Garcia-Varea; Vicente Morell
This paper presents an overview of the ImageCLEF 2014 evaluation lab. Since its first edition in 2003, ImageCLEF has become one of the key initiatives promoting the benchmark evaluation of algorithms for the annotation and retrieval of images in various domains, ranging from public and personal images to data acquired by mobile robot platforms and medical archives. Over the years, by providing new data collections and challenging tasks to the community of interest, the ImageCLEF lab has achieved a unique position in the image annotation and retrieval research landscape. The 2014 edition consists of four tasks: domain adaptation, scalable concept image annotation, liver CT image annotation, and robot vision. This paper describes the tasks and the 2014 competition, giving a unifying perspective of the present activities of the lab while discussing future challenges and opportunities.
Overview of INEX 2014 BIBAFull-Text 212-228
  Patrice Bellot; Toine Bogers; Shlomo Geva; Mark Hall; Hugo Huurdeman; Jaap Kamps; Gabriella Kazai; Marijn Koolen; Véronique Moriceau; Josiane Mothe; Michael Preminger; Eric SanJuan; Ralf Schenkel; Mette Skov; Xavier Tannier; David Walsh
INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2014 evaluation campaign, which consisted of three tracks. The Interactive Social Book Search Track investigated user information seeking behavior when interacting with various sources of information, for realistic task scenarios, and how the user interface impacts search and the search experience. The Social Book Search Track investigated the relative value of authoritative metadata and user-generated content for search and recommendation, using a test collection with data from Amazon and LibraryThing, including user profiles and personal catalogues. The Tweet Contextualization Track investigated helping a user understand a tweet by providing a short background summary generated from relevant Wikipedia passages aggregated into a coherent whole. INEX 2014 was an exciting year for INEX, in which we ran our workshop as part of the CLEF labs for the third time. This paper gives an overview of all the INEX 2014 tracks, their aims and tasks, the test collections built, and the participants, and provides an initial analysis of the results.
LifeCLEF 2014: Multimedia Life Species Identification Challenges BIBAFull-Text 229-249
  Alexis Joly; Hervé Goëau; Hervé Glotin; Concetto Spampinato; Pierre Bonnet; Willem-Pier Vellinga; Robert Planque; Andreas Rauber; Robert Fisher; Henning Müller
Multimedia identification tools are considered one of the most promising solutions for helping to bridge the taxonomic gap and for building accurate knowledge of the identity, geographic distribution, and evolution of living species. Large and structured communities of nature observers (e.g. eBird, Xeno-canto, Tela Botanica) as well as large monitoring equipment have started to produce outstanding collections of multimedia records. Unfortunately, the performance of state-of-the-art analysis techniques on such data is still not well understood and is far from meeting real-world requirements. The LifeCLEF lab evaluates these challenges through three tasks related to multimedia information retrieval and fine-grained classification problems in three living worlds. Each task is based on large, real-world data, and the measured challenges are defined in collaboration with biologists and environmental stakeholders to reflect realistic usage scenarios. This paper presents the 2014 edition of LifeCLEF, i.e. the pilot edition. For each of the three tasks, we report the methodology and datasets as well as the official results and the main outcomes.
Benchmarking News Recommendations in a Living Lab BIBAFull-Text 250-267
  Frank Hopfgartner; Benjamin Kille; Andreas Lommatzsch; Till Plumbaum; Torben Brodt; Tobias Heintz
Most user-centric studies of information access systems in the literature suffer from unrealistic settings or limited numbers of participating users. To address this issue, the idea of a living lab has been promoted. Living labs allow us to evaluate research hypotheses with a large number of users who satisfy their information needs in a real context. In this paper, we introduce a living lab on news recommendation in real time. The living lab was first organized as the News Recommendation Challenge at ACM RecSys'13 and then as the campaign-style evaluation lab NEWSREEL at CLEF'14. Within this lab, researchers were asked to provide news article recommendations to millions of users in real time. Unlike the participants of laboratory user studies, these users follow their own agenda; consequently, laboratory bias on their behavior can be neglected. We outline the living lab scenario and the experimental setup of the two benchmarking events. We argue that this living lab can serve as a reference point for the implementation of living labs for the evaluation of information access systems.
Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling BIBAFull-Text 268-299
  Martin Potthast; Tim Gollub; Francisco Rangel; Paolo Rosso; Efstathios Stamatatos; Benno Stein
This paper reports on the PAN 2014 evaluation lab, which hosts three shared tasks on plagiarism detection, author identification, and author profiling. To improve the reproducibility of shared tasks in general, and PAN's tasks in particular, the Webis group developed a new web service called TIRA, which facilitates software submissions. Unlike many other labs, PAN asks participants to submit running software instead of its run output. To deal with the organizational overhead involved in handling software submissions, the TIRA experimentation platform significantly reduces the workload for both participants and organizers, while keeping the submitted software in a running state. This year, we addressed the matter of responsibility for the successful execution of submitted software, putting participants back in charge of executing their software at our site. In sum, 57 pieces of software were submitted to our lab; together with the 58 software submissions of last year, this forms the largest collection of software for our three tasks to date, all of which is readily available for further analysis. The report concludes with a brief summary of each task.
Overview of CLEF Question Answering Track 2014 BIBAFull-Text 300-306
  Anselmo Peñas; Christina Unger; Axel-Cyrille Ngonga Ngomo
This paper describes the CLEF QA Track 2014. In the current general scenario for the CLEF QA Track, the starting point is always a natural language question. However, answering some questions may require querying Linked Data (especially if aggregations or logical inferences are required), some may require textual inference and querying free text, and some may require both sources of information. The track was divided into three tasks: QALD focused on translating natural language questions into SPARQL queries; BioASQ focused on the biomedical domain; and Entrance Exams focused on answering questions to assess machine reading capabilities.
Overview of RepLab 2014: Author Profiling and Reputation Dimensions for Online Reputation Management BIBAKFull-Text 307-322
  Enrique Amigó; Jorge Carrillo-de-Albornoz; Irina Chugur; Adolfo Corujo; Julio Gonzalo; Edgar Meij; Maarten de Rijke; Damiano Spina
This paper describes the organisation and results of RepLab 2014, the third competitive evaluation campaign for Online Reputation Management systems. This year the focus lay on two new tasks, reputation dimensions classification and author profiling, which complement the aspects of reputation analysis studied in the previous campaigns. Participants were asked (1) to classify tweets according to a standard typology of reputation dimensions and (2) to categorise Twitter profiles by type of author and rank them according to their influence. New data collections were provided for the development and evaluation of the systems that participated in this benchmarking activity.
Keywords: RepLab; Reputation Management; Evaluation Methodologies and Metrics; Test Collections; Reputation Dimensions; Author Profiling; Twitter