ECDL 2003: Proceedings of the European Conference on Digital Libraries

Fullname:ECDL 2003: Research and Advanced Technology for Digital Libraries: 7th European Conference
Editors:Traugott Koch; Ingeborg Torvik Sølvberg
Location:Trondheim, Norway
Dates:2003-Aug-17 to 2003-Aug-22
Publisher:Springer Berlin Heidelberg
Series:Lecture Notes in Computer Science 2769
Standard No:DOI: 10.1007/b11967; ISBN: 978-3-540-40726-3 (print), 978-3-540-45175-4 (online); hcibib: ECDL03
  1. Uses, Users, and User Interaction
  2. Metadata Applications
  3. Annotation and Recommendation
  4. Automatic Classification and Indexing
  5. Web Technologies
  6. Topical Crawling, Subject Gateways
  7. Architectures and Systems
  8. Knowledge Organization: Concepts
  9. Collection Building and Management
  10. Knowledge Organization: Authorities and Works
  11. Information Retrieval in Different Application Areas
  12. Digital Preservation
  13. Indexing and Searching of Special Document and Collection Information

Uses, Users, and User Interaction

Users and Uses of Online Digital Libraries in France 1-12
  Houssem Assadi; Thomas Beauvisage; Catherine Lupovici; Thierry Cloarec
This article presents a study of online digital library (DL) uses, based on three data sources (online questionnaire, Internet traffic data and interviews). We show that DL users differ from average Internet users as well as from classical library users, and that their practices involve particular contexts, including personal research and bibliophilia. These results lead us to reconsider the status of online documents, as well as the relationship between commercial and non-commercial Web sites. Digital libraries, far from being simple digital versions of library holdings, are now attracting a new type of public, bringing about new, unique and original ways of reading and understanding texts. They represent a new arena for reading and consultation of works alongside that of traditional libraries.
In Search for Patterns of User Interaction for Digital Libraries 13-23
  Jela Steinerová
The paper provides preliminary results from a major study of academic and research library users in Slovakia. The study is part of a larger research project on the interaction between humans and the information environment. The goal of the research is to identify patterns of interaction of individuals and groups with information resources, and to derive models and information styles. The methodological model for the questionnaire survey of users is described. The first results confirm the need to support user strategies, collaboration, different stages of information seeking and knowledge states, closer links with learning and problem solving, easy and flexible access, human creative processes of analysis, synthesis, interpretation, and the need to develop new knowledge organization structures.
Detecting Research Trends in Digital Library Readership 24-28
  Johan Bollen; Richard Luce; Somasekhar Vemulapalli; Weining Xu
The research interests and preferences of the reader communities associated with any given digital library may change over the course of years. It is vital for digital library services and collection management to be informed of such changes, and to determine how they may point to future trends. We propose the Impact Discrepancy Ratio metric for the detection of research trends in a large digital library by comparing a reader-defined metric of journal impact to the Institute for Scientific Information Impact Factor (ISI IF) over the course of three years. An analysis for the Los Alamos National Laboratory (LANL) Research Library (RL) comparing reader impact to the ISI IF for 1998 and 2001 indicates journals relating to climatology have undergone a sharp increase in local impact. This evolution pinpoints specific shifts in the local strategies and reader interests of the LANL RL which were qualitatively validated by LANL RL management.
Evaluating the Changes in Knowledge and Attitudes of Digital Library Users 29-40
  Gemma Madle; Patty Kostkova; Jane Mani-Saada; Julius R. Weinberg
Medical digital libraries are essentially life-critical applications providing timely access for professionals and the public to current medical knowledge and practice. This paper presents a new methodology for evaluating the impact of the knowledge within a medical digital library on users by testing their knowledge improvements and attitude changes. Using pre- and post-use questionnaires, we tested the impact of a small medical information website acting as an interface to the National electronic Library for Communicable Disease. The changes in user attitudes and the correlation with knowledge improvements observed indicate the potential for this methodology to be applied as a general evaluation technique of digital libraries and the impact of online information on user learning.

Metadata Applications

Towards a Role-Based Metadata Scheme for Educational Digital Libraries: A Case Study in Singapore 41-51
  Dian Melati Md Ismail; Ming Yin; Yin Leng Theng; Dion Hoe-Lian Goh; Ee-Peng Lim
In this paper, we describe the development of an appropriate metadata scheme for GeogDL, a Web-based digital library application containing past-year examination resources for students taking a Singapore national examination in geography. The new metadata scheme was developed from established metadata schemes on education and e-learning. Initial evaluation showed that a role-based approach would be more viable, adapting to the different roles of teachers/educators and librarians contributing geography resources to GeogDL. The paper concludes with concrete implementation of the role-based metadata schema for GeogDL.
Incorporating Educational Vocabulary in Learning Object Metadata Schemes 52-57
  Jian Qin; Carol Jean Godby
Educational metadata schemas are obligated to provide learning-related attributes in learning objects. An examination of current educational metadata standards found that few of them have places for incorporating educational vocabulary. Even within the educational category of metadata standards there is a lack of learning-related vocabulary for characterizing attributes that can help users identify the type of learning, objective, or context. The paper also discusses these problems with examples from a learning object taxonomy compiled by the authors.
Findings from the Mellon Metadata Harvesting Initiative 58-69
  Martin Halbert; Joanne Kaczmarek; Kat Hagedorn
Findings are reported from four projects initiated through funding by the Andrew W. Mellon Foundation in 2001 to explore applications of metadata harvesting using the OAI-PMH. Metadata inconsistencies among providers have been encountered and strategies for normalization have been studied. Additional findings concerning harvesting are format conflicts, harvesting problems, provider system development, and questions regarding the entire cycle of metadata production, dissemination, and use (termed metadata gardening, rather than harvesting).
Semantic Browsing 70-81
  Alexander Faaborg; Carl Lagoze
We have created software applications that allow users to both author and use Semantic Web metadata. To create and use a layer of semantic content on top of the existing Web, we have (1) implemented a user interface that expedites the task of attributing metadata to resources on the Web, and (2) augmented a Web browser to leverage this semantic metadata to provide relevant information and tasks to the user. This project provides a framework for annotating and reorganizing existing files, pages, and sites on the Web that is similar to Vannevar Bush's original concepts of trail blazing and associative indexing.
Metadata Editing by Schema 82-87
  Hussein Suleman
Metadata creation and editing is a reasonably well-understood task which involves creating forms, checking the input data and generating appropriate storage formats. XML has largely become the standard storage representation for metadata records and various automatic mechanisms are becoming popular for validation of these records, including XML Schema and Schematron. However, there is no standard methodology for creating data manipulation mechanisms. This work presents a set of guidelines and extensions to use the XML Schema standard for this purpose. The experiences and issues involved in building such a generalised structured data editor are discussed, to support the notion that metadata editing, and not just validation, should be description-driven.

Annotation and Recommendation

Annotations: Enriching a Digital Library 88-100
  Maristella Agosti; Nicola Ferro
This paper presents the results of a study on the semantics of the concept of annotation. It specifically deals with annotations in the context of digital libraries. In the light of those considerations, general characteristics and features of an annotation service are introduced. The OpenDLib digital library is adopted as a framework of reference for our ongoing research, so the paper presents the annotations extension to the OpenDLib digital library, where the extension regards both the adopted document model and the architecture. The final part of the paper discusses and evaluates whether OpenDLib has the expressive power to represent the presented semantics of annotations.
Identifying Useful Passages in Documents Based on Annotation Patterns 101-112
  Frank M. Shipman, III; Morgan N. Price; Catherine C. Marshall; Gene Golovchinsky
Many readers annotate passages that are important to their work. If we understand the relationship between the types of marks on a passage and the passage's ultimate utility in a task, then we can design e-book software to facilitate access to the most important annotated parts of the documents. To investigate this hypothesis and to guide software design, we have analyzed annotations collected during an earlier study of law students reading printed case law and writing Moot Court briefs. This study has allowed us to characterize the relationship between the students' annotations and the citations they use in their final written briefs. We think of annotations that relate directly to the written brief as high-value annotations; these annotations have particular, detectable characteristics. Based on this study we have designed a mark parser that analyzes freeform digital ink to identify such high-value annotations.
Others Also Use: A Robust Recommender System for Scientific Libraries 113-125
  Andreas Geyer-Schulz; Andreas W. Neumann; Anke Thede
Scientific digital library systems are a very promising application area for value-added expert advice services. Such systems could significantly reduce the search and evaluation costs of information products for students and scientists. This holds for pure digital libraries as well as for traditional scientific libraries with online public access catalogs (OPAC). In this contribution we first outline different types of recommendation services for scientific libraries and their general integration strategies. Then we focus on a recommender system based on log file analysis that has been fully operational within the legacy library system of the Universität Karlsruhe (TH) since June 2002. Its underlying mathematical model, the implementation within the OPAC, and the first user evaluation are presented.

Automatic Classification and Indexing

Cross-Lingual Text Categorization 126-139
  Núria Bel; Cornelis H. A. Koster; Marta Villegas
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in the case where a sufficient number of training examples is available for each new language and in the case where no training examples are available for some language.
   Experimental results of the bi-lingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bi-lingual training, terminology translation and profile-based translation.
Automatic Multi-label Subject Indexing in a Multilingual Environment 140-151
  Boris Lauser; Andreas Hotho
This paper presents an approach to automatically subject index full-text documents with multiple labels based on binary support vector machines (SVM). The aim was to test the applicability of SVMs with a real world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for indexing purposes. The test set for our evaluations has been compiled from an extensive document base maintained by the Food and Agriculture Organization (FAO) of the United Nations (UN). Empirical results show that SVMs are a good method for automatic multi-label classification of documents in multiple languages.
Automatic Induction of Rules for Classification and Interpretation of Cultural Heritage Material 152-163
  Stefano Ferilli; Floriana Esposito; Teresa Maria Altomare Basile; Nicola Di Mauro
This work presents the application of incremental symbolic learning strategies for the automatic induction of classification and interpretation rules in the cultural heritage domain. Specifically, such experience was carried out in the environment of the EU project COLLATE, in whose architecture the incremental learning system INTHELEX is used as a learning component. Results are reported, proving that the system was able to learn highly reliable rules for such a complex task.
An Integrated Digital Library Server with OAI and Self-Organizing Capabilities 164-175
  Hyunki Kim; Chee-Yoong Choo; Su-Shing Chen
The Open Archives Initiative (OAI) is an experimental initiative for the interoperability of Digital Libraries (DLs) based on metadata harvesting. The goal of OAI is to develop and promote interoperability solutions to facilitate the efficient dissemination of content. At present, however, there are still several challenging issues such as metadata incorrectness, poor quality of metadata, and metadata inconsistency that have to be solved in order to create a variety of high-quality services. In this paper we propose an integrated DL system with OAI and self-organizing capabilities. The system provides two value-added services, cross-archive searching and interactive concept browsing services, for organizing, exploring, and searching a collection of harvested metadata to satisfy users' information needs. We also propose a multi-layered Self-Organizing Map (SOM) algorithm for building a subject-specific concept hierarchy using two input vector sets constructed by indexing the harvested metadata collection. By using the concept hierarchy, we can also automatically classify the harvested metadata collection for the purpose of selective harvesting.

Web Technologies

YAPI: Yet Another Path Index for XML Searching 176-187
  Giuseppe Amato; Franca Debole; Pavel Zezula; Fausto Rabitti
As many metadata are encoded in XML, and many digital libraries need to manage XML documents, efficient techniques for searching in such formatted data are required. In order to efficiently process path expressions with wildcards on XML data, a new path index is proposed. Extensive evaluation confirms better performance with respect to other techniques proposed in the literature. An extension of the proposed technique to deal with the content of XML documents in addition to their structure is also discussed.
Structure-Aware Query for Digital Libraries: Use Cases and Challenges for the Humanities 188-193
  Christopher York; Clifford E. Wulfman; Gregory Crane
Much recent research in database design focuses on persistence models for semistructured data similar to the SGML and XML that humanities digital libraries have long used to encode digital editions of texts. Structure-aware querying promises to simplify the design of such digital repositories by allowing them to store and query texts using a single, unified information model. Using content the Perseus Project has acquired over the past ten years as a test case, we describe the advantages and delimit the problems in managing structure-aware queries over multiple or ambiguous schemas, evaluate the place of markup in digital libraries where much content is automatically generated, and examine the uses for structure-aware query in a system that stores both semistructured content and graph-structured metadata.
Combining DAML+OIL, XSLT, and Probabilistic Logics for Uncertain Schema Mappings in MIND 194-206
  Henrik Nottelmann; Norbert Fuhr
When distributed, heterogeneous digital libraries have to be integrated, one of the crucial tasks is to map between different schemas. As schemas may have different granularities, and as schema attributes do not always match precisely, a general-purpose schema mapping approach requires support for uncertain mappings. In this paper we present one of the very few approaches for defining and using uncertain schema mappings. We combine different technologies like DAML+OIL, probabilistic Datalog (since DAML+OIL -- like similar ontology languages -- lacks rules) and XSLT for actually transforming queries and documents. This declarative approach is fully implemented in the project MIND (which develops methods for retrieval in networked multimedia digital libraries). However, as DAML+OIL lacks some important features, the proposed approach is only a stepping stone towards an integrated solution.
Digitometric Services for Open Archives Environments 207-220
  Tim Brody; Simon Kampa; Stevan Harnad; Les Carr; Steve Hitchcock
We describe "digitometric" services and tools that add value to open-access eprint archives using the Open Archives Initiative (OAI) Protocol for Metadata Harvesting. Celestial is an OAI cache and gateway tool. Citebase Search enhances OAI-harvested metadata with linked references harvested from the full-text to provide a web service for citation navigation and research impact analysis. Digitometrics builds on data harvested using OAI to provide advanced visualisation and hypertext navigation for the research community. Together these services provide a modular, distributed architecture for building a "semantic web" for the research literature.

Topical Crawling, Subject Gateways

Search Engine-Crawler Symbiosis: Adapting to Community Interests 221-232
  Gautam Pant; Shannon Bradshaw; Filippo Menczer
Web crawlers have been used for nearly a decade as a search engine component to create and update large collections of documents. Typically the crawler and the rest of the search engine are not closely integrated. If the purpose of a search engine is to have as large a collection as possible to serve the general Web community, a close integration may not be necessary. However, if the search engine caters to a specific community with shared focused interests, it can take advantage of such an integration. In this paper we investigate a tightly coupled system in which the crawler and the search engine engage in a symbiotic relationship. The crawler feeds the search engine and the search engine in turn helps the crawler to better its performance. We show that the symbiosis can help the system learn about a community's interests and serve such a community with better focus.
Topical Crawling for Business Intelligence 233-244
  Gautam Pant; Filippo Menczer
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage the basic intelligence to facilitate in-depth and up-to-date research. In this paper we investigate the use of topical crawlers in creating a small document collection that helps locate relevant business entities. The problem of locating business entities is encountered when an organization looks for competitors, partners or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and exploiting link contexts based on tag trees for accelerating the crawl and improving the overall results.
SozioNet: Networking Social Science Resources 245-256
  Wolfgang Meier; Natascha Schumann; Sue Heise; Rudi Schmiede
SozioNet forms part of a forthcoming national social science information portal, which is currently being developed by the German Infoconnex initiative. Inspired by successful examples like MathNet or SOSIG, SozioNet provides access to freely available web resources with relevance to social science. It is based on a network of social science institutions and scientists that agree on and establish common metadata standards. SozioNet implements a general infrastructure for the creation of semantically rich metadata, and for the harvesting and retrieval of relevant resources with a domain-specific focus.
VASCODA: A German Scientific Portal for Cross-Searching Distributed Digital Resource Collections 257-262
  Heike Neuroth; Tamara Pianos
The German information science community -- with the support of the two main funding agencies in Germany -- will develop a scientific portal, vascoda, for cross-searching distributed metadata collections. Put simply, one of the services of vascoda is going to be a "Google"-like search for the academic community, an easy-to-use yet sophisticated search engine to supply information on high-quality resources from different media and technical environments. Reaching this objective requires considerable standardisation activity amongst the main players to harmonise the already existing services (e.g. regarding metadata, protocols, etc.). The co-operation amongst the participants, including both of the funding agencies, is creating a unique teamwork situation in Germany, thus strengthening the information science community.

Architectures and Systems

Scenario-Based Generation of Digital Library Services 263-275
  Rohit Kelapure; Marcos André Gonçalves; Edward A. Fox
We describe the development, implementation, and deployment of a new generic digital library generator yielding implementations of digital library services from models of DL "societies" and "scenarios". The distinct aspects of our solution are: 1) approach based on a formal, theoretical framework; 2) use of state-of-the-art database and software engineering techniques such as domain-specific declarative languages, scenario synthesis, componentized and model-driven architectures; 3) analysis centered on scenario-based design and DL societal relationships; 4) automatic transformations and mappings from scenarios to workflow designs and from these to Java implementations; 5) special attention paid to issues of simplicity of implementation, modularity, reusability, and extensibility. We demonstrate the feasibility of the approach through a number of examples.
An Evaluation of Document Prefetching in a Distributed Digital Library 276-287
  Jochen Hollmann; Anders Ardö; Per Stenström
Latency is a fundamental problem for all distributed systems including digital libraries. To reduce user-perceived delays both caching -- keeping accessed objects for future use -- and prefetching -- transferring objects ahead of access time -- can be used. In a previous paper we have reported that caching is not worthwhile for digital libraries due to low re-access frequencies.
   In this paper we evaluate our previous findings that prefetching can be used instead. To do this we have set up an experimental prefetching proxy which is able to retrieve documents from remote fulltext archives before the user demands them. Using a simple prediction to keep the overhead of unnecessarily transferred data limited, we find that it is possible to cut the user-perceived average delay by a factor of two.
An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment 288-299
  Michalis Sfakakis; Sarantos Kapidakis
The lack of information integration in existing online systems for resource sharing in a distributed environment directly affects the development and use of dynamically defined Virtual Union Catalogues. In this work we propose a design approach for the construction of an online system, able to improve the information integration when a Dynamic Resource Collection is used, by taking into account the restrictions imposed by the network environment and the Z39.50 protocol. The main strength of this architecture is the presentation of de-duplicated results to the user, by the gradual application of the duplicate detection process in small received packets (sets of results), as the data packets flow from the participating servers. While it presents results to the user, it also processes a limited amount of data ahead of time, to be ready before the user requests them.

Knowledge Organization: Concepts

The ADEPT Concept-Based Digital Learning Environment 300-312
  Terence R. Smith; Dan Ancona; Olha A. Buchel; Michael Freeston; W. Heller; R. Nottrott; Tim Tierney; Alex Ushakov
We describe the design and application of a Digital Learning Environment (DLE) that is integrated with the collections and services of the Alexandria Digital Library (ADL). This DLE is in operational use in undergraduate teaching environments. Its design and development incorporate the assumption that deep understanding of both scientific phenomena and scientific methods is facilitated when learning materials are explicitly organized, accessed and presented at the level of granularity of appropriate sets of scientific concepts and their interrelationships. The DLE supports services for creating, searching, and displaying: (1) knowledge bases (KBs) of strongly structured models of scientific concepts; (2) DL collections of information objects organized and accessible by the concepts of the KBs; and (3) collections of presentation materials, such as lectures and laboratory materials, that are organized as trajectories through the KB of concepts.
A User Evaluation of Hierarchical Phrase Browsing 313-324
  Katrina D. Edgar; David M. Nichols; Gordon W. Paynter; Kirsten Thomson; Ian H. Witten
Phrase browsing interfaces based on hierarchies of phrases extracted automatically from document collections offer a useful compromise between automatic full-text searching and manually-created subject indexes. The literature contains descriptions of such systems that many find compelling and persuasive. However, evaluation studies have either been anecdotal, or focused on objective measures of the quality of automatically-extracted index terms, or restricted to questions of computational efficiency and feasibility. This paper reports on an empirical, controlled user study that compares hierarchical phrase browsing with full-text searching over a range of information seeking tasks. Users found the results located via phrase browsing to be relevant and useful but preferred keyword searching for certain types of queries. Users' experiences were marred by interface details, including inconsistencies between the phrase browser and the surrounding digital library interface.
Visual Semantic Modeling of Digital Libraries 325-337
  Qinwei Zhu; Marcos André Gonçalves; Rao Shen; Lillian N. Cassel; Edward A. Fox
The current interest from non-experts who wish to build digital libraries (DLs) is strong worldwide. However, since DLs are complex systems, it usually takes considerable time and effort to create and tailor a DL to satisfy specific needs and requirements of target communities/societies. What is needed is a simplified modeling process and rapid generation of DLs. To enable this, DLs can be modeled with descriptive domain-specific languages. A visual tool would be helpful to non-experts so they may model a DL without knowing the theoretical foundations and the syntactic details of the descriptive language. In this paper, we present a domain-specific visual DL modeling tool, 5SGraph. It employs a metamodel that describes DLs using the 5S theory. The output from 5SGraph is a DL model that is an instance of the metamodel, expressed in the 5S description language. Furthermore, 5SGraph maintains semantic constraints specified by the 5S metamodel and enforces these constraints over the instance model to ensure semantic consistency and correctness. 5SGraph enables component reuse to reduce the time and effort of designers. 5SGraph also is designed to accommodate and integrate several other complementary tools reflecting the interdisciplinary nature of DLs. Thus, tools based on concept maps to fulfill those roles are introduced. The 5SGraph tool has been tested with real users and several modeling tasks in a usability experiment, and its usefulness and learnability have been demonstrated.

Collection Building and Management

Connecting Interface Metaphors to Support Creation of Path-Based Collections 338-349
  Unmil Karadkar; Andruid Kerne; Richard Furuta; Luis Francisco-Revilla; Frank M. Shipman, III; Jin Wang
Walden's Paths is a suite of tools that supports the creation and presentation of linear hypermedia paths -- targeted collections that enable authors to reorganize and contextualize Web-based information for presentation to an audience. Its current tools focus primarily on authoring and presenting paths, but not on the discovery and vetting of the materials that are included in the path. CollageMachine, on the other hand, focuses strongly on the exploration of Web spaces at the granularity of their media elements through presentation as a streaming collage, modified temporally through learning from user behavior. In this paper we present an initial investigation of the differences in expectations, assumptions, and work practices caused by the differing metaphors of browser-based and CollageMachine Web search result representations, and how they affect the process of creating paths.
Managing Change in a Digital Library System with Many Interface Languages 350-361
  David Bainbridge; Katrina D. Edgar; John R. McPherson; Ian H. Witten
Managing the organizational and software complexity of a comprehensive open source digital library system presents a significant challenge. The challenge becomes even more imposing when the interface is available in different languages, for enhancements to the software and changes to the interface must be faithfully reflected in each language version. This paper describes the solution adopted by Greenstone, a multilingual digital library system distributed by UNESCO in a trilingual European version (English, French, Spanish), complete with all documentation, and whose interface is available in many further languages. Greenstone incorporates a language translation facility which allows authorized people to update the interface in specified languages. A standard version control system is used to manage software change, and from this the system automatically determines which language fragments need updating and presents them to the human translator.
Keywords: Digital library interfaces; multilingual systems; interface architecture; version control system
A Service for Supporting Virtual Views of Large Heterogeneous Digital Libraries 362-373
  Leonardo Candela; Donatella Castelli; Pasquale Pagano
This paper presents an innovative type of digital library basic architectural service, the Collection Service, that supports the dynamic construction of customized virtual user views of the digital library. These views make transparent to the users the real DL content, services and their physical organization. By realizing the independence between the physical digital library and the digital library perceived by the user, the Collection Service also creates the conditions for service optimization.
   The paper exemplifies this service by showing how it has been instantiated in the CYCLADES and SCHOLNET digital library systems.

Knowledge Organization: Authorities and Works

A Framework for Unified Authority Files: A Case Study of Corporate Body Names in the FAO Catalogue BIBAFull-Text 374-386
  James Weinheimer; Kafkas Caprazli
We present a Unified Authority File for Names for use with the FAO Catalogue. This authority file will include all authorized forms of names, and can be used for highly precise resource discovery, as well as for record sharing. Other approaches to creating unified authority files are discussed. A major advantage of our proposal lies in the ease and sustainability of sharing records across authority files. The public would benefit from the Unified Authority File with its possibilities for cross-collection searching, and metadata creators would also have a greater possibility to utilize bibliographic records from other collections. A case study describes the treatment and use of corporate body names used in the catalogue of The Food and Agriculture Organization of the United Nations.
geoXwalk -- A Gazetteer Server and Service for UK Academia BIBAFull-Text 387-392
  James Reid
This paper will summarise work undertaken on behalf of the UK academic community to evaluate and develop a gazetteer server and service which will underpin geographic searching within the UK distributed academic information network. It will outline the context and problem domain, report on issues investigated and the findings to date. Lastly, it poses some unresolved questions requiring further research and speculates on possible future directions.
Utilizing Temporal Information in Topic Detection and Tracking BIBAFull-Text 393-404
  Juha Makkonen; Helena Ahonen-Myka
The harnessing of time-related information from text for the use of information retrieval requires a leap from the surface forms of the expressions to a formalized time-axis. Often the expressions are used to form chronological sequences of events. However, we want to be able to determine the temporal similarity, i.e., the overlap of temporal references of two documents and use this similarity in Topic Detection and Tracking, for example. We present a methodology for extraction of temporal expressions and a scheme of comparing the temporal evidence of the news documents. We also examine the behavior of the temporal expressions and run experiments on an English news corpus.
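As a rough illustration of temporal similarity as the overlap of two documents' temporal references, the sketch below compares two sets of date intervals. The function names, the inclusive day-based interval representation, and the normalisation by the shorter covered span are assumptions made for illustration; the abstract does not specify the authors' actual scheme.

```python
from datetime import date


def merge(intervals):
    """Merge overlapping (start, end) date intervals; ends are inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_days(intervals):
    """Number of distinct days covered by a set of intervals."""
    return sum((end - start).days + 1 for start, end in merge(intervals))


def overlap_days(a, b):
    """Number of days covered by both interval sets."""
    total = 0
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            start, end = max(s1, s2), min(e1, e2)
            if start <= end:
                total += (end - start).days + 1
    return total


def temporal_similarity(a, b):
    """Shared days divided by the days covered by the shorter document (assumed measure)."""
    shorter = min(covered_days(a), covered_days(b))
    return overlap_days(a, b) / shorter if shorter else 0.0


doc_a = [(date(2003, 8, 17), date(2003, 8, 22))]
doc_b = [(date(2003, 8, 20), date(2003, 8, 25))]
print(temporal_similarity(doc_a, doc_b))
```

The intervals here stand for temporal expressions that have already been resolved onto a calendar; extracting and resolving those expressions from text is the harder step and is not shown.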
Automatic Conversion from MARC to FRBR BIBAFull-Text 405-411
  Christian Mönch; Trond Aalberg
Catalogs have for centuries been the main tool that enabled users to search for items in a library by author, title, or subject. A catalog can be interpreted as a set of bibliographic records, where each record acts as a surrogate for a publication. Every record describes a specific publication and contains the data that is used to create the indexes of search systems and the information that is presented to the user. Bibliographic records are often captured and exchanged by the use of the MARC format. Although there are numerous "dialects" of the MARC format in use, they are usually crafted on the same basis and are interoperable with each other -- to a certain extent. The data model of a MARC-based catalog, however, is "[...] extremely non-normalized with excessive replication of data" [1]. For instance, a literary work that exists in numerous editions and translations is likely to yield a large result set because each edition or translation is represented by an individual record that is unrelated to other records that describe the same work.

Information Retrieval in Different Application Areas

Musescape: A Tool for Changing Music Collections into Libraries BIBAFull-Text 412-421
  George Tzanetakis
Increases in hard disk capacity and audio compression technology have enabled the storage of large collections of music on personal computers and portable devices. As an example, a portable device with 20 Gigabytes of storage can hold up to 4000 songs in compressed audio format. Currently the only way of structuring these collections is using a file system hierarchy which allows very limited forms of searching and retrieval. These limitations are even more pronounced in the case of portable devices where there is less screen real estate and user attention is limited compared to a personal computer.
   Musescape is a prototype tool for organizing and interacting with large music collections in audio format with specific emphasis on portable devices. It provides a variety of automatic and manual ways to organize and interact with large music collections using a consistent continuous audio feedback user interface for browsing, searching and annotating. Using this system a user can convert an unstructured or partially structured collection of music with limited retrieval capabilities into a music library with enhanced functionality.
A Digital GeoLibrary: Integrating Keywords and Place Names BIBAFull-Text 422-433
  Mathew Weaver; Lois M. L. Delcambre; Leonard D. Shapiro; Jason Brewster; Afrem Gutema; Timothy Tolle
A digital library typically includes a set of keywords (or subject terms) for each document in its collection(s). For some applications, including natural resource management, geographic location (e.g., the place of a study or a project) is very important. The metadata for such documents needs to indicate the location(s) associated with a document -- and users need to be able to search for documents by keyword as well as location. We have developed and implemented a digital library that supports -- but does not require -- georeferenceable documents (i.e., documents with reference to geography through the use of a textual place name). Because of their implicit spatial footprint, place names benefit from spatial reasoning and querying (e.g., to find all documents that describe work performed within a five-mile radius of a certain point) in addition to traditional keyword-based search. This paper presents the architecture for a digital library that combines spatial reasoning and selection with traditional (non-spatial) search. The contributions of this work are: (1) the use of a traditional geographic information system (GIS) for spatial processing rather than a specially tailored GIS system or a separate gazetteer and (2) the seamless integration of GIS with our thesaurus-based Metadata++ system, so users can easily take advantage of the strengths of both systems.
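The combination of keyword search with a radius query mentioned in the abstract can be sketched as follows. This is only an illustration: the system described uses a traditional GIS for spatial processing, whereas the sketch substitutes a simple haversine great-circle distance on points. The document structure, function names, and Earth radius constant are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def search(documents, keyword, center, radius_miles):
    """Return ids of documents that carry the keyword and have a place within the radius."""
    lat, lon = center
    return [
        doc["id"]
        for doc in documents
        if keyword in doc["keywords"]
        and any(
            haversine_miles(lat, lon, plat, plon) <= radius_miles
            for plat, plon in doc["places"]
        )
    ]


# Hypothetical documents; each place name is assumed to be already resolved to a point.
documents = [
    {"id": "d1", "keywords": {"salmon"}, "places": [(45.01, -122.0)]},
    {"id": "d2", "keywords": {"salmon"}, "places": [(46.0, -122.0)]},
    {"id": "d3", "keywords": {"fire"}, "places": [(45.0, -122.0)]},
]
print(search(documents, "salmon", (45.0, -122.0), 5.0))
```

A real GIS would handle place names with areal footprints (polygons rather than points), which this point-based sketch does not attempt.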
Document-Centered Collaboration for Scholars in the Humanities -- The COLLATE System BIBAFull-Text 434-445
  Ingo Frommholz; Holger Brocks; Ulrich Thiel; Erich J. Neuhold; Luigi Iannone; Giovanni Semeraro; Margherita Berardi; Michelangelo Ceci
In contrast to electronic document collections we find in contemporary digital libraries, systems applied in the cultural domain have to satisfy specific requirements with respect to data ingest, management, and access. Such systems should also be able to support the collaborative work of domain experts and furthermore offer mechanisms to exploit the value-added information resulting from a collaborative process like scientific discussions. In this paper, we present the solutions to these requirements developed and realized in the COLLATE system, where advanced methods for document classification, content management, and a new kind of context-based retrieval using scientific discourses are applied.

Digital Preservation

DSpace as an Open Archival Information System: Current Status and Future Directions BIBAFull-Text 446-460
  Robert Tansley; Mick Bass; MacKenzie Smith
As more and more output from research institutions is born digital, a means for capturing and preserving the results of this investment is required. To begin to understand and address the problems surrounding this task, Hewlett-Packard Laboratories collaborated with MIT Libraries over two years to develop DSpace, an open source institutional repository software system. This paper describes DSpace in the context of the Open Archival Information System (OAIS) reference model. Particular attention is given to the preservation aspects of DSpace, and the current status of the DSpace system with respect to addressing these aspects. The reasons for various design decisions and trade-offs that were necessary to develop the system in a timely manner are given, and directions for future development are explored. While DSpace is not yet a complete solution to the problem of preserving digital research output, it is a production-capable system, represents a significant step forward, and is an excellent platform for future research and development.
Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives BIBAFull-Text 461-472
  Michael Day
This paper argues that the growing importance of the World Wide Web means that Web sites are key candidates for digital preservation. After a brief outline of some of the main reasons why the preservation of Web sites can be problematic, a review of selected Web archiving initiatives shows that most current initiatives are based on combinations of three main approaches: automatic harvesting, selection and deposit. The paper ends with a discussion of issues relating to collection and access policies, software, costs and preservation.
Implementing Preservation Strategies for Complex Multimedia Objects BIBAFull-Text 473-486
  Jane Hunter; Sharmin Choudhury
Addressing the preservation and long-term access issues for digital resources is one of the key challenges facing informational organisations such as libraries, archives, cultural institutions and government agencies today. A number of major initiatives and projects have been established to investigate or develop strategies for preserving the burgeoning amounts of digital content being produced. To date, the alternative preservation approaches have been based on emulation, migration and metadata -- or some combination of these. Most of the work has focussed on digital objects of a singular media type: text, HTML, images, video or audio, and to date few usable tools have been developed to support or implement such strategies or policies. In this paper we consider the preservation of composite, mixed-media objects, a rapidly growing class of resources. Using three exemplars of new media artwork as case studies, we describe the optimum preservation strategies that we have determined for each exemplar and the software tools that we have developed to support and implement those strategies.

Indexing and Searching of Special Document and Collection Information

Distributed IR for Digital Libraries BIBAFull-Text 487-498
  Ray R. Larson
This paper examines technology developed to support large-scale distributed digital libraries. We describe the method used for harvesting collection information using standard information retrieval protocols and how this information is used in collection ranking and retrieval. The system that we have developed takes a probabilistic approach to distributed information retrieval using a logistic regression algorithm for estimation of distributed collection relevance and fusion techniques to combine multiple sources of evidence. We discuss the harvesting method used and how it can be employed in building collection representatives using features of the Z39.50 protocol. The extracted collection representatives are ranked using a fusion of probabilistic retrieval methods. The effectiveness of our algorithm is compared to other distributed search methods using test collections developed for distributed search evaluation. We also describe how this system is currently being applied to operational systems in the U.K.
Reference Directed Indexing: Redeeming Relevance for Subject Search in Citation Indexes BIBAFull-Text 499-510
  Shannon Bradshaw
Citation indexes are valuable tools for research, in part because they provide a means with which to measure the relative impact of articles in a collection of scientific literature. Recent efforts demonstrate some value in retrieval systems for citation indexes based on measures of impact. However, such approaches use weak measures of relevance, ranking together a few useful documents with many that are frequently cited but irrelevant. We propose an indexing technique that joins measures of relevance and impact in a single retrieval metric. This approach, called Reference Directed Indexing (RDI), is based on a comparison of the terms authors use in reference to documents. Initial retrieval experiments with RDI indicate that it retrieves documents of a quality on par with current ranking metrics, but with significantly improved relevance.
Space-Efficient Support for Temporal Text Indexing in a Document Archive Context BIBAFull-Text 511-522
  Kjetil Nørvåg
Support for temporal text-containment queries (query for all versions of documents that contained one or more particular words at a particular time t) is of interest in a number of contexts, including web archives, on a smaller scale, temporal XML/web warehouses, and temporal document database systems in general. In the V2 temporal document database system we employed a combination of full-text indexes and variants of time indexes to perform efficient text-containment queries. That approach was optimized for moderately large temporal document databases. However, for "extremely large databases" the index space usage of the approach could be too large. In this paper, we present a more space-efficient solution to the problem: the interval-based temporal text index (ITTX). We also present appropriate algorithms for update and retrieval, and we discuss advantages and disadvantages of the V2 and ITTX approaches.
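To make the notion of a temporal text-containment query concrete, the sketch below keeps one posting per word and document with a validity interval, rather than one posting per version. It is a toy illustration of the general interval idea only; the class, its methods, and the integer timestamps are hypothetical and do not reproduce the ITTX structure or its algorithms.

```python
from collections import defaultdict


class IntervalTextIndex:
    """Toy index: each posting records the time interval during which a word was present."""

    def __init__(self):
        # word -> list of [doc_id, start_time, end_time]; end_time None means still current
        self.postings = defaultdict(list)
        # doc_id -> set of words in the current version
        self.current = {}

    def update(self, doc_id, text, t):
        """Store a new version of a document at time t."""
        new_words = set(text.lower().split())
        old_words = self.current.get(doc_id, set())
        # Close the interval for words that disappeared in this version.
        for word in old_words - new_words:
            for posting in self.postings[word]:
                if posting[0] == doc_id and posting[2] is None:
                    posting[2] = t
        # Open a new interval for words that appeared in this version.
        for word in new_words - old_words:
            self.postings[word].append([doc_id, t, None])
        self.current[doc_id] = new_words

    def contains_at(self, word, t):
        """Return ids of documents that contained the word at time t."""
        return sorted(
            {
                doc_id
                for doc_id, start, end in self.postings.get(word.lower(), [])
                if start <= t and (end is None or t < end)
            }
        )


index = IntervalTextIndex()
index.update("d1", "digital library archive", 1)
index.update("d2", "web archive", 3)
index.update("d1", "digital library", 5)
print(index.contains_at("archive", 2))
print(index.contains_at("archive", 6))
```

Words that survive unchanged across versions keep a single open posting, which is where the space saving of an interval-based layout comes from in this simplified picture. The query here returns document ids; mapping a time back to a specific stored version would need a separate version table.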
Clustering Top-Ranking Sentences for Information Access BIBAFull-Text 523-528
  Anastasios Tombros; Joemon M. Jose; Ian Ruthven
In this paper we propose the clustering of top-ranking sentences (TRS) for effective information access. Top-ranking sentences are selected by a query-biased sentence extraction model. By clustering such sentences, we aim to generate and present to users a personalised information space. We outline our approach in detail and we describe how we plan to utilise user interaction with this space for effective information access. We present an initial evaluation of TRS clustering by comparing its effectiveness at providing access to useful information to that of document clustering.