JCDL'12: Proceedings of the 2012 Joint International Conference on Digital Libraries

Fullname:Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
Note:#preserving #linking #using #sharing
Editors:Karim B. Boughida; Barrie Howard; Michael L. Nelson; Herbert Van de Sompel; Ingeborg Sølvberg
Location:Washington, DC
Dates:2012-Jun-10 to 2012-Jun-14
Standard No:ISBN: 978-1-4503-1154-0; hcibib: DL12
Summary:It is our great pleasure to welcome you to Washington, D.C., for the 12th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2012). JCDL is the premier international conference focused on digital libraries, and associated organizational, practical, social, and technical issues. The event addresses a broad spectrum of topical areas, and is open to all emerging and established educators, industry leaders, researchers, and students working in the field of digital library research and development. This year's conference is hosted by The George Washington University and co-organized with The Library of Congress.
    We have chosen a number of keynotes to amplify our conference themes of #preserving #linking #using #sharing. The opening keynote will be given by Jason Scott of textfiles.com. On the second day the keynote will be delivered by Carole Goble of the University of Manchester, and on the third day a closing keynote will be presented by George Dyson, a frequent contributor to The Edge Foundation.
    This year, we received 201 submissions (with authors from 32 countries) that went through a rigorous review process with Program Committee members from 22 countries. The result is a program that reflects the high quality of research being conducted in the breadth of disciplines that comprise digital libraries. We accepted 25 of 91 full paper submissions (27%) and 22 of 70 short papers (34%). In addition, there will be 43 posters and 9 demos following the now legendary "minute madness" on the first day of the conference. The Vannevar Bush Best Paper Award and the Best Student Paper Award will be presented at the conference banquet.
    JCDL 2012 continues the tradition of supporting digital library developers with four tutorials that cover a range of timely issues including user studies, building digital libraries, and teaching about digital libraries. The program also includes five workshops, which provide a venue for a cross-section of disciplines to explore focused, cutting-edge topics including disciplinary repositories, emergency informatics, and institutional repositories. JCDL 2012 opens with a Doctoral Consortium, a forum where Ph.D. students receive feedback and advice from a committee of internationally recognized researchers and practitioners.
  1. Preservation
  2. Education
  3. Bibliographic networks
  4. Architecture
  5. Metadata
  6. Data
  7. Named entities
  8. Books and reading
  9. Concepts and topics
  10. Search
  11. Citations
  12. User behavior
  13. Posters
  14. Demonstrations

Preservation


On the institutional archiving of social media (pp. 1-10)
  Catherine C. Marshall; Frank M. Shipman
Social media records the thoughts and activities of countless cultures and subcultures around the globe. Yet institutional efforts to archive social media content remain controversial. We report on 988 responses across six surveys of social media users that included questions to explore this controversy. The quantitative and qualitative results show that the way people think about the issue depends on how personal and ephemeral they view the content to be. They use concepts such as creator privacy, content characteristics, technological capabilities, perceived legal rights, and intrinsic social good to reason about the boundaries of institutional social media archiving efforts.
To envisage and design the transition from a digital archive system developed for domain experts to one for non-domain users (pp. 11-14)
  Maristella Agosti; Nicola Orio
Diverse digital resources are commonly used by different types of users. It is common practice to develop such applications with a set of requirements for a specific target category of users in mind. We envisaged and designed the IPSA archive and system using a similar approach: the identification of a set of requirements for researchers in illuminated manuscripts as a target group of domain professional users. The IPSA system has been in use as a research tool by domain professionals. The consideration that the content of the archive managed by the IPSA system could be of interest to other types of users suggested reconsidering its approach to envisage a new system designed around the same archive of illuminated manuscripts for their access by diverse categories of users. The paper reports on the work that was conducted to re-design and re-engineer the system to match requirements and expectations of non-domain users.
Visualizing digital collections at archive-it (pp. 15-18)
  Kalpesh Padia; Yasmin AlNoamany; Michele C. Weigle
Archive-It, a subscription service from the Internet Archive, allows users to create, maintain and view digital collections of web resources. The current interface of Archive-It is largely text-based, supporting drill-down navigation using lists of URIs. To provide an overview of each collection and highlight the collection's underlying characteristics, we present four alternate visualizations (image plot with histogram, wordle, bubble chart and timeline). The sites in an Archive-It collection may be organized by the collection curator into groups for easier navigation. However, many collections do not have such groupings, making them difficult to explore. We introduce a heuristics-based categorization for such collections.
Data, data use, and scientific inquiry: two case studies of data practices (pp. 19-22)
  Laura A. Wynholds; Jillian C. Wallis; Christine L. Borgman; Ashley Sands; Sharon Traweek
Data are proliferating far faster than they can be captured, managed, or stored. What types of data are most likely to be used and reused, by whom, and for what purposes? Answers to these questions will inform information policy and the design of digital libraries. We report findings from semi-structured interviews and field observations to investigate characteristics of data use and reuse and how those characteristics vary within and between scientific communities. The two communities studied are researchers at the Center for Embedded Network Sensing (CENS) and users of the Sloan Digital Sky Survey (SDSS) data. The data practices of CENS and SDSS researchers have implications for data curation, system evaluation, and policy. Some data that are important to the conduct of research are not viewed as sufficiently valuable to keep. Other data of great value may not be mentioned or cited, because those data serve only as background to a given investigation. Metrics to assess the value of documents do not map well to data.
Digital preservation and knowledge discovery based on documents from an international health science program (pp. 23-26)
  Dharitri Misra; Robert H. Hall; Susan M. Payne; George R. Thoma
Important biomedical information is often recorded, published or archived in unstructured and semi-structured textual form. Artificial intelligence and knowledge discovery techniques may be applied to large volumes of such data to identify and extract useful metadata, not only for providing access to these documents, but also for conducting analyses and uncovering patterns and trends in a field. The System for Preservation of Electronic Resources (SPER), an information management tool developed at the U.S. National Library of Medicine, provides these capabilities by integrating machine learning, data mining and digital preservation techniques. In this paper, we present an overview of SPER and its ability to retrieve information from one such dataset. We show how SPER was applied to the semi-structured records of an international health science program, the 46-year continuous archive of conference publications and related documents from the Joint Cholera Panel of the U.S.-Japan Cooperative Medical Science Program (CMSP). We explain the techniques by which metadata was extracted automatically from the semi-structured document contents to preserve these publications, and show how such data was used to quantitatively describe the activity of a research community toward a preliminary study of a subset of its specific health science program goals.

Education


Teacher sociality and information diffusion in educational digital libraries (pp. 27-30)
  Ogheneovo Dibie; Keith E. Maull; Tamara Sumner
Understanding the social aspects of digital resource utilization is an area of active research. In this study, we examine the digital library resource utilization and social behaviors of middle and high school Earth Science teachers of a large United States urban school district. We present the results of three analyses based on teachers using an online curriculum planning tool called the Curriculum Customization Service (CCS), and examine the social networks that emerge among the participating teachers. We explore these networks in the context of the digital library resources that were part of the CCS and the use of socio-centric features around those resources. Our initial findings show promise toward developing a broader understanding of the social networks of teachers, their behaviors around and usage of digital library resources, as well as the diffusion of information through those networks.
Is it time to change the OER repositories role? (pp. 31-34)
  Christo Dichev; Darina Dicheva
The growing number of digital libraries providing open educational resources (OER) requires effective resource discovery mechanisms to optimally exploit the benefits of their openness. This paper discusses the role of OER repositories and presents a study aimed at understanding how educators find OER by seeking answers to questions such as: what proportion of users seeking OER go directly to OER repositories, what proportion use search engines or some other means, and why. Understanding how and with what tools users discover and access resources can have an impact on the strategic development of OER repositories.
Identifying core concepts in educational resources (pp. 35-42)
  James M. Foster; Md. Arafat Sultan; Holly Devaul; Ifeyinwa Okoye; Tamara Sumner
This paper describes the results of a study designed to assess human expert ratings of educational concept features for use in automatic core concept extraction systems. Digital library resources provided the content base for human experts to annotate automatically extracted concepts on seven dimensions: coreness, local importance, topic, content, phrasing, structure, and function. The annotated concepts were used as training data to build a machine learning classifier as part of a tool used to predict the core concepts in the document. These predictions were compared with the experts' judgment of concept coreness.
Deduced social networks for an educational digital library (pp. 43-46)
  Monika Akbar; Clifford A. Shaffer; Edward A. Fox
By analyzing the behavior of previous users, digital libraries can be made to provide new users with more support to find the best information. The AlgoViz Portal collects metadata on algorithm visualizations and associated research literature. We show how logs can be used to discover latent relationships between users, deducing an implicit social network. By clustering the log data, we find different page-viewing patterns, which provide practical information about the different groups of users.
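One way to picture how an implicit social network might be deduced from logs is to link users whose page views overlap. The following is a minimal sketch under stated assumptions, not the method used for the AlgoViz Portal: the `(user, page)` log format, the function name `deduce_network`, and the `min_shared` cutoff are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def deduce_network(log, min_shared=2):
    """Link two users when they viewed at least `min_shared` pages in common.

    `log` is an iterable of (user, page) pairs; this format is an assumption.
    Returns a dict mapping (user_a, user_b) to the number of shared pages.
    """
    pages_by_user = defaultdict(set)
    for user, page in log:
        pages_by_user[user].add(page)

    edges = {}
    # Compare every pair of users once, in a stable (sorted) order.
    for a, b in combinations(sorted(pages_by_user), 2):
        shared = pages_by_user[a] & pages_by_user[b]
        if len(shared) >= min_shared:
            edges[(a, b)] = len(shared)
    return edges


# Hypothetical log entries for illustration only.
log = [
    ("u1", "/sorting"), ("u1", "/heaps"), ("u1", "/graphs"),
    ("u2", "/sorting"), ("u2", "/heaps"),
    ("u3", "/graphs"),
]
network = deduce_network(log)
```

Here `u1` and `u2` are linked because they share two pages, while `u3` stays unconnected. The edge weights could then feed a clustering step, which this sketch leaves out.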
A tale of two studies: is dissemination working? (pp. 47-50)
  Flora McMartin; Sarah Giersch; Joseph Tront; Wesley Shumar
In this paper we describe preliminary results from two ongoing research projects that investigate the dissemination practices surrounding digital STEM learning materials for undergraduates. This research consists of two related studies: 1) survey research about the dissemination practices of NSF-funded PIs; and, 2) a case study on the dissemination practices of courseware developers who won the Premier Award for Excellence in Engineering Education. The vast majority of PIs reported in the survey that they do not take advantage of digital dissemination methods such as education digital libraries. Premier Award-winning innovators reported using multiple dissemination methods -- traditional and digital. Recommendations are provided regarding how digital library developers might work with PIs to improve dissemination.

Bibliographic networks

To better stand on the shoulder of giants (pp. 51-60)
  Rui Yan; Congrui Huang; Jie Tang; Yan Zhang; Xiaoming Li
Usually scientists breed research ideas inspired by previous publications, but they are unlikely to follow all publications in the unbounded literature collection. The volume of literature keeps on expanding extremely fast, whilst not all papers contribute equal impact to the academic society. Being aware of potentially influential literature would put one in an advanced position in choosing important research references. Hence, estimation of potential influence is of great significance. We study a challenging problem of identifying potentially influential literature. We examine a set of hypotheses about the fundamental characteristics of highly cited papers and find some interesting patterns. Based on these observations, we learn to identify potentially influential literature via Future Influence Prediction (FIP), which aims to estimate the future influence of literature. The system takes a series of features of a particular publication as input and produces as output the estimated citation counts of that article after a given time period. We consider several regression models to formulate the learning process and evaluate their performance based on the coefficient of determination (R^2). Experimental results on a large real-world data set show a mean average predictive performance of 83.6% measured in R^2. We apply the learned model to the application of bibliography recommendation and obtain prominent performance improvement in terms of Mean Average Precision (MAP).
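For readers unfamiliar with the two evaluation measures named in this abstract, the sketch below shows the standard definitions of the coefficient of determination and of average precision (MAP is its mean over queries). The function names and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """Mean of average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Made-up citation counts and model estimates, for illustration only.
citations = [10, 20, 30, 40]
estimates = [12, 18, 33, 37]
score = r_squared(citations, estimates)

# Made-up recommendation list with two relevant references.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

An R^2 of 1.0 means the estimates match the observed counts exactly, and values near 0 mean the model does no better than predicting the mean.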
BibRank: a language-based model for co-ranking entities in bibliographic networks (pp. 61-70)
  Laure Soulier; Lamjed Ben Jabeur; Lynda Tamine; Wahiba Bahsoun
Bibliographic documents are basically associated with many entities including authors, venues, affiliations, etc. While bibliographic search engines addressed mainly relevant document ranking according to a query topic, ranking other related relevant bibliographic entities is still challenging. Indeed, document relevance is the primary level that allows inferring the relevance of the other entities regardless of the query topic. In this paper, we propose a novel integrated ranking model, called BibRank, that aims at ranking both document and author entities in bibliographic networks. The underlying algorithm propagates entity scores through the network by means of citation and authorship links. Moreover, we propose to weight these relationships using content-based indicators that estimate the topical relatedness between entities. In particular, we estimate the common similarity between homogeneous entities by analyzing marginal citations. We also compare document and author language models in order to evaluate the level of author's knowledge on the document topic and the document representativeness of author's knowledge. Experiment results on the representative CiteSeerX dataset show that BibRank model outperforms baseline ranking models with a significant improvement.
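The idea of propagating entity scores through citation and authorship links can be sketched with a plain PageRank-style iteration over citations, followed by sharing each document's score among its authors. This is a simplified stand-in rather than the BibRank algorithm: it omits the content-based link weights and language models the abstract describes, and the function name, parameters, and sample data are hypothetical.

```python
def propagate_scores(citations, authorship, damping=0.85, iterations=50):
    """Rank documents by citation links, then pass scores on to authors.

    citations:  dict mapping a citing document to the documents it cites.
    authorship: dict mapping a document to its (non-empty) list of authors.
    """
    docs = sorted(
        set(authorship)
        | set(citations)
        | {d for targets in citations.values() for d in targets}
    )
    n = len(docs)
    score = {d: 1.0 / n for d in docs}

    for _ in range(iterations):
        new = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = citations.get(d, [])
            if targets:
                # A document splits its score evenly among the papers it cites.
                share = damping * score[d] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A document citing nothing spreads its score over all documents.
                share = damping * score[d] / n
                for t in docs:
                    new[t] += share
        score = new

    author_score = {}
    for d, authors in authorship.items():
        for a in authors:
            # Each author receives an equal share of the document's score.
            author_score[a] = author_score.get(a, 0.0) + score[d] / len(authors)
    return score, author_score


# Hypothetical toy network: p3 cites p1 and p2, and p2 cites p1.
citations = {"p2": ["p1"], "p3": ["p1", "p2"]}
authorship = {"p1": ["alice"], "p2": ["alice", "bob"], "p3": ["carol"]}
doc_score, author_score = propagate_scores(citations, authorship)
```

In this toy network `p1` ends up with the highest document score and `alice`, who wrote `p1` and co-wrote `p2`, with the highest author score.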
Modeling and exploiting heterogeneous bibliographic networks for expertise ranking (pp. 71-80)
  Hongbo Deng; Jiawei Han; Michael R. Lyu; Irwin King
Recently expertise retrieval has received increasing interest in both academia and industry. Finding experts with demonstrated expertise for a given query is a nontrivial task, especially from large-scale Web 2.0 systems, such as question answering and bibliography data, where users are actively publishing useful content online, interacting with each other, and forming social networks in various ways, leading to heterogeneous networks in addition to the large amounts of textual content information. Many approaches have been proposed and shown to be useful for expertise ranking. However, most of these methods only consider the textual documents while ignoring heterogeneous network structures or can merely integrate with one additional kind of information. None of them can fully exploit the characteristics of heterogeneous networks. In this paper, we propose a joint regularization framework to enhance expertise retrieval by modeling heterogeneous networks as regularization constraints on top of a document-centric model. We argue that multi-typed linking edges reveal valuable information which should be treated differently. Motivated by this intuition, we formulate three hypotheses to capture unique characteristics for different graphs, and mathematically model those hypotheses jointly with the document and other information. To illustrate our methodology, we apply the framework to expert finding applications using a bibliography dataset with 1.1 million papers and 0.7 million authors. The experimental results show that our proposed approach can achieve significantly better results than the baseline and other enhanced models.

Architecture


Live television in a digital library (pp. 81-90)
  Maxime Roüast; David Bainbridge
The number of channels of digital television is increasing, particularly the number that are free-to-air. However, due to the nature of broadcasting, this morass of information is not, for the main part, organized -- it is principally a succession of images and sound transmitted as multiplexed streams of data. Compare this deluge that terrestrially bombards our homes with the information available in the digital libraries we access over the Internet -- stored using software purpose built to help organize carefully curated sets of documents. This project brings together these two seemingly incompatible concepts to develop a software environment that concurrently captures all the available live television channels -- so a user does not need to proactively choose what to record -- and segments them into files which are then imported into a digital video library with a user interface designed to work from a multimedia remote control. A shifting time-based "window" of all recordings is maintained -- we settled on the last two weeks so as to be practicably operable on a regular desktop PC. The system leverages the information contained in the electronic program guide and the video recordings to generate metadata suitable for the digital library. A user evaluation of the developed prototype showed a high level of participant satisfaction across a range of attributes, notably date-based searching.
Transforming Japanese archives into accessible digital books (pp. 91-100)
  Tatsuya Ishihara; Toshinari Itoko; Daisuke Sato; Asaf Tzadok; Hironobu Takagi
Digitized physical books offer access to tremendous amounts of knowledge, even for people with print-related disabilities. Various projects and standard activities are underway to make all of our past and present books accessible. However digitizing books requires extensive human efforts such as correcting the results of OCR (optical character recognition) and adding structural information such as headings. Some Asian languages need extra efforts for the OCR errors because of their many and varied character sets. Japanese has used more than 10,000 characters compared with a few hundred in English. This heavy workload is inhibiting the creation of accessible digital books. To facilitate digitization, we are developing a new system for processing physical books. We reduce and disperse the human efforts and accelerate conversions by combining automatic inference and human capabilities. Our system preserves the original page images for the entire digitization process to support gradual refinement and distributes the work as micro-tasks. We conducted trials with the Japanese National Diet Library (NDL) to evaluate the required effort for digitizing books with a variety of layouts and years of publication. The results showed old Japanese books had specific problems when correcting the OCR errors and adding structures. Drawing on our results, we discuss further workload reductions and future directions for international digitization systems.
IPKB: a digital library for invertebrate paleontology (pp. 101-110)
  Yuanliang Meng; Junyan Li; Patrick Denton; Yuxin Chen; Bo Luo; Paul Selden; Xue-wen Chen
In this paper, we present the Invertebrate Paleontology Knowledgebase (IPKB), an effort to digitize and share the Treatise on Invertebrate Paleontology. The Treatise is the most authoritative compilation of invertebrate fossil records. Unfortunately, the PDF version is simply a clone of paper publications and the content is in no way organized to facilitate search and knowledge discovery. We extracted texts and images from the Treatise, stored them in a database, and built a system for efficient browsing and searching. For image processing in particular, we segmented fossil photos from figures, recognized the embedded labels, and linked the images to the corresponding data entries. The detailed information of each genus, including fossil images, is delivered to users through a web access module. Some external applications (e.g. Google Earth) are accessed through web service APIs to improve user experience. Given the rich information in the Treatise, analyzing, modeling and understanding paleontological data are significant in many areas, such as understanding evolution, understanding climate change, and finding fossil fuels. IPKB builds a general framework that aims to facilitate knowledge discovery activities in invertebrate paleontology, and provides a solid foundation for future explorations. In this article, we report our initial accomplishments. The specific techniques we employed in the project, such as those involved in text parsing, image-label association and metadata extraction, can be insightful and serve as examples for other researchers.

Metadata


Descriptive metadata, iconclass, and digitized emblem literature (pp. 111-120)
  Timothy W. Cole; Myung-Ja K. Han; Jordan A. Vannoy
Early Modern emblems combined text and image. Though there were many variants, the archetypical emblem literary form (mid-sixteenth through mid-eighteenth centuries) consisted of an image (the pictura), a text inscription (the inscriptio), and a text epigram (the subscriptio), the last usually in verse. Digitized emblem literature poses interesting challenges as regards content and metadata granularity, the use of interdisciplinary controlled vocabularies, and the need to present digitized primary sources in a complex network of associated sources, derivatives, and contemporaneous context. In this paper, we describe a digital library Web application designed to better support the ways emblem scholars search for and use digitized emblem books, focusing on metadata design, issues of resource granularity and identification, and the use of Linked Data Web services for Iconclass, a multilingual classification system for cultural heritage art and images. Outcomes to date, achieved by emblem scholars and librarians working in collaboration, provide a case study for multi-faceted, interactive approaches to curating mixed text-image digital resources and the use of Linked Data vocabulary services. Lessons learned highlight the value of librarian-scholar collaboration and help to illustrate why digital libraries need to move beyond merely disseminating digitized book surrogates.
Categorization of computing education resources with utilization of crowdsourcing (pp. 121-124)
  Yinlin Chen; Paul Logasa Bogen II; Haowei Hsieh; Edward A. Fox; Lillian N. Cassel
The Ensemble Portal harvests resources from multiple heterogeneous federated collections. Managing these dynamically increasing collections requires an automatic mechanism to categorize records into corresponding topics. We propose an approach to use existing ACM DL metadata to build classifiers for harvested resources in the Ensemble project. We also present our experience with utilizing the Amazon Mechanical Turk platform to build ground truth training data sets from Ensemble collections.
Re-ranking bibliographic records for personalized library search (pp. 125-128)
  Tadashi Nomoto
This work will introduce a new approach to ranking bibliographic records in library search, which is currently dominated by an OPAC style search paradigm, where results are typically not ranked by relevance. The approach we propose in the paper provides the user with the ability to access bibliographic records in a way responsive to his or her preferences, which is essentially done by looking at a community or a group of people who share interests with the user and making use of their publication records to re-rank search results. The experiment found that the present approach gives a clear edge over conventional search methods.
Generating ground truth for music mood classification using mechanical turk (pp. 129-138)
  Jin Ha Lee; Xiao Hu
Mood is an important access point in music digital libraries and online music repositories, but generating ground truth for evaluating various music mood classification algorithms is a challenging problem. This is because collecting enough human judgments is time-consuming and costly due to the subjectivity of music mood. In this study, we explore the viability of crowdsourcing music mood classification judgments using Amazon Mechanical Turk (MTurk). Specifically, we compare the mood classification judgments collected for the annual Music Information Retrieval Evaluation eXchange (MIREX) with judgments collected using MTurk. Our data show that the overall distribution of mood clusters and agreement rates from MIREX and MTurk were comparable. However, Turkers tended to agree less with the pre-labeled mood clusters than MIREX evaluators. The system evaluation results generated using both sets of data were mostly the same except for detecting one statistically significant pair using Friedman's test. We conclude that MTurk can potentially serve as a viable alternative for ground truth collection, with some reservation with regards to particular mood clusters.

Data


Content-based layouts for exploratory metadata search in scientific research data (pp. 139-148)
  Jürgen Bernard; Tobias Ruppert; Maximilian Scherer; Jörn Kohlhammer; Tobias Schreck
Today's digital libraries (DLs) archive vast amounts of information in the form of text, videos, images, data measurements, etc. User access to DL content can rely on similarity between metadata elements, or similarity between the data itself (content-based similarity). We consider the problem of exploratory search in large DLs of time-oriented data. We propose a novel approach for overview-first exploration of data collections based on user-selected metadata properties. In a 2D layout, entities of the selected property are laid out based on their similarity with respect to the underlying data content. The display is enhanced by compact summarizations of underlying data elements, and forms the basis for exploratory navigation of users in the data space. The approach is proposed as an interface for visual exploration, leading the user to discover interesting relationships between data items relying on content-based similarity between data items and their respective metadata labels. We apply the method on real data sets from the earth observation community, showing its applicability and usefulness.
Refactoring HUBzero for linked data (pp. 149-152)
  Michael Witt; Yongyang Yu
The HUBzero cyberinfrastructure provides a virtual research environment that includes a set of tools for web-based, scientific collaboration and a platform for publishing and using resources such as executable software, source code, images, learning modules, videos, documents, and datasets. Released as open source in 2010, HUBzero has been implemented on a typical LAMP stack (Linux, Apache, MySQL, and PHP) and utilizes the Joomla! content management system. This paper describes the subsequent refactoring of HUBzero to produce and expose Linked Data from its backend relational database, altering the external expression of the data without changing its internal structure. The Open Archives Initiative Object Reuse and Exchange (OAI-ORE) specification is applied to model the basic structural semantics of HUBzero resources as Nested Aggregations, and data and metadata are mapped to vocabularies such as Dublin Core and published within the web representations of the resources using RDFa. Resource Maps can be harvested using an RDF crawler or an OAI-PMH data provider that were bundled for demonstration purposes. A visualization was produced to browse and navigate the relations among data and metadata from an example hub.
Treating data like software: a case for production quality data (pp. 153-156)
  Jennifer M. Schopf
In this short paper, we describe the production data approach to data curation. We argue that by treating data in a similar fashion to how we build production software, data will be more readily accessible and available for broad re-use. We should be treating data as an ongoing process. This includes considering third-party contributions; planning for cyclical releases; bug fixes, tracking, and versioning; and issuing licensing and citation information with each release.
A quantitative evaluation of techniques for detection of abnormal change events in blogs (pp. 157-166)
  Paul L. Bogen; Richard Furuta; Frank Shipman
While most digital collections have limited forms of change -- primarily creation and deletion of additional resources -- there exists a class of digital collections that undergoes additional kinds of change. These collections are made up of resources that are distributed across the Internet and brought together into a collection via hyperlinking. Resources in these collections can be expected to change as time goes on. Part of the difficulty in maintaining these collections is determining if a changed page is still a valid member of the collection. Others have tried to address this problem by measuring change and defining a maximum allowed threshold of change; however, these methods treat all change as a potential problem and treat web content as a static document despite its intrinsically dynamic nature. Instead, we approach the significance of change on the web as a normal part of a web document's life-cycle and determine the difference between what a maintainer expects a page to do and what it actually does. In this work we evaluate the different options for extractors and analyzers in order to determine the best options from a suite of techniques. The evaluation used a human-generated ground-truth set of blog changes. The results of this work showed a statistically significant improvement over a range of traditional threshold techniques when applied to our collection of tagged blog changes.
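The "maximum allowed threshold of change" baseline that this abstract contrasts itself with might look roughly like the sketch below. It is an illustration of that traditional threshold idea only, not the authors' extractor/analyzer approach; the word-level comparison, the 0.5 cutoff, and the sample texts are assumptions.

```python
from difflib import SequenceMatcher


def change_amount(old_text, new_text):
    """Return a change measure in [0, 1]: 0 for identical word sequences."""
    matcher = SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


def exceeds_threshold(old_text, new_text, threshold=0.5):
    """Flag a page version when its measured change exceeds the threshold."""
    return change_amount(old_text, new_text) > threshold


# Hypothetical page versions for illustration only.
old_page = "weekly notes on digital library research"
new_post = "weekly notes on digital library research and archives"
spam_page = "buy cheap watches online today"

flag_new_post = exceeds_threshold(old_page, new_post)
flag_spam = exceeds_threshold(old_page, spam_page)
```

A fixed cutoff like this treats a busy blog's normal new posts and an abnormal takeover the same way once either crosses the line, which is the limitation the abstract points out.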

Named entities

Similar researcher search in academic environments (pp. 167-170)
  Sujatha Das Gollapalli; Prasenjit Mitra; C. Lee Giles
Entity search is an emerging IR and NLP task that involves the retrieval of entities of a specific type in response to a query. We address the "similar researcher search" or the "researcher recommendation" problem, an instance of "similar entity search" for the academic domain. In response to a "researcher name" query, the goal of a researcher recommender system is to output the list of researchers that have similar expertise as that of the queried researcher. We propose models for computing similarity between researchers based on expertise profiles extracted from their publications and academic homepages. We provide results of our models for the recommendation task on two publicly-available datasets. To the best of our knowledge, we are the first to address content-based researcher recommendation in an academic setting and demonstrate it for Computer Science via our system, ScholarSearch.
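A content-based notion of researcher similarity can be illustrated with bag-of-words expertise profiles compared by cosine similarity. This is a generic sketch under stated assumptions, not the models proposed in the paper or the ScholarSearch implementation; the function names, the term-count profiles, and the sample researchers are hypothetical.

```python
import math
from collections import Counter


def profile(texts):
    """Build a term-count expertise profile from a researcher's texts."""
    return Counter(word for text in texts for word in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_researchers(name, profiles, top_k=2):
    """Return up to `top_k` other researchers, most similar first."""
    query = profiles[name]
    ranked = sorted(
        ((cosine(query, p), other) for other, p in profiles.items() if other != name),
        reverse=True,
    )
    return [other for _, other in ranked[:top_k]]


# Hypothetical researchers and publication titles for illustration only.
profiles = {
    "ann": profile(["digital libraries metadata extraction", "metadata quality"]),
    "ben": profile(["metadata extraction from scholarly documents"]),
    "cy": profile(["protein folding simulation"]),
}
recommended = similar_researchers("ann", profiles, top_k=1)
```

Given the query `ann`, the sketch returns `ben`, whose profile shares the terms "metadata" and "extraction", ahead of `cy`, who shares none.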
An analysis of the named entity recognition problem in digital library metadata BIBAFull-Text 171-174
  Nuno Freire; José Borbinha; Pável Calado
Information resources in digital libraries are usually described, along with their context, by structured data records, commonly referred to as metadata. Those records often contain unstructured information in natural language text, since they typically follow a data model which defines generic semantics for its data elements, or includes data elements modeled to contain free text. The information contained in these data elements, although machine readable, resides in unstructured natural language texts that are difficult to process by computers. This paper addresses a particular task of information extraction, typically called named entity recognition, which deals with the references to entities made by names occurring in the texts. This paper presents the results of a study of how the named entity recognition problem manifests itself in digital library metadata. In particular, we present the main differences between performing named entity recognition in natural language and in the text within metadata. The paper concludes with a novel approach for named entity recognition in metadata.
Active associative sampling for author name disambiguation BIBAFull-Text 175-184
  Anderson A. Ferreira; Rodrigo Silva; Marcos André Gonçalves; Adriano Veloso; Alberto H. F. Laender
One of the hardest problems faced by current scholarly digital libraries is author name ambiguity. This problem occurs when, in a set of citation records, there are records of the same author under distinct names, or citation records belonging to distinct authors with similar names. Among the several proposed methods, the most effective ones seem to be based on the direct assignment of the records to their respective authors by means of the application of supervised machine learning techniques. The effectiveness of such methods is usually directly correlated with the amount of supervised training data available. However, the acquisition of training examples requires skilled human annotators to manually label references. Aiming to reduce the set of examples needed to produce the training data, in this paper we propose a new active sampling strategy based on association rules for the author name disambiguation task. We compare our strategy with state-of-the-art supervised baselines that use the complete labeled training dataset and other active methods and show that very competitive results in terms of disambiguation effectiveness can be obtained with reductions in the training set of up to 71%.
AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries BIBAFull-Text 185-194
  Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
Acknowledgments are widely used in scientific articles to express gratitude and credit collaborators. Despite suggestions that indexing acknowledgments automatically will give interesting insights, there is currently, to the best of our knowledge, no such system to track acknowledgments and index them. In this paper we introduce AckSeer, a search engine and a repository for automatically extracted acknowledgments in digital libraries. AckSeer is a fully automated system that scans items in digital libraries, including conference papers, journals, and books, extracting acknowledgment sections and identifying acknowledged entities mentioned within. We describe the architecture of AckSeer and discuss the extraction algorithms that achieve an F1 measure above 83%. We use multiple Named Entity Recognition (NER) tools and propose a method for merging the outcome from different recognizers. The resulting entities are stored in a database and then made searchable by adding them to the AckSeer index along with the metadata of the containing paper/book.
   We build AckSeer on top of the documents in the CiteSeerX digital library, yielding more than 500,000 acknowledgments and more than 4 million mentioned entities.

Books and reading

Learning topics and related passages in books BIBAFull-Text 195-198
  David Newman; Youn Noh; Kat Hagedorn; Arun Balagopalan
The number of books available online is increasing, but user interfaces may not be taking full advantage of advances in machine learning techniques that could help users navigate, explore, discover and understand interesting and useful content in books. Using a group of ten students and over one thousand crowdsourced judgments, we conducted multiple user studies to evaluate topics and related passages in books, all learned by topic modeling. Using ten books, selected from humanities (e.g. Plato's Republic), social sciences (e.g. Marx's Capital) and sciences (e.g. Einstein's Relativity), and four different evaluation experiments, we show that users agree that the learned topics are coherent and important to the book, and related to the automatically generated passages. We show how crowdsourced evaluations are useful, and can complement more focused evaluations using students who have studied the texts. This work provides a framework for (1) learning topics and related passages in books, and (2) evaluating those learned topics and passages, and moves one step toward automatic annotation to support topic navigation of books.
Emphasis on examining results in fiction searches contributes to finding good novels BIBAFull-Text 199-202
  Suvi Oksanen; Pertti Vakkari
We studied how an enriched public library catalogue is used to access novels. 58 users searched for interesting novels to read in a simulated situation where they had only a vague idea of what they would like to read. Data consist of search logs, pre- and post-search questionnaires and observations. Results show that investing effort in examining results improves search success, i.e. finding interesting novels, whereas effort in querying has no bearing on it. In designing systems for fiction retrieval, enriching result presentation with detailed book information would benefit users.
The "City of Lit" digital library: a case study of interdisciplinary research and collaboration BIBAFull-Text 203-212
  Haowei Hsieh; Bridget Draxler; Nicole J. Dudley; Jon Winet
In 2008, Iowa City was designated as one of only five "Cities of Literature" worldwide by UNESCO. To take advantage of our rich local literary history, an interdisciplinary research team from the University of Iowa collaborated to develop a digital library featuring Iowa City authors and locations. The UNESCO City of Literature digital library (referred to internally as "City of Lit") consists of a mobile application for the general public to access the database and a set of web-based interfaces for researchers and content creators to contribute to the database. Members of the research team have developed undergraduate literature courses to study the feasibility of using young scholars for digital content creation, and the pedagogical effect of including digital research in traditional literary courses. Students in the courses were trained to conduct scholarly research and generate a variety of digital resources to be included in the digital collection. This paper reports our experience building the City of Lit digital library and the results from evaluations and studies of the students in the courses. We also outline the implementation and development of the digital library, its framework, and the client-side mobile application.
Student researchers, citizen scholars and the trillion word library BIBAFull-Text 213-222
  Gregory Crane; Bridget Almas; Alison Babeu; Lisa Cerrato; Matthew Harrington; David Bamman; Harry Diakoff
The surviving corpora of Greek and Latin are relatively compact but the shift from books and written objects to digitized texts has already challenged students of these languages to move away from books as organizing metaphors and to ask, instead, what do you do with a billion, or even a trillion, words? We need a new culture of intellectual production in which student researchers and citizen scholars play a central role. And we need as a consequence to reorganize the education that we provide in the humanities, stressing participatory learning, and supporting a virtuous cycle where students contribute data as they learn and learn in order to contribute knowledge. We report on five strategies that we have implemented to further this virtuous cycle: (1) reading environments by which learners can work with languages that they have not studied, (2) feedback for those who choose to internalize knowledge about a particular language, (3) methods whereby those with knowledge of different languages can collaborate to develop interpretations and to produce new annotations, (4) dynamic reading lists that allow learners to assess and to document what they have mastered, and (5) general e-portfolios in which learners can track what they have accomplished and document what they have contributed and learned to the public or to particular groups.

Concepts and topics

Event-centric search and exploration in document collections BIBAFull-Text 223-232
  Jannik Strötgen; Michael Gertz
Textual data ranging from corpora of digitized historic documents to large collections of news feeds provide a rich source for temporal and geographic information. Such types of information have recently gained a lot of interest in support of different search and exploration tasks, e.g., by organizing news along a timeline or placing the origin of documents on a map. However, for this, temporal and geographic information embedded in documents is often considered in isolation. We claim that through combining such information into (chronologically ordered) event-like features interesting and meaningful search and exploration tasks are possible. In this paper, we present a framework for the extraction, exploration, and visualization of event information in document collections. For this, one has to identify and combine temporal and geographic expressions from documents, thus enriching a document collection by a set of normalized events. Traditional search queries then can be enriched by conditions on the events relevant to the search subject. Most important for our event-centric approach is that a search result consists of a sequence of events relevant to the search terms and not just a document hit-list. Such events can originate from different documents and can be further explored, in particular events relevant to a search query can be ordered chronologically. We demonstrate the utility of our framework by different (multilingual) search and exploration scenarios using a Wikipedia corpus.
Dynamic online views of meta-indexes BIBAFull-Text 233-236
  Michael Huggett; Edie Rasmussen
For a collection of digitized monographs in a subject domain, a domain meta-index provides a summary of domain concepts, and a structured vocabulary to support a scholar's navigation and search. We present a prototype of a Meta-index User Interface (MUI) that provides views of a domain at three levels: summarizing and comparing domains, exposing the regularities of a domain's vocabulary, and displaying book information and page content related both to objectively-representative books, and to specific user searches.
Topic models for taxonomies BIBAFull-Text 237-240
  Anton Bakalov; Andrew McCallum; Hanna Wallach; David Mimno
Concept taxonomies such as MeSH, the ACM Computing Classification System, and the NY Times Subject Headings are frequently used to help organize data. They typically consist of a set of concept names organized in a hierarchy. However, these names and structure are often not sufficient to fully capture the intended meaning of a taxonomy node, and particularly non-experts may have difficulty navigating and placing data into the taxonomy. This paper introduces two semi-supervised topic models that automatically augment a given taxonomy with many additional keywords by leveraging a corpus of multi-labeled documents. Our experiments show that users find the topics beneficial for taxonomy interpretation, substantially increasing their cataloging accuracy. Furthermore, the models provide a better information rate compared to Labeled LDA.
Concept chaining utilizing meronyms in text characterization BIBAFull-Text 241-248
  Lori Watrous-deVersterre; Chong Wang; Min Song
For most, the web is the first source to answer a question formulated by curiosity, need, or research reasons. This phenomenon is due to the internet's ubiquitous access, ease of use, and the extensive and ever expanding content. The problem is no longer the need to acquire content to encourage use, but to provide organizational tools to support content categorization that will facilitate improved access methods. This paper presents the results of a new text characterization algorithm that combines semantic and linguistic techniques utilizing domain-based ontology background knowledge. It explores the combination of meronym, synonym, and hypernym linguistic relationships to create a set of concept chains used to represent concepts found in a document. The experiments show improved accuracy over bag-of-words based term weighting methods and reveal characteristics of the meronym contribution to document representation.


Improving multi-faceted book search by incorporating sparse latent semantic analysis of click-through logs BIBAFull-Text 249-258
  Deng Yi; Yin Zhang; Haihan Yu; Yanfei Yin; Jing Pan; Baogang Wei
A multi-faceted book search engine presents diverse category-style options to allow users to refine search results without re-entering a query. In this paper, we propose a novel multi-faceted book search engine that utilizes users' query-related latent intents mined from click-through logs as multiple facets for books. The latent query intents can be effectively and efficiently discovered by applying the Sparse Latent Semantic Analysis (LSA) model to users' query and clicking behaviors in the click-through logs. This paper presents the details of how to improve multi-faceted book search by incorporating the compact representation of query-intent-book relationships generated by Sparse LSA into the off-line and online processing procedures. The specificity of latent query intents can be flexibly changed by adjusting the sparsity level of the projection matrix in the Sparse LSA model. We evaluated our approach on CADAL click-through logs containing 45,892 queries and 164,822 books. The experimental results show that the Sparse LSA model with a sparser projection matrix tends to discover more specific latent query intents. The latent query intents suggested by our approach usually achieve a high user satisfaction ratio.
Personalized query expansion in the QIC system BIBAFull-Text 259-262
  Prat Tanapaisankit; Lori Watrous-deVersterre; Min Song
Query In Context (QIC) is a personalized search system that enhances individual search by incorporating user preferences in query expansion, capturing meanings embedded in documents, and ranking search results with context-enriched features. In this paper, we propose a new technique for QIC's Query Expansion module, which reformulates user queries by using novel statistical-based and knowledge-based query expansion techniques to improve the returned results. The promising preliminary results analyzed through precision and recall metrics show better alignment between the user's interests and the results retrieved.
Investigating keyphrase indexing with text denoising BIBAFull-Text 263-266
  Rushdi Shams; Robert E. Mercer
In this paper, we report on indexing performance by a state-of-the-art keyphrase indexer, Maui, when paired with a text extraction procedure called text denoising. Text denoising is a method that extracts the denoised text, comprising the content-rich sentences, from full texts. The performance of the keyphrase indexer is demonstrated on three standard corpora collected from three domains, namely food and agriculture, high energy physics, and biomedical science. Maui is trained using the full texts and denoised texts. The indexer, using its trained models, then extracts keyphrases from test sets comprising full texts, and their denoised and noise parts (i.e., the part of texts that remains after denoising). Experimental findings show that, against a gold standard, the denoised-text-trained indexer indexing full texts performs either better than or as well as the benchmark set by a full-text-trained indexer indexing full texts.
Exploiting real-time information retrieval in the microblogosphere BIBAFull-Text 267-276
  Feng Liang; Runwei Qiang; Jianwu Yang
Information seeking behavior in microblogging environments such as Twitter differs from traditional web search. The best performing microblog retrieval techniques attempt to utilize both semantic and temporal aspects of documents. In this paper, we present an effective approach, comprising query modeling, document modeling and temporal re-ranking, to discover the most recent but relevant information to the query. For the query modeling, we introduce a two-stage pseudo-relevance feedback query expansion to overcome the severe vocabulary-mismatch problem of short message retrieval in microblogs. For the document modeling, we propose two ways to expand documents with the help of shortened URLs. For the temporal re-ranking, we suggest several methods to evaluate the temporal aspects of documents. Experimental results demonstrate that our approach obtains significant improvements compared with baseline systems. Specifically, the proposed system gives 26.37% and 9.94% further increases in P@30 and MAP over the best performing result on highrel in the TREC'11 Real-Time Search Task.
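For readers unfamiliar with the two measures quoted in this abstract, the sketch below shows the standard textbook definitions of precision at k (P@k, so P@30 is k = 30) and mean average precision (MAP). It is not the authors' evaluation code; the function names and the toy document identifiers are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of average_precision over (ranking, relevant set) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: two queries with made-up document ids.
query_1 = (["d1", "d2", "d3", "d4"], {"d1", "d3"})
query_2 = (["a", "b"], {"b"})
p_at_2 = precision_at_k(*query_1, k=2)
map_score = mean_average_precision([query_1, query_2])
```

P@k rewards relevant documents near the top of a single ranking, while MAP also accounts for where every relevant document appears and averages over queries.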


Improving algorithm search using the algorithm co-citation network BIBAFull-Text 277-280
  Suppawong Tuarob; Prasenjit Mitra; C. Lee Giles
Algorithms are an essential part of computational science. An algorithm search engine, which extracts pseudo-codes and their metadata from documents and makes them searchable, has recently been developed as part of the CiteSeerX suite. However, this algorithm search engine retrieves and ranks relevant algorithms solely on textual similarity. Here, we propose a method for using the algorithm co-citation network to infer the similarity between algorithms. We apply a graph clustering algorithm on the network for algorithm recommendation and make suggestions on how to improve the current CiteSeerX algorithm search engine.
Evaluating and ranking patents using weighted citations BIBAFull-Text 281-284
  Sooyoung Oh; Zhen Lei; Prasenjit Mitra; John Yen
Citation counts have been widely used in a digital library for purposes such as ranking scientific publications and evaluating patents. This paper demonstrates that distinguishing different types of citations could rank better for these purposes. We differentiate patent citations along two dimensions (assignees and technologies) into four types, and propose a weighted citation approach for assessing and ranking patents. We investigate five weight learning methods and compare their performance. Our weighted citation method performs consistently better than simple citation counts, in terms of rank correlations with patent renewal status. The estimated weights on different citations are consistent with economic insights on patent citations. Our study points to an interesting and promising research line on patent citation and network analysis that has not been explored.
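The idea of weighting citations by type can be illustrated with a small sketch. It assumes each citation is tagged along the two dimensions named in the abstract (whether citing and cited patents share an assignee, and whether they share a technology class), giving four types. The weights, patent ids and function names below are hypothetical; the paper learns its weights from data, which this sketch does not attempt.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# A citation type is (same_assignee, same_technology): four combinations in total.
CitationType = Tuple[bool, bool]


def weighted_citation_score(
    citations: Iterable[CitationType], weights: Dict[CitationType, float]
) -> float:
    """Sum of per-type weights over all citations a patent has received."""
    counts = Counter(citations)
    return sum(weights[ctype] * n for ctype, n in counts.items())


def rank_patents(
    patents: Dict[str, List[CitationType]], weights: Dict[CitationType, float]
) -> List[str]:
    """Patent ids ordered from highest to lowest weighted citation score."""
    return sorted(
        patents,
        key=lambda pid: weighted_citation_score(patents[pid], weights),
        reverse=True,
    )


# Hypothetical weights: citations from other assignees and other fields count more.
example_weights = {
    (True, True): 0.5,
    (True, False): 1.0,
    (False, True): 1.0,
    (False, False): 2.0,
}
example_patents = {
    "P1": [(True, True), (True, True), (True, True)],  # three self-citations
    "P2": [(False, False)],                            # one external citation
}
ranking = rank_patents(example_patents, example_weights)
```

With these made-up weights P2 outranks P1 even though P1 has more citations, which is the kind of reordering a simple citation count cannot produce.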
A hybrid two-stage approach for discipline-independent canonical representation extraction from references BIBAFull-Text 285-294
  Sung Hee Park; Roger W. Ehrich; Edward A. Fox
In education and research, references play a key role. However, extracting and parsing references are difficult problems. One concern is that there are many styles of references; hence, given a surface form, identifying what style was employed is problematic, especially in heterogeneous collections of theses and dissertations, which cover many fields and disciplines, and where different styles may be used even in the same publication. We address these problems by drawing upon suitable knowledge found in the WWW. In particular, we research a two-stage classifier approach, involving multi-class classification with respect to reference styles, and partially solve the problem of parsing surface representations of references. We describe empirical evidence for the effectiveness of our approach and plans for improvement of our methods.
Web-based citation parsing, correction and augmentation BIBAFull-Text 295-304
  Liangcai Gao; Xixi Qi; Zhi Tang; Xiaofan Lin; Ying Liu
Considering the tremendous value of citation metadata, many methods have been proposed to automate Citation Metadata Extraction (CME). The existing methods primarily rely on the content analysis of citation text. However, the results from such content-based methods are often unreliable. Moreover, the extracted citation metadata is only a small part of the relevant metadata that spreads across the Internet. As opposed to the content-based CME methods, this paper proposes a Web-based CME approach and a citation enriching system, called BibAll, which is capable of correcting the parsing results of content-based CME methods and augmenting citation metadata by leveraging relevant bibliographic data from digital repositories and cited-by publications on the Web. BibAll consists of four main components: citation parsing, Web-based bibliographic data retrieval, irrelevant bibliographic data filtering, and relevant bibliographic data integration. The system has been tested on the publicly available FLUX-CIM dataset. Experimental results show that BibAll significantly improves the citation parsing accuracy and augments the metadata of the original citation.

User behavior

Book selection behavior in the physical library: implications for ebook collections BIBAFull-Text 305-314
  Annika Hinze; Dana McKay; Nicholas Vanderschantz; Claire Timpany; Sally Jo Cunningham
Little is known about how readers select books, whether they be print books or ebooks. In this paper we present a study of how people select physical books from academic library shelves. We use the insights gained into book selection behavior to make suggestions for the design of ebook-based digital libraries in order to better facilitate book selection behavior.
How do people organize their photos in each event and how does it affect storytelling, searching and interpretation tasks? BIBAFull-Text 315-324
  Jesse Prabawa Gozali; Min-Yen Kan; Hari Sundaram
This paper explores photo organization within an event photo stream, i.e. the chronological sequence of photos from a single event. The problem is important: with the advent of inexpensive, easy-to-use photo capture devices, people can take a large number of photos per event. A family trip, for example, may include hundreds of photos. In this work, we have developed a photo browser that uses automatically segmented groups of photos -- referred to as chapters -- to organize such photos. The photo browser also affords users with a drag-and-drop interface to refine the chapter groupings.
   We conducted an exploratory study of 23 college students with their 8096 personal photos from 92 events, to understand the role of different spatial organization strategies in our chapter-based photo browser, in performing storytelling, photo search and photo set interpretation tasks. We also report novel insights on how the subjects organized their photos into chapters. We tested three layout strategies: bi-level, grid-stacking and space-filling, against a baseline plain grid layout. We found that subjects value the chronological order of the chapters more than maximizing screen space usage and that they value chapter consistency more than the chronological order of the photos. For automatic chapter groupings, having low chapter boundary misses is more important than having low chapter boundary false alarms; the choice of chapter criteria and granularity for chapter groupings are very subjective; and subjects found that chapter-based photo organization helps in all three tasks of the user study. Users preferred the chapter-based layout strategies to the baseline at a statistically significant level, with the grid-stacking strategy preferred the most.
Co-reading: investigating collaborative group reading BIBAFull-Text 325-334
  Jennifer Pearson; Tom Owen; Harold Thimbleby; George R. Buchanan
Collaborative reading, or co-reading as we call it, is ubiquitous; it occurs, for instance, in classrooms, book-clubs, and in less coordinated ways through mass media. While individual digital reading has been the subject of much investigation, research into co-reading is scarce. We report a two-phase field study of group reading to identify an initial set of user requirements. A co-reading interface is then designed that facilitates the coordination of group reading by providing temporary 'Point-out' markers to indicate specific locations within documents. A user study compared this new system with collaborative reading on paper, with a positive outcome; the differences in user behavior between paper and the new interface reveal intriguing insights into user needs and the potential benefits of digital media for co-reading.


A digital library for water main break identification and visualization BIBAFull-Text 335-336
  Sunshin Lee; Noha Elsherbiny; Edward A. Fox
This paper describes a prototype of a digital library for water main break identification and visualization. Many utilities rely on an emergency call to detect water main breaks, because breaks are difficult to predict. Collecting the information by call requires time-consuming human effort. Furthermore, the information is neither archived nor shared with others. Collecting and archiving the information from tweets, news, and web resources helps users to identify relevant water main breaks efficiently. In developing this prototype, we extracted location information from text instead of using GPS data. We also describe the importance of tweet visualization by location, and how we visualize tweets on a map.
A preliminary analysis of FRBR's bibliographic relationships for path based associative rules BIBAFull-Text 337-338
  Ya-Ning Chen; Hui-Pin Chen; Fei-Yen Tu
The Functional Requirements for Bibliographic Records (hereafter FRBR) has been adopted to address the relationships for bibliographic records and the related aggregate works. However, an approach to transform FRBR-based bibliographic relationships and their patterns into path-based rules for retrieval, navigation, display and data mining in the bibliographic space is still lacking. This study used the FRBR as a basis to analyze bibliographic relationships and their path-based rules. The novel "Harry Potter and the Philosopher's Stone" was used as a case study. Up until now, 87 unique records were retrieved from OCLC's Open WorldCat for analysis. Two specialists in library and information science familiar with FRBR conducted in-depth analysis to achieve inter-reliability agreement. This study generalizes several patterns of path-based rules for associating bibliographic records and outlines related issues for future study.
A qualitative analysis of information dissemination through Twitter in a digital library BIBAFull-Text 339-340
  Hae Min Kim; Christopher C. Yang; Eileen G. Abels; Mi Zhang
This study examines the use of Twitter in a digital library, the Internet Public Library (ipl2), to understand the content and dissemination patterns of Twitter messages posted by the ipl2. We conducted a content analysis on ipl2's messages on Twitter to develop a categorization of the type of tweets, and examined retweets and the active users who retweeted ipl2 tweets. We present our analysis of four areas related to the tweets: motivation, content, audience, and sources. Active users are categorized into eight groups. The research findings contribute to a further understanding of the actual use of Twitter in a digital library.
A study of automation from seed URL generation to focused web archive development: the CTRnet context BIBAFull-Text 341-342
  Seungwon Yang; Kiran Chitturi; Gregory Wilson; Mohamed Magdy; Edward A. Fox
In the event of emergencies and disasters, massive amounts of web resources are generated and shared. Due to the rapidly changing nature of those resources, it is important to start archiving them as soon as a disaster occurs. This led us to develop a prototype system for constructing archives with minimum human intervention using the seed URLs extracted from tweet collections. We present the details of our prototype system. We applied it to five tweet collections that had been developed in advance, for evaluation. We also identify five categories of non-relevant files and conclude with a discussion of findings from the evaluation.
A system for indexing tables, algorithms and figures BIBAFull-Text 343-344
  Pradeep B. Teregowda; Madian Khabsa; Clyde L. Giles
Indexing objects such as documents, figures, tables and algorithms in a single system presents challenges in schema mapping, detecting overlapping objects in documents, and presenting results from such a system to users. We propose a federated approach to indexing and retrieving such objects in academic papers.
A technique for suggesting related Wikipedia articles using link analysis BIBAFull-Text 345-346
  Christopher Markson; Min Song
With more than 3.7 million articles, Wikipedia has become an important social medium for sharing knowledge. However, with this enormous repository of information, it can often be difficult to locate fundamental topics that support lower-level articles. By exploiting the information stored in the links between articles, we propose that related companion articles can be automatically generated to help further the reader's understanding of a given topic. This approach to a recommendation system uses tested link analysis techniques to present users with a clear path to related high-level articles, furthering the understanding of low-level topics.
An exploration of the research trends in the digital library evaluation domain BIBAFull-Text 347-348
  Giannis Tsakonas; Angelos Mitrelis; Leonidas Papachristopoulos; Christos Papatheodorou
Evaluation is a vital research area in the digital library domain, demonstrating a growing literature in conference and journal papers. In this poster we present the research trends that governed the field within the decade 2001-2010 in the JCDL and ECDL conferences. The DL evaluation literature was annotated using the domain ontology DiLEO, which defines explicitly the main concepts of the digital library evaluation field and their correlations. Several findings from this study underline the persistent character of quantitative research in evaluation initiatives.
An iterative reliability measure for semi-anonymous annotators BIBAFull-Text 349-350
  Peter Organisciak
This study addresses problems of reliability in the creation of tagged corpora by self-selected semi-anonymous raters. In order to account for both strong and weak raters, this paper contributes a recursive technique for scoring rater reliability. By assigning raters trust scores in the proposed method, candidate labels can be weighted by a confidence score and low-confidence ratings can be routed to an expert rater or additional amateur raters for further action.
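One way to picture an iterative rater-reliability scheme of this kind is sketched below. It is an illustrative assumption, not the procedure from the paper: trust starts at 1.0 for every rater, each item's consensus label is chosen by trust-weighted vote, and each rater's trust is then reset to the share of their ratings that match the consensus, repeating until the scores stop changing. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rating = Tuple[str, str, str]  # (rater id, item id, label)


def score_raters(
    ratings: List[Rating], iterations: int = 10
) -> Tuple[Dict[str, float], Dict[str, str], Dict[str, float]]:
    """Return (rater trust, consensus label per item, confidence per item)."""
    trust = {rater: 1.0 for rater, _, _ in ratings}
    consensus: Dict[str, str] = {}
    confidence: Dict[str, float] = {}
    for _ in range(iterations):
        # Trust-weighted vote for each item's label.
        votes = defaultdict(lambda: defaultdict(float))
        for rater, item, label in ratings:
            votes[item][label] += trust[rater]
        for item, by_label in votes.items():
            # Iterate labels in sorted order so ties resolve deterministically.
            best = max(sorted(by_label), key=by_label.get)
            total = sum(by_label.values())
            consensus[item] = best
            confidence[item] = by_label[best] / total if total else 0.0
        # A rater's trust becomes their rate of agreement with the consensus.
        agree = defaultdict(int)
        seen = defaultdict(int)
        for rater, item, label in ratings:
            seen[rater] += 1
            agree[rater] += label == consensus[item]
        new_trust = {rater: agree[rater] / seen[rater] for rater in seen}
        if new_trust == trust:
            break  # scores are stable
        trust = new_trust
    return trust, consensus, confidence


# Hypothetical toy data: rater "c" disagrees with "a" and "b" on item "i3".
toy_ratings = [
    ("a", "i1", "x"), ("b", "i1", "x"), ("c", "i1", "x"),
    ("a", "i2", "y"), ("b", "i2", "y"), ("c", "i2", "y"),
    ("a", "i3", "x"), ("b", "i3", "x"), ("c", "i3", "y"),
]
toy_trust, toy_consensus, toy_confidence = score_raters(toy_ratings)
```

Items whose confidence falls below a chosen cut-off could then be routed to an expert or to additional raters, as the abstract suggests; the cut-off itself would need to be chosen for the collection at hand.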
An unsupervised technical difficulty ranking model based on conceptual terrain in the latent space BIBAFull-Text 351-352
  Shoaib Jameel; Wai Lam; Xiaojun Qian; Ching-man Au Yeung
Search results of the existing general-purpose search engines usually do not satisfy domain-specific information retrieval tasks as there is a mismatch between the technical expertise of a user and the results returned by the search engine. In this paper, we investigate the problem of ranking domain-specific documents based on their technical difficulty. We propose an unsupervised conceptual terrain model using Latent Semantic Indexing (LSI) for re-ranking search results obtained from a similarity based search system. We connect the sequences of terms under the latent space by the semantic distance between the terms and compute the traversal cost for a document indicating the technical difficulty. Our experiments on a domain-specific corpus demonstrate the efficacy of our method.
Longitudinal analysis of historical texts' readability BIBAFull-Text 353-354
  Adam Jatowt; Katsumi Tanaka
Digital libraries often contain historical documents of varying age. The degree to which users can understand their content depends largely on their reading difficulty. In this poster paper we report the results of our studies on the readability of historical documents from the viewpoint of present users. We investigate the correlation between the outcomes of different readability measurements and publication dates of prose texts on the basis of two datasets, the Victorian Women Writers Project and the Corpus of Late Modern English Texts.
Bi2SoN: a digital library for supporting biomedical research BIBAFull-Text 355-356
  Benjamin Köhncke; Sascha Tönnies; Wolf-Tilo Balke
In the domain of biology a huge number of different data sources are available. Information gathering and searching are therefore challenging tasks. To avoid a manual assessment of all relevant data sources, their knowledge has to be integrated. The presented system focuses on all aspects needed for suitable data integration and retrieval for domain experts from the field of biology. The knowledge from different data sources is combined and further used for, e.g., synonym enrichment of the query term. The resulting prototype was presented to a group of domain experts, who confirmed that the system delivers suitable results that support scientists in their literature search.
CADAL digital calligraphy system BIBAFull-Text 357-358
  Pengcheng Gao; Jiangqin Wu; Yang Xia; Yuan Lin
CADAL (China Academic Digital Associate Library) plays a primary role in the Universal Digital Library. By the end of 2011, CADAL had digitized 1.85 million books. Chinese calligraphy occupies an important place in Chinese culture, and the collection of digitized Chinese calligraphy is a large part of CADAL's resources. Services that make full use of these collections are therefore needed by diverse users, such as art historians, students, and the public. Here we propose the CADAL Digital Calligraphy System, which includes over 1,100 works and 100,000 characters. It provides multi-level metadata-based search (metadata-based book search, work search, and character search) and multi-grain calligraphic character search (content-based search and radical-based search). Finally, some search-related applications of the CADAL Digital Calligraphy System are discussed.
Characterize scientific domain and domain context BIBAFull-Text 359-360
  Jinsong Zhang; Chun Guo; Xiaozhong Liu
Domain knowledge map construction is an important method for describing the significant characteristics of a selected domain. In this research, we will address three problems in knowledge graph generation. First, this paper will construct domain (core journals and conference proceedings) knowledge graphs and domain context (domain citation) knowledge graphs, and propose a novel method to integrate those graphs. Second, two different methods will be investigated to associate keywords on the graph: Co-occur Domain Distance and Citation Probability Distribution Distance. Finally, the paper will propose an innovative method to evaluate the accuracy and coverage of knowledge graphs, based on training a keyword-oriented Labeled-LDA model, and to validate different domain or domain context graphs.
Collaboration and communication tools used by the biodiversity heritage library BIBAFull-Text 361-362
  Trish Rose-Sandler; Constance Rinaldo; Keri Thompson; William Ulate; Martin Kalfatovic
Through the application of multiple strategies and tools, the Biodiversity Heritage Library has created an effective and collaborative multi-institutional virtual organization. The purpose of this paper is to explore the communication and collaboration strategies used by the BHL to create, maintain, and provide open access to its corpus of biodiversity literature. BHL, in its seventh year, is a mature service and no longer a pilot project. Largely driven from the ground up, and without any institutional mandate, the BHL has successfully and organically fostered an organizational model that has encouraged innovation, user engagement, and global expansion.
Data determination, disambiguation, and referencing in molecular biology BIBAFull-Text 363-364
  Shuheng Wu; Besiki Stvilia; Dong Joon Lee
Entity and instance determination, disambiguation, and referencing, referred to as authority control in libraries, are essential for scientific research. This study examines the authority control practices and issues in molecular biology using literature and scenario analyses. The analyses imply that the concept of authority control in molecular biology is associated with three tasks: named entity recognition, disambiguation, and unification. The identified authority control issues were conceptualized as quality problems caused by four sources: inconsistent or incomplete mapping, context changes, entity changes, and changes in entity metadata. This study can inform librarians and repository curators of the needs and issues of authority control in molecular biology and other related disciplines.
Digital libraries for computational journalism BIBAFull-Text 365-366
  Luis Francisco-Revilla
Computational journalism is driving the evolution of news media, devising new ways for collecting and analyzing large numbers of digital artifacts such as tweets and memes. This paper presents Breadcrumbs PDL, a specialized Personal Digital Library system that helps readers and journalists to access and use a collection of user-detected memes. PDL is part of Project Breadcrumbs, which aims to capitalize on public participation in the news media cycle. PDL supports browsing and exploration, and offers recommendation services that suggest alternative memes to read and ways to organize the users' personal workspace. Based on the users' clipping and organizational behaviors and textual similarities between clips, PDL can infer relationships between memes that computers alone cannot easily detect.
Comparison of three digital library interfaces: open library, Google books, and Hathi Trust BIBAFull-Text 367-368
  Matthew Miller; Gilok Choi; Lindsay Chell
Digital libraries often require very specialized interfaces in order to present various types of digital content. It is therefore critical to create interfaces that improve presentation of digital information and maximize user experience with digital collections. In this respect, this research aims to examine interfaces of three digital libraries that provide collections of digital text. The three digital libraries include Open Library, Google Books, and Hathi Trust. An evaluation matrix was developed to measure usability, aesthetics and interface components. The overall findings of the study showed that the majority of the participants preferred the Open Library interface followed by Google Books. The statistical analysis indicated that Open Library is significantly better than Google Books and Hathi Trust in terms of usability, aesthetics, and interface components. The preference for the Open Library stemmed largely from aesthetic choices. Participants also appreciated the use of elements that are analogous to their physical counterparts.
Digital preservation in a box: outreach resources for digital stewardship BIBAFull-Text 369-370
  Butch Lazorchak; Susan Manus; Dever Powell; Jane Zhang
"Digital Preservation in a Box" is a major activity of the National Digital Stewardship Alliance (NDSA) Outreach Working Group. This toolkit of digital stewardship outreach resources can be utilized by diverse communities as a gentle introduction to the concepts of preserving digital information.
Distinguishing venues by writing styles BIBAFull-Text 371-372
  Zaihan Yang; Brian D. Davison
A principal goal for most research scientists is to publish. There are different kinds of publications covering different topics and requiring different writing formats. While authors tend to have unique personal writing styles, no work has been carried out to find out whether publication venues are distinguishable by their writing styles. Our work takes the first step into exploring this problem. Using the traditional classification approach and carrying out experiments on real data from the CiteSeer digital library, we demonstrate that venues are also distinguished by their writing styles.
Do public library websites consider the disabled or senior citizens? BIBAFull-Text 373-374
  Yong Jeong Yi; Ji Hei Kang
The issues of mobility and sight impairment with respect to virtual accessibility are as important as physical accessibility when it comes to using public library services. However, few studies have discussed public library website accessibility from the perspective of underrepresented user groups. The purpose of this study is to evaluate the accessibility of public library websites and to identify the association between accessibility and public libraries' budgets. The study selected the 20 public library systems that have the highest percentages of disabled or senior citizen patrons. The study employed the Pearson correlation test to investigate the correlation between the accessibility and the budgets of public libraries. Preliminary findings show that most current public library websites do not comply with Section 508. The findings indicate that public libraries did not consider their users or potential users with physical disabilities when designing their websites. The findings therefore suggest that public library websites are not suited to deliver effective information services for underrepresented user populations who need special assistance. Furthermore, this study finds no significant association between public library websites' accessibility and the libraries' budgets.
Electronic records processing: it's a CINCH! BIBAFull-Text 375-376
  Amy Rudersdorf; Dean Farrell; Lisa Gregory
In August 2011, five project partners (the State Library of North Carolina, the North Carolina State Archives, North Carolina Libraries for Virtual Education, Elon University, and the University of North Carolina at Charlotte) began a collaboration to develop a computer application that collects, ingests, and authenticates the electronic records that libraries and archives are often mandated to maintain. The application, called "CINCH," incorporates existing digital curation technologies, but adds to their functionality by creating a pull-down (or capture) utility to gather content available from the Internet. The final product will be a lightweight, open-source software tool that institutions required to collect and authenticate records on ingest can employ to retrieve and process their digital content.
Exploiting canonical structures to transmit complex objects from a digital library to a portal BIBFull-Text 377-378
  Scott Britell; Lois Delcambre; Lillian Cassel; Edward Fox; Richard Furuta
Global web archive integration with memento BIBAFull-Text 379-380
  Robert Sanderson
In this poster, we describe the approach taken to designing and implementing a tera-scale multi-repository index of archived web resources using massively parallel processing.
GROTOAP: ground truth for open access publications BIBAFull-Text 381-382
  Dominika Tkaczyk; Artur Czeczko; Krzysztof Rusek; Lukasz Bolikowski; Roman Bogacewicz
The field of digital document content analysis includes many important tasks, for example page segmentation or zone classification. It is impossible to build effective solutions for such problems and evaluate their performance without a reliable test set that contains both input documents and the expected results of segmentation and classification. In this paper we present GROTOAP -- a test set useful for training and performance evaluation of page segmentation and zone classification tasks. The test set contains input articles in digital form and corresponding ground truth files. All input documents included in the test set have been selected from the DOAJ database, which indexes articles published under a CC-BY license. The whole test set is available under the same license.
Has it been already digitized?: how to find information about digitized documents BIBAFull-Text 383-384
  Tomas Foltyn; Martin Lhotak; Pavel Kocourek
The Digitization Registry of the Czech Republic is a research project whose aim is to create a national registry of digitized documents, both to avoid unwanted duplication in digitization and to share digitization results throughout the Czech Republic. This could make digitization more effective and save the financial resources of participating institutions.
How can spreaders affect the indirect influence on Twitter? BIBAFull-Text 385-386
  Xin Shuai; Ying Ding; Jerome Busemeyer
Most studies on social influence have focused on direct influence. Another interesting question is whether indirect influence exists between two users who are not directly connected in the network, and what affects such influence. The theory of complex contagion tells us that more spreaders will enhance the indirect influence between two users. Our observation of the intensity of indirect influence, propagated by n parallel spreaders and quantified by retweeting probability on Twitter, shows that complex contagion is validated globally but violated locally. In other words, the retweeting probability increases non-monotonically, with some local drops.
Improving a hybrid literary book recommendation system through author ranking BIBAFull-Text 387-388
  Paula Cristina Vaz; David Martins de Matos; Bruno Martins; Pavel Calado
Literary reading is an important activity for individuals and can be a long-term commitment, making book choice an important task for book lovers and public library users. In this paper, we present a hybrid recommendation system to help readers decide which book to read next. We study book and author recommendations in a hybrid recommendation setting and test our algorithm on the LitRec data set. Our hybrid method combines two item-based collaborative filtering algorithms to predict books and authors that the user will like. Author predictions are expanded into a booklist that is subsequently aggregated with the former book predictions. Finally, the resulting booklist is used to yield the top-n book recommendations. By means of various experiments, we demonstrate that author recommendation can improve overall book recommendation.
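The aggregation step described above can be illustrated with a short sketch. It is not the authors' algorithm: it takes the two sets of predicted scores as given (the collaborative filtering itself is omitted) and assumes a simple weighted sum with a hypothetical weight `alpha`. The function name, the expansion rule (each book inherits its author's score), and the sample data are all illustrative assumptions.

```python
def top_n_books(book_scores, author_scores, books_by_author, n=3, alpha=0.5):
    """Merge book predictions with author predictions expanded into books.

    book_scores:     book -> predicted score (from a book-level recommender)
    author_scores:   author -> predicted score (from an author-level recommender)
    books_by_author: author -> list of that author's books
    alpha:           hypothetical weight given to the book-level score
    """
    # Expand author predictions into a booklist: each book inherits the
    # score of its author (the highest one if it has several authors).
    expanded = {}
    for author, score in author_scores.items():
        for book in books_by_author.get(author, []):
            expanded[book] = max(expanded.get(book, 0.0), score)

    # Aggregate the two booklists with a weighted sum.
    combined = {}
    for book in set(book_scores) | set(expanded):
        combined[book] = (alpha * book_scores.get(book, 0.0)
                          + (1 - alpha) * expanded.get(book, 0.0))

    # Highest combined score first; ties broken by book identifier.
    ranked = sorted(combined, key=lambda b: (-combined[b], b))
    return ranked[:n]

# Made-up example data.
book_scores = {"B1": 0.9, "B2": 0.4, "B3": 0.2}
author_scores = {"A1": 0.8, "A2": 0.1}
books_by_author = {"A1": ["B2", "B4"], "A2": ["B3"]}
recommended = top_n_books(book_scores, author_scores, books_by_author, n=3)
```

In this example the strongly predicted author A1 lifts B2 above B1 and brings in B4, which the book-level predictions alone would not have surfaced.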
Introducing high performance computing in digital library processing workflows BIBAFull-Text 389-390
  Bill Barth; Maria Esteva; Jon Gibson; Ladd Hanson; Christopher Jordan
As larger collections need to be processed for digital library projects, libraries have to adopt technologies of scale. We present a case that involved creating image derivatives using High Performance Computing (HPC) resources. This experience opens up possibilities to conduct various processing tasks effectively and in reasonable time frames. Most importantly, it gives library IT staff access to cyberinfrastructure that can address the computing challenges of large-scale digital library projects.
Investigating user perceptions of engagement and information quality in mobile human computation games BIBAFull-Text 391-392
  Dion Goh; Khasfariyati Razikin; Chei Sian Lee; Alton Chua
We investigate user perceptions of engagement and information quality of a mobile human computation game (HCG) by comparing it against a non-game-based application. Results suggest that the mobile HCG enabled participants to occupy their leisure time but the information contributed was perceived to be not as relevant. Implications of this study are discussed.
Lessons learned from developing and evaluating a comprehensive digital library for engineering education BIBAFull-Text 393-394
  Yunlu Zhang; Alice M. Agogino; Shijun Li
Educating the engineering education community in today's digital world requires straightforward yet flexible access to high-quality educational resources. The Teach Engineering and NEEDS (National Engineering Education Delivery System) digital libraries collaborated in 2005 to create and steward the K-Gray Engineering Pathway (EP), a premier portal to comprehensive engineering and computing education resources within the greater National Science Digital Library (NSDL). We collaborated to design navigation, implement features, and find imagery that could effectively address both K-12 and higher education audiences. A system was designed to serve both target audiences, including a simple search on every page that was expanded to include grade/audience level search fields. This search, on all main pages, also includes a choice of learning resource type and a link to the Advanced Search with expanded search fields. EP tailored many features, such as community pages and cataloging, to be distinguishable by K-12 versus higher education users. Evaluation studies show that our current strength is a consistent interface with strong usability features. In this paper, we provide a retrospective and summarize our lessons learned and evaluation results, along with our directions for future research and development.
Meta-line: lineage information for improved metadata quality BIBAFull-Text 395-396
  Sascha Tönnies; Benjamin Köhncke; Wolf-Tilo Balke
Controlled content quality, including the quality of indexing, is one of the major advantages of using digital libraries in contrast to general Web sources or Web search engines. However, considering today's information flood, the mostly manual effort of acquiring new sources and creating suitable (semantic) metadata for content indexing and retrieval is already prohibitive. A recent solution is the automatic generation of metadata, for which various methods are becoming more widespread. But in this case neglecting quality assurance is even more problematic, because heuristic generation often fails and the resulting low-quality metadata will directly diminish the quality of service that a digital library provides. To address this problem, we propose a metadata quality model to determine the overall quality of a metadata set and validate individual requirements imposed on that metadata set. Furthermore, lineage information is provided to trace the quality evolution of a metadata set.
Multi-view of the ACM classification system BIBAFull-Text 397-398
  Xia Lin; Mi Zhang; Haozhen Zhao; Jan Buzydlowski
The ACM Computing Classification System (CCS) is a hierarchical classification system used to index and classify all the published literature of ACM. Its terms reflect major areas and topics of the computing field, and they often serve as an overview and navigational guide to the field. However, as with all traditional classification systems and subject domain thesauri, such an overview and navigational guide is static and sketchy, offering only a top-down representation of a domain. In this paper, we look into a 10-year period of ACM literature and examine how the CCS terms are actually used in the ACM digital library and how the patterns of term usage show different term relationships than those defined in the CCS. By comparing the dynamic statistical patterns of term usage with the static hierarchical structures of the terms, we show that much can be gained by integrating both of them into an interactive interface to provide better overview maps and navigational guides to the domain of computing.
National digital newspaper program: a case study in sharing, linking, and using data BIBAFull-Text 399-400
  Nathan Yarasavage; Robin Butterhof; Christopher Ehrman
This poster presents a case study describing how the National Digital Newspaper Program's (NDNP) metadata specification and public website, Chronicling America, have been designed to promote a wide range of data sharing. Through use of the website's extensive application programming interface (API) and open-source software counterpart, several institutions are benefiting from the publicly-funded program's data.
Responsibility for research data quality in open access: a Slovenian case BIBAFull-Text 401-402
  Janez Stebe
In the framework of a project aiming to realize a strategy of open research data access in Slovenia in accordance with OECD principles, we conducted a series of interviews with different target audiences in order to assess the initial conditions in the area of data handling. The data creators and data services expressed a high level of awareness about data quality issues, especially in relation to good publication potential. Barriers to ensuring greater accessibility of data in the future include the lack of recognition and reputation for the extra work involved in preparing data and documentation, the need for financial rewards for such additional work, and the undeveloped culture of data exchange in general. The motivation to provide open access to such data will involve a combination of requirements prescribed for data delivery, the provision of support services and financial rewards and, in particular, changing the views held by the professional scientific community about the benefits of open data for research activities.
Scientific cyberlearning resources referential metadata creation via information retrieval BIBAFull-Text 403-404
  Xiaozhong Liu; Han Jia
The goal of this research is to describe an innovative method of creating scientific referential metadata for a cyberinfrastructure-enabled learning environment to enhance student and scholar learning experiences. By using information retrieval and meta-search approaches, different types of referential metadata, such as related Wikipedia Pages, Datasets, Source Code, Video Lectures, Presentation Slides, and (online) Tutorials, for an assortment of publications and scientific topics will be automatically retrieved, associated, and ranked.
Sheer curation for experimental data and provenance BIBFull-Text 405-406
  Mark Hedges; Tobias Blanke
Web2MARC: sharing and using STEM digital content in school libraries BIBAFull-Text 407-408
  Marcia A. Mardis; Casey McLaughlin; Grant Gingell
Digital content can benefit K-12 science, technology, engineering, and mathematics (STEM) teaching and learning, but it is not widely integrated. Many school librarians are not sure how to build upon their expertise to share and link digital learning resources in their roles as resource providers and instructional collaborators. This poster will present Web2MARC, a web-based application for integration of digital resources into school library collections. Further work on the state of school library STEM collections, survey analysis, and Web2MARC is slated to be complete in 2013.
Social network-based recommendation: a graph random walk kernel approach BIBAFull-Text 409-410
  Xin Li; Xin Su; Mengyue Wang
Traditional recommender system research often explores customer, product, and transaction information in providing recommendations. Social relationships in social networks are related to individuals' preferences. This study investigates the product recommendation problem based solely on people's social network information. Taking a kernel-based approach, we capture consumer social influence similarities in a graph random walk kernel and build SVR models to predict consumer opinions. In experiments on a dataset from a movie review website, our proposed model outperforms trust-based models and state-of-the-art graph kernels.
The David Livingstone spectral imaging project BIBAFull-Text 411-412
  Stephen Davison; Adrian S. Wisnicki; Elizabeth McAulay
The David Livingstone Spectral Imaging Project is a collaborative, international effort to use spectral imaging technology and digital publishing to make available a series of faded, illegible texts produced by the famous Victorian explorer when he was stranded without ink or writing paper in Central Africa. The poster describes existing achievements of the project, plans for an innovative portal providing access to images and data, and preservation challenges.
The logical form of the proposition expressed by a metadata record BIBAFull-Text 413-414
  Karen M. Wickett; Allen H. Renear
Metadata records are a ubiquitous and foundational feature of contemporary information systems. However, while their simple surface structure may lead us to think that the semantics of a metadata record is unproblematic and easily discerned, our analysis of an example record suggests otherwise. We show three possibilities for the logical form of the proposition expressed by a metadata record. All three are substantially different in the first-order constructs utilized, and no two can be recognized as equivalent for the purposes of information organization. The semantics of the common metadata record is elusive. The main source of this problem appears to be the identifier attribute. Although identifier attributes have the syntactic appearance of any other attribute in the metadata vocabulary, this uniformity conceals their potential for assuming a distinctive semantic role, and one which appears to cross the traditional object language / metalanguage boundary, suggesting that translation of colloquial metadata records into logic-based knowledge representations does not take place entirely at a first-order level.
Toponym extraction and resolution in a digital library BIBAFull-Text 415-416
  James S. Creel; Katherine Weimer
Geospatial metadata enable rich and varied interfaces to digital collections, and present unique challenges and affordances for automated extraction. We describe our findings developing and utilizing a geoparser for the Texas A&M University Libraries' Institutional Repository.
Pinterest: social collecting for #linking #using #sharing BIBFull-Text 417-418
  Michael Zarro; Catherine Hall
YADDA2: assemble your own digital library application from LEGO bricks BIBAFull-Text 419-420
  Wojtek Sylwestrzak; Tomasz Rosiek; Lukasz Bolikowski
YADDA2 is an open software platform which facilitates creation of digital library applications. It consists of versatile building blocks providing, among others: storage, relational and full-text indexing, process management, and asynchronous communication. Its loosely-coupled service-oriented architecture enables deployment of highly-scalable, distributed systems.


An integrated participatory platform for human evaluation of machine translation BIBAFull-Text 421-422
  Jiangping Chen; Olajumoke Azogu; Wenqian Zhao
We describe the functions of HeMT, a multilingual participatory platform for Human Evaluation of Machine Translation. HeMT is used by three types of users: translators, evaluators, and reviewers. It consists of six major modules: User Management, Manual Translation, User Training, Evaluation, Result Visualization, and Multilingual Lexicon Management. HeMT can be used by Digital Libraries and Machine Translation communities for conducting manual translation, machine translation evaluation, and computer-assisted translation tasks.
'Erasmus': an organization- and user-centered Dublin Core metadata tool BIBAFull-Text 423-424
  Michael Khoo; Craig M. MacDonald; Joon Park
Digital library interoperability is supported by good quality metadata. The design of metadata creation and management tools is therefore an important component of overall digital library design. A number of factors affect metadata tool usability, including task complexity, interface usability, and organizational context of use. These issues are being addressed in the user-centered design of a metadata tool for the Internet Public Library.
Faceted search for heterogeneous digital collections BIBAFull-Text 425-426
  Hui Zhang; Mike Durbin; Jon Dunn; Will Cowan; Brian Wheeler
The idea of faceted search has received growing attention in the digital library field for its potential to improve user satisfaction by combining the query and browse strategies interactively. Furthermore, with the trend of using digital repositories as the central infrastructure for curation and preservation, there is a demand for a single search interface providing public access to all the diversified content stored in the repositories. In this demo, we present Digital Collections Search, a system designed to assist users who are unfamiliar with the subject of their information needs in locating relevant items as well as exploring related but unknown collections in the repository.
NLM video search BIBAFull-Text 427-428
  John P. Doyle; Doron Shalvi; Edward C. Luczak
In this demo, we will demonstrate usage of NLM Video Search, open-source software which facilitates the dissemination of video content by combining traditional web video playback controls with on-demand seeking using text selected from a corresponding transcript (see Figure 1). NLM Video Search has been implemented in NLM's Fedora-based digital repository, which provides preservation and access to a growing number of rare, historical films and digitized texts.
Research discovery through linked open data BIBAFull-Text 429-430
  Paul Albert; Kristi L. Holmes; Katy Börner; Mike Conlon
VIVO is an open source semantic web platform that contains information about scholars and their interests and activities. This demonstration will highlight the platform and ontology, data sources, features of the software and the ways that VIVO data can be leveraged for a variety of purposes within and beyond an institution to facilitate collaboration and research discovery.
Structured audio content analysis and metadata in a digital library BIBAFull-Text 431-432
  David Bainbridge; John Stephen Downie; Andreas F. Ehmann
This work illustrates how audio content analysis of music and manually assigned structural temporal metadata can be used to form a digital library designed for musicological exploration. In addition to text-based searching and browsing, the document view is enriched with an interactive structured audio time-line that shows ground-truth data representing the logical segments of the song, and a version that was automatically generated for comparison. A self-similarity "heat" map is also displayed, and is interactive. Clicking within the map at a co-ordinate (x,y) results in the audio being played simultaneously at time offsets x and y, panned left and right, respectively, to make it easier for the listener to separate out the differences. The musicologist can also initiate an audio content-based query starting at any point in the song. This produces a ranked result set whose items can be further studied through their respective document views. Alternatively, they can perform a musical structure search (for example, for songs that contain the structure b, b, c, b, c).
The profiles in science digital library: behind the scenes BIBAFull-Text 433-434
  Marie E. Gallagher; Christie Moffatt
This demonstration shows the Profiles in Science® digital library. Profiles in Science contains digitized selections from the personal manuscript collections of prominent biomedical researchers, medical practitioners, and those fostering science and health. The Profiles in Science Web site is the delivery mechanism for content derived from the digital library system. The system is designed according to our basic principles for digital library development [1]. The digital library includes the rules and software used for digitizing items, creating and editing database records and performing quality control as well as serving the digital content to the public. Among the types of data managed by the digital library are detailed item-level, collection-level and cross-collection metadata, digitized photographs, papers, audio clips, movies, born-digital electronic files, optical character recognized (OCR) text, and annotations (see Figure 1). The digital library also tracks the status of each item, including digitization quality, sensitivity of content, and copyright. Only items satisfying all required criteria are released to the public through the World Wide Web. External factors have influenced all aspects of the digital library's infrastructure.
The ResultsSpace collaborative search environment 435-436
  Robert Capra; Jaime Arguello; Annie Chen; Katie Hawthorne; Gary Marchionini; Lee Shaw
The ResultsSpace Collaborative Search Environment is a tool to support asynchronous collaborative information retrieval among a small group of collaborators. It is designed to promote awareness of collaborators' searches and the documents they have rated. Awareness is supported through several mechanisms: an area that shows a history of queries, a summary display of collaborators' ratings next to each search result, and changes in the visual salience of search results based on their aggregate rating from all collaborators. Faceted controls allow users to filter results by specific ratings (relevant, not relevant, and maybe) and by the specific collaborator(s) who have rated an item. We describe the features of the system and how they are implemented, and give insights into the design rationale.
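The faceted filtering and per-result rating summary described in this abstract can be sketched as follows. This is an illustrative assumption, not the ResultsSpace implementation: the `ratings` store, the collaborator names, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical rating store: document id -> {collaborator: rating}.
ratings = {
    "doc1": {"ana": "relevant", "ben": "relevant"},
    "doc2": {"ana": "not relevant"},
    "doc3": {"ben": "maybe", "cy": "relevant"},
}

def summarize(doc_id):
    """Count how many collaborators gave each rating to a document."""
    return Counter(ratings.get(doc_id, {}).values())

def filter_results(doc_ids, rating=None, collaborators=None):
    """Keep documents with at least one rating that matches every active facet.

    A facet left as None is inactive. With both facets inactive, any
    document that has been rated by anyone is kept.
    """
    kept = []
    for doc_id in doc_ids:
        for who, value in ratings.get(doc_id, {}).items():
            if collaborators is not None and who not in collaborators:
                continue
            if rating is not None and value != rating:
                continue
            kept.append(doc_id)
            break  # one matching rating is enough for this document
    return kept

results = ["doc1", "doc2", "doc3", "doc4"]
relevant_any = filter_results(results, rating="relevant")
relevant_ben = filter_results(results, rating="relevant", collaborators={"ben"})
```

The counts returned by `summarize` are the kind of aggregate a display could map to visual salience; how ResultsSpace actually weights ratings is not stated in the abstract.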
WARCreate: create Wayback-consumable WARC files from any webpage 437-438
  Mat Kelly; Michele C. Weigle
The Internet Archive's Wayback Machine is the most common way that typical users interact with web archives. The Internet Archive uses the Heritrix web crawler to transform pages on the publicly available web into Web ARChive (WARC) files, which can then be accessed using the Wayback Machine. Because Heritrix can only access the publicly available web, many personal pages (e.g., password-protected pages, social media pages) cannot be easily archived into the standard WARC format. We have created a Google Chrome extension, WARCreate, that allows a user to create a WARC file from any webpage. Using this tool, content that might otherwise have been lost to time can be archived in a standard format by any user. This tool provides a way for casual users to easily create archives of personal online content. This is one of the first steps toward resolving issues of "long term storage, maintenance, and access of personal digital assets that have emotional, intellectual, and historical value to individuals".
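To make the WARC format mentioned in this abstract concrete, the sketch below assembles a single WARC 1.0 `response` record using only the Python standard library. It is not WARCreate's code (WARCreate is a Chrome extension); the function name, the example URI, and the captured HTTP bytes are assumptions for illustration, and the record omits optional fields a production archive would usually carry.

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri, http_response, when=None):
    """Build one WARC/1.0 'response' record as bytes.

    http_response is the raw HTTP response (status line, headers, body).
    when is assumed to be a UTC datetime; it defaults to the current time.
    """
    when = when or datetime.now(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Header lines end with CRLF; a blank line separates header and block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return head + http_response + b"\r\n\r\n"

# Made-up capture of a small page, for illustration only.
captured = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hello</body></html>"
)
record = warc_response_record(
    "http://example.com/private-page",
    captured,
    when=datetime(2012, 6, 10, 12, 0, 0, tzinfo=timezone.utc),
)
```

A WARC file is a concatenation of such records, typically preceded by a `warcinfo` record describing the capture.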