
JCDL'15: Proceedings of the 2015 ACM/IEEE-CS Joint Conference on Digital Libraries

Fullname: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries
Editors: Paul Logasa Bogen, II; Suzie Allard; Holly Mercer; Micah Beck; Sally Jo Cunningham; Dion Goh; Geneva Henry
Location: Knoxville, Tennessee
Dates: 2015-Jun-21 to 2015-Jun-25
Standard No: ISBN 978-1-4503-3594-2
  1. Keynotes
  2. Session 1 -- People and Their Books
  3. Session 2 -- Information Extraction
  4. Session 3 -- Big Data, Big Resources
  4. Session 4 -- Working the Crowd
  6. Session 5 -- User Issues
  7. Session 6 -- Ontologies and Semantics
  8. Session 7 -- Non-text Collections
  9. Session 8 -- Temporality
  10. Session 9 -- Archiving, Repositories, and Content
  11. Poster & Demo Session
  12. Panels
  13. Tutorials
  14. Workshop Summaries


Keynotes

The Google Cultural Institute: Tools for Libraries, Archives, and Museums BIBAFull-Text 1
  Piotr Adamczyk
In 2011, Google launched the Google Art Project, an ever-growing repository of artworks from museums around the globe, quickly followed by the expanded Google Cultural Institute. Efforts like these with the cultural sector use a combination of Google technologies and expert information provided by partner institutions to create unique online experiences. Spurred on by our partners, we've been adding features to our platform -- content hosting, embeddable image viewers, exhibit creation tools -- and making Google technology work for museums -- high-resolution imaging, mobile publishing, and experiments in VR. Building these projects requires a deep understanding of library, archival, and museum practices and standards as well as providing tools that can be used by a wide array of partners at different stages of cataloging and digitization. So, how are we doing? We'll discuss reactions to the work so far, present some of our latest attempts to do more with cultural heritage online, and talk about how Google would like to further engage with cultural partners.
Moving the Needle: From Innovation to Impact BIBFull-Text 3
  Katherine Skinner
The HathiTrust Research Center: Providing analytic access to the HathiTrust Digital Library's 4.7 billion pages BIBAFull-Text 5
  J. Stephen Downie
This lecture provides an update on the recent developments and activities of the HathiTrust Research Center (HTRC). The HTRC is the research arm of the HathiTrust, an online repository dedicated to the provision of access to a comprehensive body of published works for scholarship and education. The HathiTrust is a partnership of over 100 major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. Membership is open to institutions worldwide. Over 13.1 million volumes (4.7 billion pages) have been ingested into the HathiTrust digital archive from sources including Google Books, member university libraries, the Internet Archive, and numerous private collections. The HTRC is dedicated to facilitating scholarship by enabling analytic access to the corpus, developing research tools, fostering research projects and communities, and providing additional resources such as enhanced metadata and indices that will assist scholars to more easily exploit the HathiTrust materials.
   This talk will outline the mission, goals and structure of the HTRC. It will also provide an overview of recent work being conducted on a range of projects, partnerships and initiatives. Projects include the Workset Creation for Scholarly Analysis project (WCSA, funded by the Andrew W. Mellon Foundation) and the HathiTrust + Bookworm project (HT+BW, funded by the National Endowment for the Humanities). HTRC's involvement with the NOVEL™ text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme, will be introduced. The HTRC's new feature extraction and Data Capsule initiatives, part of its ongoing efforts to enable the non-consumptive analyses of the approximately 8 million volumes under copyright restrictions, will also be discussed. The talk will conclude with some suggestions on how the non-consumptive research model might be improved upon and possibly extended beyond the HathiTrust context.

Session 1 -- People and Their Books

Result List Actions in Fiction Search BIBAFull-Text 7-16
  Pertti Vakkari; Janna Pöntinen
We study how users browse search results to find interesting novels in four search scenarios. In particular, we evaluate whether search result page (SERP) browsing patterns and effectiveness differ between an enriched catalog for finding fiction and a traditional public library catalog. The data was collected from 30 participants by eye-tracking and questionnaires. The results indicate that the enriched catalog helped users identify potentially clickable items on the results list sooner and more effectively than the traditional public library catalog. This is likely due to the more informative metadata in the enriched catalog, such as snippets of content description on the result list items. The discussion includes a theoretical and empirical comparison of findings in studies on fiction and non-fiction searching.
Where My Books Go: Choice and Place in Digital Reading BIBAFull-Text 17-26
  George Buchanan; Dana McKay; Joanna Levitt
Digital reading is a topic of rising interest in digital libraries, particularly in terms of optimizing the reading experience. However, there is relatively little data on the patterns of digital reading, including issues of where and what users read, and how they organize, plan and conduct their reading sessions. This paper reports the first data on mobile reading, combining insights from three different studies of users, including diary studies, interviews and ethnomethodological work. The data reveals that reading often depends on highly developed and rehearsed practices, especially when the reading is related to study or research. From this, we are able to identify a number of opportunities for further digital library research to better support the needs of users.
Books' Interest Grading and Fiction Readers' Search Actions During Query Reformulation Intervals BIBAFull-Text 27-36
  Anna Mikkonen; Pertti Vakkari
We compared fiction readers' search actions during various query reformulation intervals (QRIs). We aimed to understand how readers' search actions differed between successful and unsuccessful QRIs and which search actions predicted the selection of very interesting novels compared to less interesting ones. We conducted a controlled user study with 80 participants searching for interesting novels. Three types of browsing tasks and two types of catalogs were used. Our results demonstrated that browsing task type was associated with readers' document viewing behavior in terms of observed search result pages, opened book pages and dwell time on book pages. When browsing for topical novels, most effort was required to select somewhat interesting novels. When browsing for good novels, most effort was required to select very interesting ones. Logistic regression analysis showed that the most significant predictors of higher document value were the number of observed search result pages and opened book pages.

Session 2 -- Information Extraction

Online Person Name Disambiguation with Constraints BIBAFull-Text 37-46
  Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
While many clustering techniques have been successfully applied to the person name disambiguation problem, most do not address two main practical issues: allowing constraints to be added to the clustering process, and allowing data to be added incrementally without re-clustering the entire database. Constraints are particularly useful in a system such as a digital library, where users are allowed to make corrections to the disambiguated result. For example, a user correction on a disambiguation result specifying that a record does not belong to an author could be kept as a cannot-link constraint to be used in any future disambiguation (such as when new documents are added). Besides such user corrections, constraints also allow background heuristics to be encoded into the disambiguation process. We propose a constraint-based clustering algorithm for person name disambiguation, based on DBSCAN combined with a pairwise distance based on random forests. We further propose an extension to the density-based clustering algorithm (DBSCAN) to handle online clustering so that the disambiguation process can be done iteratively as new data points are added.
   Our algorithm utilizes similarity features based on both metadata information and citation similarity. We implement two types of clustering constraints to demonstrate the concept. Experiments on the CiteSeer data show that our model can achieve 0.95 pairwise F1 and 0.79 cluster F1. The presence of constraints also consistently improves the disambiguation result across different combinations of features.
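As a rough illustration of how a cannot-link constraint can steer incremental clustering, here is a toy sketch of online assignment under constraints. It is not the authors' implementation: the greedy arrival-order assignment, the distance function, and the eps radius are assumptions standing in for the paper's random-forest pairwise distance and DBSCAN extension.

```python
def online_constrained_cluster(records, distance, eps, cannot_link):
    """Assign records to clusters incrementally.

    records     -- iterable of items, processed in arrival order
    distance    -- pairwise distance function
    eps         -- density radius: a record may join a cluster only if
                   it lies within eps of some existing member
    cannot_link -- set of frozenset({a, b}) pairs that must not share a
                   cluster (e.g. user corrections from earlier runs)
    """
    clusters = []  # list of lists of records
    for rec in records:
        placed = False
        for cluster in clusters:
            near = any(distance(rec, m) <= eps for m in cluster)
            blocked = any(frozenset((rec, m)) in cannot_link for m in cluster)
            if near and not blocked:
                cluster.append(rec)
                placed = True
                break
        if not placed:
            clusters.append([rec])  # start a new cluster (new identity)
    return clusters
```

For author records, `distance` would compare metadata and citation features, and `cannot_link` would carry forward user corrections, so a blocked record starts a new author identity even when it is close to an existing cluster.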
Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages BIBAFull-Text 47-56
  Sawood Alam; Fateh ud din B. Mehmood; Michael L. Nelson
We propose an approach to indexing raster images of dictionary pages that requires very little manual effort and enables direct access to the appropriate pages of a dictionary for lookup. Accessibility is further improved by feedback and crowdsourcing, which enable highlighting of the specific location on the page where the lookup word is found, annotation, digitization, and fielded searching. This approach is equally applicable to simple scripts and complex writing systems. Using our proposed approach, we have built a Web application called "Dictionary Explorer" which supports word indexes in various languages; every language can have multiple dictionaries associated with it. Word lookup gives direct access to the appropriate pages of all the dictionaries of that language simultaneously. The application has exploration features like searching, pagination, and navigating the word index through a tree-like interface. The application also supports feedback, annotation, and digitization features. Apart from the scanned images, "Dictionary Explorer" aggregates results from various sources and user contributions in Unicode. We have evaluated the time required for indexing dictionaries of different sizes and complexities in the Urdu language and examined various trade-offs in our implementation. Using our approach, a single person can make a dictionary of 1,000 pages searchable in less than an hour.
Identifying Duplicate and Contradictory Information in Wikipedia BIBAFull-Text 57-60
  Sarah Weissman; Samet Ayhan; Joshua Bradley; Jimmy Lin
In this paper, we identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify sentences with high Jaccard similarity, followed by a pass to generate sentence clusters. Based on manual examination, we discovered that these clusters can be categorized into six different types: templates, identical sentences, copyediting, factual drift, references, and other. Two of these categories are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
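The minhash step behind this kind of near-duplicate detection can be sketched compactly. This is an illustrative single-machine version, not the paper's MapReduce implementation; the shingle size, the number of hash functions, and the salted-hash construction are arbitrary choices for the example.

```python
import random

def shingles(sentence, n=2):
    """Word n-grams of a sentence, as a set."""
    words = sentence.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items, num_hashes=128, seed=42):
    """One minimum per salted hash function over the item set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, it)) for it in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots where the minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The fraction of matching slots is an unbiased estimate of the Jaccard similarity of the two shingle sets, so sentence pairs scoring above a threshold can be passed on to the clustering pass.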
Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs BIBAFull-Text 61-64
  Nguyen Viet Cuong; Muthu Kumar Chandrasekaran; Min-Yen Kan; Wee Sun Lee
We address the tasks of recovering bibliographic and document structure metadata from scholarly documents. We leverage higher order semi-Markov conditional random fields to model long-distance label sequences, improving upon the performance of the linear-chain conditional random field model. We introduce the notion of extensible features, which allows the expensive inference process to be simplified through memoization, resulting in lower computational complexity. Our method significantly betters the state-of-the-art on three related scholarly document extraction tasks.

Session 3 -- Big Data, Big Resources

Towards Use And Reuse Driven Big Data Management BIBAFull-Text 65-74
  Zhiwu Xie; Yinlin Chen; Julie Speer; Tyler Walters; Pablo A. Tarazaga; Mary Kasarda
We propose a use- and reuse-driven big data management approach that fuses data repository and data processing capabilities in a co-located, public cloud. It answers the urgent data management needs of the growing number of researchers who don't fit the big science/small science dichotomy. This approach will allow researchers to more easily use, manage, and collaborate around big data sets, as well as give librarians the opportunity to work alongside researchers to preserve and curate data while it is still fresh and being actively used. It also provides the technological foundation to foster a sharing culture more aligned with the open source software development paradigm than with the lone-wolf, gift-exchanging sharing of small science or the top-down, highly structured sharing of big science. To materialize this vision, we provide a system architecture consisting of a scalable digital repository system coupled with co-located cloud storage and cloud computing, as well as a job scheduler and a deployment management system. Motivated by Virginia Tech's Goodwin Hall instrumentation project, we implemented and evaluated a prototype. The results show not only sufficient capacity for this particular case, but also near-perfect linear scalability of storage and data processing under moderately high workloads.
iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling BIBAFull-Text 75-84
  Gerhard Gossen; Elena Demidova; Thomas Risse
Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of collecting fresh and relevant Web and Social Web content for a topic of interest through the seamless integration of Web and Social Media sources in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawl.
The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi BIBAFull-Text 85-86
  Jimmy Lin
We demonstrate a prototype that takes advantage of open-source software to put a full-text searchable copy of Wikipedia on a Raspberry Pi, providing nearby devices access to content via wifi or bluetooth without requiring internet connectivity. This short paper articulates the advantages of such a form factor and provides an evaluation of browsing and search capabilities. We believe that personal digital libraries on lightweight mobile computing devices represent an interesting research direction to pursue.
Big Data Text Summarization for Events: A Problem Based Learning Course BIBAFull-Text 87-90
  Tarek Kanan; Xuan Zhang; Mohamed Magdy; Edward Fox
Problem/Project-Based Learning (PBL) is a highly effective student-centered teaching method, where student teams learn by solving problems. This paper describes an instance of PBL applied to digital library education. We show the design, implementation, results, and partial evaluation of a Computational Linguistics course that provides students an opportunity to engage in active learning about adding value to digital libraries with large collections of text, i.e., one aspect of "big data." Students engage in PBL with the semester-long challenge of generating good English summaries of an event, given a large collection from our webpage archives. Six teams, each working with a different type of event and applying three different summarization methods, learned how to generate good summaries; these have fair precision relative to the Wikipedia pages that describe their events.

Session 4 -- Working the Crowd

Multi-Emotion Estimation in Narratives from Crowdsourced Annotations BIBAFull-Text 91-100
  Lei Duan; Satoshi Oyama; Haruhiko Sato; Masahito Kurihara
Emotion annotations are important metadata for narrative texts in digital libraries. Such annotations are necessary for automatic text-to-speech conversion of narratives and affective education support and can be used as training data for machine learning algorithms to train automatic emotion detectors. However, obtaining high-quality emotion annotations is a challenging problem because it is usually expensive and time-consuming due to the subjectivity of emotion. Moreover, due to the multiplicity of "emotion", emotion annotations more naturally fit the paradigm of multi-label classification than that of multi-class classification since one instance (such as a sentence) may evoke a combination of multiple emotion categories. We thus investigated ways to obtain a set of high-quality emotion annotations ({instance, multi-emotion} paired data) from variable-quality crowdsourced annotations. A common quality control strategy for crowdsourced labeling tasks is to aggregate the responses provided by multiple annotators to produce a reliable annotation. Given that the categories of "emotion" have characteristics different from those of other kinds of labels, we propose incorporating domain-specific information of emotional consistencies across instances and contextual cues among emotion categories into the aggregation process. Experimental results demonstrate that, from a limited number of crowdsourced annotations, the proposed models enable gold standards to be more effectively estimated than the majority vote and the original domain-independent model.
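For reference, the per-category majority-vote baseline that the proposed models are compared against can be sketched in a few lines; the label names and the one-half threshold here are illustrative, not taken from the paper.

```python
from collections import Counter

def majority_vote_multilabel(annotations, threshold=0.5):
    """Aggregate multi-label annotations by per-category majority vote.

    annotations -- list of sets of labels, one set per annotator, all
                   describing the same instance (e.g. a sentence)
    threshold   -- a label is kept if more than this fraction of
                   annotators chose it
    """
    counts = Counter(label for labels in annotations for label in labels)
    n = len(annotations)
    return {label for label, c in counts.items() if c / n > threshold}
```

Because each emotion category is voted on independently, the aggregate can itself be a multi-label set, unlike a single-winner majority vote over emotion classes.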
Debugging a Crowdsourced Task with Low Inter-Rater Agreement BIBAFull-Text 101-110
  Omar Alonso; Catherine C. Marshall; Marc Najork
In this paper, we describe the process we used to debug a crowdsourced labeling task with low inter-rater agreement. In the labeling task, the workers' subjective judgment was used to detect high-quality social media content (interesting tweets), with the ultimate aim of building a classifier that would automatically curate Twitter content. We describe the effects of varying the genre and recency of the dataset, of testing the reliability of the workers, and of recruiting workers from different crowdsourcing platforms. We also examined the effect of redesigning the work itself, both to make it easier and to potentially improve inter-rater agreement. As a result of the debugging process, we have developed a framework for diagnosing similar efforts and a technique to evaluate worker reliability. The technique for evaluating worker reliability, Human Intelligence Data-Driven Enquiries (HIDDENs), differs from other such schemes in that it has the potential to produce useful secondary results and enhance performance on the main task. HIDDEN subtasks pivot around the same data as the main task, but ask workers questions with greater expected inter-rater agreement. Both the framework and the HIDDENs are currently in use in a production environment.
Why Do Social Media Users Share Misinformation? BIBAFull-Text 111-114
  Xinran Chen; Sei-Ching Joanna Sin; Yin-Leng Theng; Chei Sian Lee
Widespread misinformation on social media is a cause of concern. Currently, it is unclear what factors prompt regular social media users with no malicious intent to forward misinformation to their online networks. Using a questionnaire informed by the Uses and Gratifications theory and the literature on rumor research, this study asked university students in Singapore why they shared misinformation on social media. Gender differences were also tested. The study found that perceived information characteristics such as its ability to spark conversations and its catchiness were top factors. Self-expression and socializing motivations were also among the top reasons. Women reported a higher prevalence of misinformation sharing. The implications for the design of social media applications and information literacy training were discussed.
Improving Consistency of Crowdsourced Multimedia Similarity for Evaluation BIBAFull-Text 115-118
  Peter Organisciak; J. Stephen Downie
Building evaluation datasets for information retrieval is a time-consuming and exhausting activity. To evaluate research over novel corpora, researchers are increasingly turning to crowdsourcing to efficiently distribute the evaluation dataset creation among many workers. However, there has been little investigation into the effect of instrument design on data quality in crowdsourced evaluation datasets. We pursue this question through a case study, music similarity judgments in a music digital library evaluation, where we find that even with trusted graders song pairs are not consistently rated the same. We find that much of this low intra-coder consistency can be attributed to the task design and judge effects, concluding with recommendations for achieving reliable evaluation judgments for music similarity and other normative judgment tasks.
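Agreement in such studies is typically quantified with a chance-corrected statistic. As an illustration (the consistency measure used in the paper may differ), here is Cohen's kappa for two sets of categorical ratings over the same items:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters.

    ratings_a, ratings_b -- equal-length sequences of category labels,
    aligned item by item (for intra-coder consistency, the same rater's
    first and second pass over identical song pairs).
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is 1 for perfect agreement and 0 when the observed agreement is no better than chance, which makes low task-design-driven consistency visible even when raw percent agreement looks acceptable.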

Session 5 -- User Issues

What does Twitter Measure?: Influence of Diverse User Groups in Altmetrics BIBAFull-Text 119-128
  Simon Barthel; Sascha Tönnies; Benjamin Köhncke; Patrick Siehndel; Wolf-Tilo Balke
The most important goal for digital libraries is to ensure a high-quality search experience for all kinds of users. To attain this goal, it is necessary to have as much relevant metadata as possible at hand to assess the quality of publications. Recently, a new group of metrics has appeared that has the potential to raise the quality of publication metadata to the next level: altmetrics. These metrics try to reflect the impact of publications within the social web. However, it is currently still unclear if and how altmetrics should be used to assess the quality of a publication and how altmetrics are related to classical bibliographic metrics (e.g., citations). To gain more insight into what kinds of concepts are reflected by altmetrics, we conducted an in-depth analysis of a real-world dataset crawled from the Public Library of Science (PLOS). In particular, we analyzed whether the common approach of regarding the users in the social web as one homogeneous group is sensible or whether users need to be divided into diverse groups in order to receive meaningful results.
Unified Relevance Feedback for Multi-Application User Interest Modeling BIBAFull-Text 129-138
  Sampath Jayarathna; Atish Patra; Frank Shipman
A user often interacts with multiple applications while working on a task. User models can be developed individually at each of the individual applications, but there is no easy way to come up with a more complete user model based on the distributed activity of the user. To address this issue, this research studies the importance of combining various implicit and explicit relevance feedback indicators in a multi-application environment. It allows different applications used for different purposes by the user to contribute user activity and its context to mutually support users with unified relevance feedback. Using the data collected by the web browser, Microsoft Word and Microsoft PowerPoint, combinations of implicit relevance feedback with semi-explicit relevance feedback were analyzed and compared with explicit user ratings. Our results are two-fold: first we demonstrate the aggregation of implicit and semi-explicit user interest data across multiple everyday applications using our Interest Profile Manager (IPM) framework. Second, our experimental results show that incorporating implicit feedback with semi-explicit feedback for page-level user interest estimation resulted in a significant improvement over the content-based models.
PivotViz: Interactive Visual Analysis of Multidimensional Library Transaction Data BIBAFull-Text 139-142
  Matthias Nielsen; Kaj Grønbæk
As public libraries become increasingly digitized, they become producers of Big Data. Furthermore, public libraries are often obliged to make their data openly available as part of national open data policies, which have gained momentum in many countries including the USA, the UK, and Denmark. However, in order to utilize such data and make it intelligible for citizens, decision makers, or other stakeholders, raw data APIs are insufficient. Therefore, we have developed PivotViz, a comprehensible visualization technique that combines parallel coordinates and pivot tables. It provides a multidimensional, visual, interactive pivot table for analysis of library transactions -- loans, renewals, and returns of books and other materials -- across location and time. The paper presents the PivotViz technique and discusses its prospects based on implementations in two publicly available versions using open data from the two largest municipalities in Denmark. Examples of analysis results from these data illustrate the power of PivotViz.
User and Topical Factors in Perceived Self-Efficacy of Video Digital Libraries BIBAFull-Text 143-146
  Dan Albertson; Boryung Ju
A survey measured users' perceived self-efficacy about interactively retrieving digital video, both overall and according to different factors potentially related to user confidence preceding an actual video search session. A total of 270 surveys, with quantifiable responses, were collected and analyzed. T-tests and correlation tests produced significant findings about users' levels of perceived self-efficacy, including associations with topic familiarity, type or nature of the information need, and system context. Findings give researchers a better understanding of users' confidence and preconceptions prior to interactive information retrieval (IIR) sessions for video, providing valuable insight about users' attitudes which can be used to promote initial and continued use of interactive tools like digital libraries.

Session 6 -- Ontologies and Semantics

Improving Access to Large-scale Digital Libraries Through Semantic-enhanced Search and Disambiguation BIBAFull-Text 147-156
  Annika Hinze; Craig Taube-Schock; David Bainbridge; Rangi Matamua; J. Stephen Downie
With 13,000,000 volumes comprising 4.5 billion pages of text, it is currently very difficult for scholars to locate relevant sets of documents that are useful in their research from the HathiTrust Digital Library (HTDL) using traditional lexically-based retrieval techniques. Existing document search tools and document clustering approaches use purely lexical analysis, which cannot address the inherent ambiguity of natural language. A semantic search approach offers the potential to overcome the shortcoming of lexical search, but even if an appropriate network of ontologies could be decided upon it would require a full semantic markup of each document. In this paper, we present a conceptual design and report on the initial implementation of a new framework that affords the benefits of semantic search while minimizing the problems associated with applying existing semantic analysis at scale. Our approach avoids the need for complete semantic document markup using pre-existing ontologies by developing an automatically generated Concept-in-Context (CiC) network seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Our Capisco system analyzes documents by the semantics and context of their content. The disambiguation of search queries is done interactively, to fully utilize the domain knowledge of the scholar. Our method achieves a form of semantic-enhanced search that simultaneously exploits the proven scale benefits provided by lexical indexing.
Demystifying the Semantics of Relevant Objects in Scholarly Collections: A Probabilistic Approach BIBAFull-Text 157-164
  Jose Maria Gonzalez Pinto; Wolf-Tilo Balke
Efforts to make highly specialized knowledge accessible through scientific digital libraries need to go beyond mere bibliographic metadata, since here information search is mostly entity-centric. Previous work has realized this trend and developed different methods to recognize and (to some degree even automatically) annotate several important types of entities: genes and proteins, chemical structures and molecules, or drug names, to name but a few. Moreover, such entities are often cross-referenced with entries in curated databases. However, several questions still remain to be answered: Given a scientific discipline, what are the important entities? How can they be automatically identified? Are really all of them relevant, i.e., do all of them carry deeper semantics for assessing a publication? How can they be represented, described, and subsequently annotated? How can they be used for search tasks? In this work we focus on answering some of these questions. We claim that to bring the use of scientific digital libraries to the next level we must treat topic-specific entities as first-class citizens and deeply integrate their semantics into the search process. To support this, we propose a novel probabilistic approach that not only successfully provides a solution to the integration problem, but also demonstrates how to leverage the knowledge encoded in entities and provides insights to explore the use of our approach in different scenarios. Finally, we show how our results can benefit information providers.
An Ontological Framework for Describing Games BIBAFull-Text 165-168
  David Dubin; Jacob Jett
This paper describes an ontological framework for game description. Games are a multi-billion dollar industry and are cultural heritage objects studied by a growing number of scholars. The conceptual model described here supports the description of both individual games and relationships among games, their versions and variants for more effective discovery, more reliable provenance, and detailed scoping of copyright, patent, and trademark claims.
Building Complex Research Collections in Digital Libraries: A Survey of Ontology Implications BIBAFull-Text 169-172
  Terhi Nurmikko-Fuller; Kevin R. Page; Pip Willcox; Jacob Jett; Chris Maden; Timothy Cole; Colleen Fallaw; Megan Senseney; J. Stephen Downie
Bibliographic metadata standards are a longstanding mechanism for Digital Libraries to manage records and express relationships between them. As digital scholarship, particularly in the humanities, incorporates and manipulates these records in an increasingly direct manner, existing systems are proving insufficient for providing the underlying addressability and relational expressivity required to construct and interact with complex research collections. In this paper we describe motivations for these "worksets" and the technical requirements they raise. We survey the coverage of existing bibliographic ontologies in the context of meeting these scholarly needs, and finally provide an illustrated discussion of potential extensions that might fully realize a solution.

Session 7 -- Non-text Collections

WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document BIBAFull-Text 173-182
  Yuehan Wang; Liangcai Gao; Simeng Wang; Zhi Tang; Xiaozhong Liu; Ke Yuan
Nowadays, mathematical information is increasingly available in websites and repositories such as arXiv, Wikipedia, and a growing number of digital libraries. Mathematical formulae are highly structured and usually presented in layout formats such as PDF, LaTeX, and Presentation MathML. The differences in presentation between text and formulae challenge traditional text-based indexing and retrieval methods. To address this challenge, this paper proposes an upgraded Mathematical Information Retrieval (MIR) system, WikiMirs 3.0, based on the context, structure, and importance of formulae in a document. In WikiMirs 3.0, users can easily "cut" formulae and contexts from PDF documents as well as type in queries. Furthermore, a novel hybrid indexing and matching model is proposed to support both exact and fuzzy matching. In the hybrid model, both context and structure information of formulae are taken into consideration. In addition, the concept of formula importance within a document is introduced into the model for more reasonable ranking. Experimental results, compared with two classical MIR systems, demonstrate that the proposed system along with the novel model provides higher accuracy and better ranking results over Wikipedia.
Topic Modeling Users' Interpretations of Songs to Inform Subject Access in Music Digital Libraries BIBAFull-Text 183-186
  Kahyun Choi; Jin Ha Lee; Craig Willis; J. Stephen Downie
The assignment of subject metadata to music is useful for organizing and accessing digital music collections. Since manual subject annotation of large-scale music collections is labor-intensive, automatic methods are preferred. Topic modeling algorithms can be used to automatically identify latent topics from appropriate text sources. Candidate text sources such as song lyrics are often too poetic, resulting in lower-quality topics. Users' interpretations of song lyrics provide an alternative source. In this paper, we propose an automatic topic discovery system from web-mined user-generated interpretations of songs to provide subject access to a music digital library. We also propose and evaluate filtering techniques to identify high-quality topics. In our experiments, we use 24,436 popular songs that exist in both the Million Song Dataset and songmeanings.com. Topic models are generated using Latent Dirichlet Allocation (LDA). To evaluate the coherence of learned topics, we calculate the Normalized Pointwise Mutual Information (NPMI) of the top ten words in each topic based on occurrences in Wikipedia. Finally, we evaluate the resulting topics using a subset of 422 songs that have been manually assigned to six subjects. Using this system, 71% of the manually assigned subjects were correctly identified. These results demonstrate that topic modeling of song interpretations is a promising method for subject metadata enrichment in music digital libraries. It also has implications for affording similar access to collections of poetry and fiction.
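The NPMI coherence evaluation described above can be sketched concretely; the code below is an illustrative computation over document co-occurrence counts, not the authors' implementation, and the function names and count structures are hypothetical:

```python
import math
from itertools import combinations

def npmi(p_xy, p_x, p_y):
    """Normalized Pointwise Mutual Information, in [-1, 1]."""
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))

def topic_coherence(top_words, doc_freq, co_doc_freq, n_docs):
    """Average NPMI over all pairs of a topic's top words.

    doc_freq: word -> number of documents containing it
    co_doc_freq: (w1, w2) -> number of documents containing both
    n_docs: size of the reference corpus (e.g., Wikipedia)
    """
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p_x = doc_freq[w1] / n_docs
        p_y = doc_freq[w2] / n_docs
        p_xy = co_doc_freq.get((w1, w2), 0) / n_docs
        if p_xy > 0:
            scores.append(npmi(p_xy, p_x, p_y))
        else:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
    return sum(scores) / len(scores)
```

A topic whose top ten words frequently co-occur in Wikipedia articles scores near 1, while a topic of unrelated words scores near -1, which is the basis for filtering out low-quality topics.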
Towards a Distributed Digital Library for Sign Language Content BIBAFull-Text 187-190
  Frank Shipman; Ricardo Gutierrez-Osuna; Tamra Shipman; Caio Monteiro; Virendra Karappa
The Internet provides access to content in almost all languages through a combination of crawling, indexing, and ranking capabilities. The ability to locate content on almost any topic has become expected for most users. But this is not the case for those whose primary language is a sign language. Members of this community communicate via the Internet, but they pass around links to videos via email and social media. In this paper, we describe the need for, the architecture of, and initial software components of a distributed digital library of sign language content, called SLaDL. Our initial efforts have been to develop a model of collection development that enables community involvement without assuming it. This goal necessitated the development of video processing techniques that automatically detect sign language content in video.
Analyzing News Events in Non-Traditional Digital Library Collections BIBAFull-Text 191-194
  Martin Klein; Peter Broadwell
Digital libraries are called upon to organize, aggregate, and steward born-digital news collections. Rather than continuously building silos of such non-traditional collections, digital libraries are seeking to manage these collections in conjunction with each other in order to provide the most value to scholars. We here present the results of a preliminary study analyzing characteristics of items in two collections of digital news media: television broadcasts and social media coverage. Our findings indicate a number of factors that similar efforts will need to take into consideration when linking digital "news" collections similar to ours.

Session 8 -- Temporality

Time will Tell: Temporal Linking of News Stories BIBAFull-Text 195-204
  Thomas Bögel; Michael Gertz
Readers of news articles are typically faced with the problem of getting a good understanding of a complex story covered in an article. However, as news articles mainly focus on current or recent events, they often do not provide sufficient information about the history of an event or topic, leaving the user alone in discovering and exploring other news articles that might be related to a given article. This is a time-consuming and non-trivial task, and the only help provided by some news outlets is a list of related articles or a few links within an article itself. What further complicates this task is that many of today's news stories cover a wide range of topics and events even within a single article, thus leaving the realm of traditional approaches that track a single topic or event over time.
   In this paper, we present a framework to link news articles based on temporal expressions that occur in the articles, following the idea "if an article refers to something in the past, then there should be an article about that something". Our approach aims to recover the chronology of one or more events and topics covered in an article, leading to an information network of articles that can be explored in a thematic and, in particular, a chronological fashion. For this, we propose a measure of the relatedness of articles that is primarily based on temporal expressions in articles but also exploits other information such as persons mentioned and keywords. We provide a comprehensive evaluation that demonstrates the functionality of our framework using a multi-source corpus of recent German news articles.
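A relatedness measure of this general shape can be sketched as a weighted combination of feature overlaps; the field names, weights, and Jaccard combination below are illustrative assumptions, not the paper's actual model:

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def temporal_relatedness(art_a, art_b, w_time=0.6, w_person=0.2, w_kw=0.2):
    """Hypothetical relatedness between two articles, each a dict with
    'dates' (normalized temporal expressions), 'persons', and 'keywords'.
    Temporal overlap dominates; persons and keywords act as supporting evidence.
    """
    return (w_time * jaccard(art_a["dates"], art_b["dates"])
            + w_person * jaccard(art_a["persons"], art_b["persons"])
            + w_kw * jaccard(art_a["keywords"], art_b["keywords"]))
```

Ranking candidate earlier articles by such a score for each temporal expression in a new article yields the chronological links that form the article network.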
Predicting Temporal Intention in Resource Sharing BIBAFull-Text 205-214
  Hany M. SalahEldeen; Michael L. Nelson
When users post links to web pages on Twitter, there is a time delta between when the post was shared (t tweet) and when it was read (t click). When this time delta is small, there is often no change in the page's state. However, when shared content is read long after posting, the dynamic nature of the web means the page's state may have changed, and the intention of the author needs to be inferred. In this work, we enhance a prior temporal intention model and tackle its shortcomings by incorporating extended linguistic feature analysis, replacing the prior textual similarity measure with a semantic similarity measure based on latent topic detection trained on the English Wikipedia corpus, and finally by enriching and balancing the training dataset. We uncovered three different intention behaviors with respect to time: stable intention, intention changing from current to past, and undefined intention. Using these classes and only the information available at posting time from the tweet and the current state of the resource, we correctly predict the temporal intention classification and strength with 77% accuracy.

Session 9 -- Archiving, Repositories, and Content

Before the Repository: Defining the Preservation Threats to Research Data in the Lab BIBAFull-Text 215-222
  Stacy T. Kowalczyk
This paper describes the results of a large survey designed to quantify the risks and threats to the preservation of the research data in the lab and to determine the mitigating actions of researchers. A total of 724 National Science Foundation awardees completed this survey. Identifying risks and threats to digital preservation has been a significant research stream. Much of this work has been within the context of a preservation technology infrastructure such as data archives for a digital repository. This study looks at the risks and threats to research data prior to its inclusion in a preservation technology infrastructure. The greatest threat to preservation is human error, followed by equipment malfunction, obsolete software, and data corruption. Lost and mislabeled media are not components in the threat taxonomies developed for repositories; however, they do represent an important threat to research data in the lab. Researchers have recognized the need to mitigate the risks inherent in maintaining digital data by implementing data management in their lab environments and have taken their responsibility as data managers seriously; however, they would still prefer to have professional data management support.
How Well Are Arabic Websites Archived? BIBAFull-Text 223-232
  Lulwah M. Alkwai; Michael L. Nelson; Michele C. Weigle
It has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well Arabic-language web sites are indexed and archived. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), and Raddadi and Star28 (both primarily Arabic language). Using language identification tools, we eliminated pages not in Arabic (e.g., English-language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic-language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic-language pages. We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site, and deeper links into the site are not well archived, 5) presence in a directory positively impacts indexing, and presence in the DMOZ directory, specifically, positively impacts archiving.
No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving BIBAFull-Text 233-236
  Ke Zhou; Claire Grover; Martin Klein; Richard Tobin
The citation of resources is a fundamental part of scholarly discourse. Due to the popularity of the web, there is an increasing trend for scholarly articles to reference web resources (e.g. software, data). However, due to the dynamic nature of the web, the referenced links may become inaccessible ('rotten') sometime after publication, returning a "404 Not Found" HTTP error. In this paper we first present some preliminary findings of a study of the persistence and availability of web resources referenced from papers in a large-scale scholarly repository. We reaffirm previous research that link rot is a serious problem in the scholarly world and that current web archives do not always preserve all rotten links. Therefore, a more pro-active archival solution needs to be developed to further preserve web content referenced in scholarly articles. To this end, we propose to apply machine learning techniques to train a link rot predictor for use by an archival framework to prioritise pro-active archiving of links that are more likely to be rotten. We demonstrate that we can obtain a fairly high link rot prediction AUC (0.72) with only a small set of features. By simulation, we also show that our prediction framework is more effective than current web archives for preserving links that are likely to be rotten. This work has a potential impact for the scholarly world where publishers can utilise this framework to prioritise the archiving of links for digital preservation, especially when there is a large quantity of links to be archived.
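The prioritisation idea can be illustrated with a toy feature-based scorer; the paper trains a real classifier, so the features and linear weights below are invented purely for illustration:

```python
from urllib.parse import urlparse

def rot_risk_features(url, age_years):
    """Toy feature vector for a referenced link (illustrative features only)."""
    parsed = urlparse(url)
    return {
        "age_years": age_years,
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "has_query": int(bool(parsed.query)),
        "is_pdf": int(parsed.path.lower().endswith(".pdf")),
    }

def rot_score(features, weights):
    """Linear risk score; a higher score means archive sooner."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def prioritise(urls_with_age, weights):
    """Order links so the most rot-prone are archived first."""
    scored = [(rot_score(rot_risk_features(u, a), weights), u)
              for u, a in urls_with_age]
    return [u for _, u in sorted(scored, reverse=True)]
```

In the framework described above, the scores would come from the trained predictor rather than hand-set weights, but the archival queue is ordered in the same way.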
The Problem of "Additional Content" in Video Games BIBAFull-Text 237-240
  Jin Ha Lee; Jacob Jett; Andrew Perti
Additional content for video games, such as mods (modifications) or DLC (downloadable content), is increasingly prevalent in the current video game market. For cultural heritage institutions with video game collections, such content introduces various philosophical and practical challenges across multiple aspects, including acquisition, description, access/use, and preservation. In this paper, we discuss these challenges and propose a solution that can alleviate the problem of managing a digital library collection that includes video games with additional content. While our discussion and proposed solution focus on video games, they also have broader implications for cultural heritage institutions that manage other types of digital and multimedia objects with additional content, as well as serial publications.

Poster & Demo Session

Using Transactional Web Archives To Handle Server Errors BIBAFull-Text 241-242
  Zhiwu Xie; Prashant Chandrasekar; Edward A. Fox
We describe a web archiving application that handles server errors using the most recently archived representation of the requested web resource. The application is developed as an Apache module. It leverages the transactional web archiving tool SiteStory, which archives all previously accessed representations of web resources originating from a website. This application helps to improve the website's quality of service by temporarily masking server errors from the end user and gaining precious time for the system administrator to debug and recover from server failures. By providing pertinent support to website operations, we aim to reduce the resistance to transactional web archiving, which in turn may lead to a better coverage of web history.
Mobile Mink: Merging Mobile and Desktop Archived Webs BIBAFull-Text 243-244
  Wesley Jordan; Mat Kelly; Justin F. Brunelle; Laura Vobrak; Michele C. Weigle; Michael L. Nelson
We describe Mobile Mink, a mobile app that extends Mink, a browser extension integrating the live and archived web. Mobile Mink discovers mobile and desktop URIs and provides the user with an aggregated TimeMap of both mobile and desktop mementos. Mobile Mink also allows users to submit mobile and desktop URIs for archiving at the Internet Archive and Archive.today. Mobile Mink helps to increase the archival coverage of the growing mobile web.
Combination Effects of Word-based and Extended Co-citation Search Algorithms BIBAFull-Text 245-246
  Masaki Eto
In the field of academic document search, citations are often used for measuring implicit relationships between documents. Recently, some studies have attempted to extend co-citation searching. However, these studies mainly focus on comparisons of traditional co-citation and extended co-citation search methods; combination effects of word-based and extended co-citation search algorithms have not yet been sufficiently evaluated. This paper empirically evaluates the search performance of the combination search by using a test collection comprising about 152,000 documents and a metric 'precision at k.' The experimental results indicate that the combination search outperforms two baseline methods: a word-based search and a combination search of word-based and traditional co-citation search algorithms.
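Co-citation counting of the kind these methods build on can be sketched as follows; the linear combination is a hypothetical stand-in for the paper's actual scoring, added only to show how word-based and co-citation evidence might be mixed:

```python
from collections import defaultdict
from itertools import combinations

def co_citation_counts(citing_to_cited):
    """Count how often each pair of documents is cited together.

    citing_to_cited: dict mapping a citing document -> set of cited documents.
    Returns a symmetric dict of pair counts.
    """
    counts = defaultdict(int)
    for cited in citing_to_cited.values():
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def combined_score(word_score, cocit_count, alpha=0.5):
    """One simple way to blend word-based and co-citation evidence."""
    return alpha * word_score + (1 - alpha) * cocit_count
```

Documents that are frequently cited together receive high co-citation counts, which the combination search can use alongside term matching.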
Studying Chinese-English Mixed Language Queries from the User Perspectives BIBAFull-Text 247-248
  Hengyi Fu; Shuheng Wu
With the increasing number of multilingual webpages on the Internet, cross-language information retrieval has become an important research issue. Using Activity Theory as a theoretical framework, this study employs semi-structured interviews with key informants who are frequent users of Chinese-English mixed language queries in web searching. The findings present the context of and reasons for using Chinese-English mixed language queries, which can inform the design of cross-language controlled vocabularies and information retrieval systems.
Content Analysis of Social Tags Generated by Health Consumers BIBAFull-Text 249-250
  Soohyung Joo; Yunseon Choi
This poster presents preliminary findings of user tag analysis in the domain of consumer health information. To obtain user terms, 36,205 tags from 38 consumer health information sites were collected from delicious.com. Content analysis was applied to identify the dimensions and types of the collected tags. The preliminary findings showed that user-generated tags cover a variety of aspects of health information, ranging from general terms and subject terms to knowledge type and audience. General terms and subject terms were dominant, accounting for 31.7% and 22.8% of the tags, respectively.
Automatic Classification of Research Documents using Textual Entailment BIBAFull-Text 251-252
  Bolanle Adefowoke Ojokoh; Olatunji Mumini Omisore; Oluwarotimi Williams Samuel
The ever-growing accumulation of Internet documents has become a pressing issue that requires systematic ways to construct what we need from what we have. Manual and semi-manual document classification techniques have facilitated the retrieval and maintenance of document repositories for easy access; however, they are customarily painstaking and labor-intensive. Herein, we propose a document classification model based on automatic analysis of natural language meaning. The model is made up of application, business, and storage layers. The business layer, as the core component, automatically extracts sentences containing keywords from research documents and classifies them using the geometrical similarity of their sentential entailments.
Case Study of Waiting List on WPLC Digital Library BIBAFull-Text 253-254
  Wooseob Jeong; Hyejung Han; Laura Ridenour
With the increasing popularity of e-books and audiobooks provided by public libraries in the U.S., the demand does not seem to be met with sufficient supply, as many popular titles require months of waiting time. In this study, we collected data from the Wisconsin Public Library Consortium's digital libraries service once a day for more than two months for selected popular titles. This data reflects the current supply and demand of popular titles in public libraries' digital library services. Based on our data analysis and observation, we suggest ways to achieve faster circulation, which ultimately allows for better services to library users.
Analyzing Tagging Patterns by Integrating Visual Analytics with the Inferential Test BIBAFull-Text 255-256
  Yunseon Choi
Due to the large volume and complexity of data, visual analytics has become increasingly helpful for interpreting and analyzing it. The box plot is one of the most common graphical techniques for presenting and summarizing statistics. In this paper, we focus on discussing tagging patterns by integrating visualization assessment using the box plot with the Shapiro-Wilk test.
ConfAssist: A Conflict Resolution Framework for Assisting the Categorization of Computer Science Conferences BIBAFull-Text 257-258
  Mayank Singh; Tanmoy Chakraborty; Animesh Mukherjee; Pawan Goyal
Classifying publication venues into top-tier or non-top-tier is quite subjective and can be debatable at times. In this paper, we propose ConfAssist, a novel assisting framework for conference categorization that aims to address the limitations in the existing systems and portals for venue classification. We identify various features related to the stability of conferences that might help us separate a top-tier conference from the rest of the lot. While there are many clear cases where expert agreement can be almost immediately achieved as to whether a conference is top-tier or not, there are equally many cases that can result in a conflict even among the experts. ConfAssist tries to serve as an aid in such cases by increasing the confidence of the experts in their decision. A human judgment survey was conducted with 28 domain experts. The results were quite impressive, with 91.6% classification accuracy.
Combining Classifiers and User Feedback for Disambiguating Author Names BIBAFull-Text 259-260
  Emília A. de Souza; Anderson A. Ferreira; Marcos André Gonçalves
Historically, supervised methods have been the most effective ones for author name disambiguation tasks. Here, we propose a specific manner of combining supervised techniques with user feedback. Although we use supervised techniques, the only user effort is to provide feedback on results, since the initial training data is automatically generated. Our experiments show gains of up to 20% in disambiguation performance against representative baselines.
Using the Business Model Canvas to Support a Risk Assessment Method for Digital Curation BIBAFull-Text 261-262
  Diogo Proença; Ahmad Nadali; José Borbinha
This poster presents a pragmatic risk assessment method based on best practice from the ISO 31000 family of standards regarding risk management. The method proposed is supported by established risk management concepts that can be applied to help a data repository gain awareness of the risks and of the costs of the controls for the identified risks. In simple terms, the technique that supports this method is a pragmatic risk registry that can be used to identify risks from a Business Model Canvas of an organization. A Business Model Canvas is a model used in strategic management to document existing business models and develop new ones.
Grading Degradation in an Institutionally Managed Repository BIBAFull-Text 263-264
  Luis Meneses; Sampath Jayarathna; Richard Furuta; Frank Shipman
It is not unusual for digital collections to degrade and suffer from problems associated with unexpected change. In an analysis of the ACM conference list, we found that categorizing the degree of change affecting a digital collection over time is a difficult task. More specifically, we found that categorizing this degree of change is not a binary problem where documents are either unchanged or they have changed so dramatically that they do not fit within the scope of the collection. It is, in part, a characterization of the intent of the change. In this work, we examine and categorize the various degrees of change that digital documents endure within the boundaries of an institutionally managed repository.
Automatically Generating a Concept Hierarchy with Graphs BIBAFull-Text 265-266
  Pucktada Treeratpituk; Madian Khabsa; C. Lee Giles
We propose a novel graph-based approach for constructing a concept hierarchy from a large text corpus. Our algorithm incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, we first extract topical terms and their relationships from the corpus. The algorithm then constructs a weighted graph representing topics and their associations. A graph partitioning algorithm is then used to recursively partition the topic graph into a taxonomy. For evaluation, we apply our approach to articles, primarily in computer science, in the CiteSeerX digital library and search engine.
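The recursive-partitioning structure can be illustrated with a toy divisive splitter; this drops the weakest association edges until the topic graph disconnects and recurses on each part, as a simplistic stand-in for a real graph-partitioning algorithm, with made-up node names and weights:

```python
def _components(nodes, kept_edges):
    """Connected components of an undirected graph via union-find."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in kept_edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

def recursive_partition(nodes, weights, min_size=2):
    """Toy divisive taxonomy builder.

    weights maps (a, b) tuples (a < b) to association strength.
    Weakest edges are removed until the graph splits, then each
    part is partitioned recursively, yielding a nested list tree.
    """
    nodes = sorted(nodes)
    if len(nodes) <= min_size:
        return nodes
    edges = sorted((w, e) for e, w in weights.items()
                   if e[0] in nodes and e[1] in nodes)
    while edges:
        edges.pop(0)  # drop the current weakest edge
        parts = _components(nodes, [e for _, e in edges])
        if len(parts) > 1:
            return [recursive_partition(p, weights, min_size)
                    for p in parts]
    return nodes
```

Each level of recursion produces one level of the taxonomy, so strongly associated topics end up as siblings deep in the tree while weakly associated topic groups split apart near the root.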
Taxonomy Induction and Taxonomy-based Recommendations for Online Courses BIBAFull-Text 267-268
  Shuo Yang; Yansong Feng; Lei Zou; Aixia Jia; Dongyan Zhao
Taxonomy is a useful and ubiquitous way to organize knowledge. As online education attracts more and more attention, organizing lecture notes and exercises from different online sources in a more structured form has become an effective way to help users better access course materials. However, it is expensive and time-consuming to manually annotate large amounts of corpora to build a detailed taxonomy. In this paper, we propose a taxonomy induction framework with limited human involvement. We also show that the constructed taxonomy can be used to improve lecture note and exercise recommendations.
Analyzing User Requests for Anime Recommendations BIBAFull-Text 269-270
  Jin Ha Lee; Yuna Shim; Jacob Jett
Anime is increasingly becoming recognized as an important commercial product and cultural artifact. However, little is known regarding users' information needs and behavior related to anime. This study specifically attempts to improve our understanding of how people seek anime recommendations. We analyzed 546 user questions in natural language, collected from a Korean Q&A website Naver Knowledge-iN, where users are asking for anime recommendations. The findings suggest the importance of establishing robust metadata for the seven commonly used features for anime recommenders (i.e., title, genre, artistic style, story, character description, series title, and mood) in digital libraries, as well as allowing users to specify known anime and series titles as examples for seeking similar items, or examples of the kinds of items to be excluded.
Computationally Supported Collection-level Descriptions in Large Heterogeneous Metadata Aggregations BIBAFull-Text 271-272
  Unmil P. Karadkar; Karen Wickett; Madhura Parikh; Richard Furuta; Joshua Sheehy; Meghanath Reddy Junnutula; Jeremy Tzou
The Computational Collection Description project is developing mechanisms for generating field-specific collection-level descriptors from item values. Using the Digital Public Library of America (DPLA) as a sample data set, we describe a flexible, extensible architecture for processing field-level values, an augmented Collection class to record the generated metadata, and our early results of enhancements for a DPLA collection.
Read between the lines: A Machine Learning Approach for Disambiguating the Geo-location of Tweets BIBAFull-Text 273-274
  Sunshin Lee; Mohamed Farag; Tarek Kanan; Edward A. Fox
This paper describes a Machine Learning (ML) approach for extracting named entities and disambiguating the location of tweets based on those named entities and related content. We conducted experiments with tweets (e.g., about potholes), and found significant improvement in disambiguating tweet locations using a ML algorithm along with the Stanford NER. Adding state information predicted by our classifiers increases the chance of unambiguously finding the state-level geo-location by up to 80%.
Modeling Faceted Browsing with Category Theory to Support Interoperability and Reuse BIBAFull-Text 275-276
  Daniel R. Harris
Faceted browsing has become ubiquitous with modern digital libraries and online search engines, yet the process is still difficult to abstractly model in a manner that supports the development of interoperable and reusable interfaces. Existing efforts in facet modeling are based upon set theory, formal concept analysis, and light-weight ontologies, but in many regards, they are implementations of faceted browsing rather than a specification of the basic, underlying structures and interactions. We propose category theory as a theoretical foundation for faceted browsing and demonstrate how the interactive process can be mathematically abstracted in a way that naturally supports interoperability and reuse.
An Instrument for Merging of Bibliographic Databases BIBAFull-Text 277-278
  Anna A. Knyazeva; Oleg S. Kolobov; Fjodor E. Tatarsky; Igor Yu. Turchanovsky
This paper considers the process of merging two or more library catalogues. Rather than taking a simple union of the different resources, it is necessary to detect duplicates and merge them into one database. We have developed Cflib, a toolbox for duplicate detection and merging. It is based on standard principles of record linkage and has a deliberately simple architecture.
Databrary: Enabling Sharing and Reuse of Research Video BIBAFull-Text 279-280
  Dylan A. Simon; Andrew S. Gordon; Lisa Steiger; Rick O. Gilmore
Video and audio recordings serve as a primary data source in many fields, especially in the social and behavioral sciences. Recordings present unique opportunities for reuse and reanalysis for novel scientific purposes, but also present challenges related to respecting the privacy of individuals depicted. Databrary is a web-based service for sharing and reusing the video data created by researchers in the developmental and learning sciences. By investigating how researchers organize, analyze, and mine their own recordings, we have implemented a system that empowers researchers to capture, store, and share recordings in a standardized way. This demo will provide a tour through the Databrary service, highlighting how it promotes storage, management, sharing, and reuse of research data, controls access privileges to restricted human subject data, and facilitates browsing and discoverability of datasets.
The RMap Project: Capturing and Preserving Associations amongst Multi-Part Distributed Publications BIBAFull-Text 281-282
  Karen L. Hanson; Tim DiLauro; Mark Donoghue
The goal of the RMap Project is to create a prototype service that can capture and preserve maps of relationships amongst the increasingly distributed components (article, data, software, workflow objects, multimedia, etc.) that comprise the new model for scholarly publication. The demonstration will provide a tour of some of the features of the initial web service prototype. This will include examples of Distributed Scholarly Complex Objects (DiSCOs) and associated provenance data in RMap, as well as some of the options that users might have for interacting with the framework.
5ex+y: Searching over Mathematical Content in Digital Libraries BIBAFull-Text 283-284
  Arthur Oviedo; Nikos Kasioumis; Karl Aberer
This paper presents 5ex+y, a system that is able to extract, index and query mathematical content expressed as mathematical expressions, complementing the CERN Document Server (CDS). We present the most important aspects of its design, our approach to model the relevant features of the mathematical content, and provide a demonstration of its searching capabilities.
Reconstruction of the US First Website BIBAFull-Text 285-286
  Ahmed AlSum
The idea of the Web originated in 1989 with a proposal from Sir Tim Berners-Lee. The first US website was developed at SLAC in 1991. This early version of the Web and the subsequent updates until 1998 have been preserved by the SLAC archive and history office for many years. In this paper, we discuss the strategy and techniques used to reconstruct this early website and make it available through the Stanford Web Archive Portal.

Panels

Lifelong Digital Libraries BIBAFull-Text 287
  Cathal Gurrin; Frank Hopfgartner
The organisation of personal data is receiving increasing research attention due to the challenges that are faced in gathering, enriching, searching and visualising this data. Given the increasing quantities of personal data being gathered by individuals, the concept of a lifelong digital library of rich multimedia and sensory content for every individual is becoming a reality. This panel brought together researchers from different parts of the information retrieval and digital libraries community to debate the opportunities and challenges for researchers in this new and challenging area.
Organizational Strategies for Cultural Heritage Preservation BIBAFull-Text 289
  Paul Logasa Bogen, II; Katherine Skinner; Piotr Adamczyk; Unmil Karadkar
Cultural heritage content is increasingly being both created digitally and digitized. How to preserve this content has been a much discussed and debated question in the Digital Libraries and Digital Humanities communities. Many concerns have been raised around the organizational challenges. Centralized preservation is often praised for unified access and consistency, but at the same time is criticized for its reliance on the continued interest of a small number of maintainers. Alternatively, decentralized preservation leads to better longevity, but often at a cost of consistency or ease of access. Beyond this question, there are many other organizational issues, such as the role of states and commercial entities in preservation, and concerns about ownership, privacy, and acceptable use of materials. This panel will discuss these issues with the goal of finding a balance between these often conflicting approaches.


Introduction to Digital Libraries BIBAFull-Text 291
  Edward A. Fox
This tutorial is a thorough and deep introduction to the DL field, providing a firm foundation: covering key concepts and terminology, as well as services, systems, technologies, methods, standards, projects, issues, and practices. It introduces and builds upon a firm theoretical foundation (starting with the '5S' set of intuitive aspects: Streams, Structures, Spaces, Scenarios, Societies), giving careful definitions and explanations of all the key parts of a 'minimal digital library', and expanding from that basis to cover key DL issues. Illustrations come from a set of case studies. Attendees will be exposed to four Morgan & Claypool books (2012-2014) that elaborate on 5S. Complementing the coverage of '5S' will be an overview of key aspects of the DELOS Reference Model and DL.org activities. Further, use of a Hadoop cluster supporting DLs will be described.
Digital Data Curation Essentials for Data Scientists and Data Curators and Librarians BIBAFull-Text 293-294
  Helen R. Tibbo; Carolyn Hank
This paper provides a detailed description of a full-day digital data curation tutorial held at JCDL'15.
Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research BIBAFull-Text 295
  Jaimie Murdock; Jiaan Zeng; Robert H. McDonald
In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over collections of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive the formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then use the VM's secure mode to download texts and analyze their contents.
Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories BIBAFull-Text 297-298
  Anderson A. Ferreira; Marcos André Gonçalves; Alberto H. F. Laender
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. This problem occurs when an author publishes works under distinct names or distinct authors publish works under similar names. The challenges of dealing with author name ambiguity have led to a myriad of name disambiguation methods. In this tutorial, we characterize such methods by means of a proposed taxonomy, present an overview of some of the most representative ones and discuss open challenges.

Workshop Summaries

WOSP2015: 4th International Workshop on Mining Scientific Publications BIBFull-Text 299-301
  Petr Knoth; Kris Jack; Lucas Anastasiou; Nuno Freire; Nancy Pontika; Drahomira Herrmannova
Web Archiving and Digital Libraries (WADL) BIBAFull-Text 303
  Edward A. Fox; Zhiwu Xie
This workshop will explore the integration of Web archiving and digital libraries, so that the complete life cycle involved is covered: creation/authoring, uploading/publishing in the Web (2.0), (focused) crawling, indexing, exploration (searching, browsing), ..., archiving (of events). It will include particular coverage of current topics of interest: big data, mobile web archiving, and systems (e.g., Memento, SiteStory, Uninterruptible Web Service).