ACM Transactions on Information Systems 32

Editors: Jamie Callan
Standard No: ISSN 1046-8188; HF S548.125 A33
Links: Table of Contents
  1. TOIS 2014-01 Volume 32 Issue 1
  2. TOIS 2014-04 Volume 32 Issue 2
  3. TOIS 2014-06 Volume 32 Issue 3
  4. TOIS 2014-10 Volume 32 Issue 4

TOIS 2014-01 Volume 32 Issue 1

Suffix Array Construction in External Memory Using D-Critical Substrings 1
  Ge Nong; Wai Hong Chan; Sen Zhang; Xiao Feng Guan
We present a new suffix array construction algorithm that aims to build, in external memory, the suffix array for an input string of length n measured in the magnitude of tens of Giga characters over a constant or integer alphabet. The core of this algorithm is adapted from the framework of the original internal memory SA-DS algorithm that samples fixed-size d-critical substrings. This new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct a suffix array in a sequential scanning manner with good data spatial locality: data is read from or written to disk sequentially. On the assumed external-memory model with RAM capacity Ω((nB)^0.5), disk capacity O(n), and size of each I/O block B, all measured in log n-bit words, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based solution that could be further exploited to develop external-memory solutions for other suffix-array-related problems, for example, computing the longest-common-prefix array, using a modern personal computer with a typical memory configuration of 4GB RAM and a single disk.
Document Score Distribution Models for Query Performance Inference and Prediction 2
  Ronan Cummins
Modelling the distribution of document scores returned from an information retrieval (IR) system in response to a query is of both theoretical and practical importance. One of the goals of modelling document scores in this manner is the inference of document relevance. There has been renewed interest of late in modelling document scores using parameterised distributions. Consequently, a number of hypotheses have been proposed to constrain the mixture distribution from which document scores could be drawn.
   In this article, we show how a standard performance measure (i.e., average precision) can be inferred from a document score distribution using labelled data. We use the accuracy of the inference of average precision as a measure for determining the usefulness of a particular model of document scores. We provide a comprehensive study which shows that certain mixtures of distributions are able to infer average precision more accurately than others. Furthermore, we analyse a number of mixture distributions with regard to the recall-fallout convexity hypothesis and show that the convexity hypothesis is practically useful.
   Consequently, based on one of the best-performing score-distribution models, we develop some techniques for query-performance prediction (QPP) by automatically estimating the parameters of the document score-distribution model when relevance information is unknown. We present experimental results that outline the benefits of this approach to query-performance prediction.
Indexing Word Sequences for Ranked Retrieval 3
  Samuel Huston; J. Shane Culpepper; W. Bruce Croft
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance.
   In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirement for the vocabulary of the index is only logarithmically linked to the size of the vocabulary.
   Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
Cost-Aware Collaborative Filtering for Travel Tour Recommendations 4
  Yong Ge; Hui Xiong; Alexander Tuzhilin; Qi Liu
Advances in tourism economics have enabled us to collect massive amounts of travel tour data. If properly analyzed, this data could be a source of rich intelligence for providing real-time decision making and for the provision of travel tour recommendations. However, tour recommendation is quite different from traditional recommendations, because the tourist's choice is affected directly by the travel costs, which include both financial and time costs. To that end, in this article, we provide a focused study of cost-aware tour recommendation. Along this line, we first propose two ways to represent user cost preference. One way is to represent user cost preference by a two-dimensional vector. Another way is to consider the uncertainty about the cost that a user can afford and introduce a Gaussian prior to model user cost preference. With these two ways of representing user cost preference, we develop different cost-aware latent factor models by incorporating the cost information into the probabilistic matrix factorization (PMF) model, the logistic probabilistic matrix factorization (LPMF) model, and the maximum margin matrix factorization (MMMF) model, respectively. When applied to real-world travel tour data, all the cost-aware recommendation models consistently outperform existing latent factor models by a significant margin.
Learning to Recommend Descriptive Tags for Questions in Social Forums 5
  Liqiang Nie; Yi-Liang Zhao; Xiangyu Wang; Jialie Shen; Tat-Seng Chua
Around 40% of the questions in the emerging social-oriented question answering forums have at most one manually labeled tag, which is caused by incomplete question understanding or informal tagging behaviors. The incompleteness of question tags severely hinders all the tag-based manipulations, such as feeds for topic-followers, ontological knowledge organization, and other basic statistics. This article presents a novel scheme that is able to comprehensively learn descriptive tags for each question. Extensive evaluations on a representative real-world dataset demonstrate that our scheme yields significant gains for question annotation, and more importantly, the whole process of our approach is unsupervised and can be extended to handle large-scale data.

TOIS 2014-04 Volume 32 Issue 2

Efficient Index-Based Snippet Generation 6
  Hannah Bast; Marjan Celikik
Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice, on the index side in order to compute the correct result set, and on the snippet-generation side to generate the appropriate snippets; and (iii) in a worst case, the whole document needs to be scanned for occurrences of the query words, which could be problematic for very long documents.
   We present a new index-based method that localizes snippets by information solely computed from the index and that overcomes all three problems. Unlike previous index-based methods, we show how to achieve this at essentially no extra cost in query processing time, by a technique we call operator inversion. We also show how our index-based method allows the caching of individual segments instead of complete documents, which enables a significantly larger cache hit-ratio as compared to the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.
Modeling Term Associations for Probabilistic Information Retrieval 7
  Jiashu Zhao; Jimmy Xiangji Huang; Zheng Ye
Traditionally, in many probabilistic retrieval models, query terms are assumed to be independent. Although such models can achieve reasonably good performance, associations can exist among terms from a human being's point of view. There are some recent studies that investigate how to model term associations/dependencies by proximity measures. However, the modeling of term associations theoretically under the probabilistic retrieval framework is still largely unexplored. In this article, we introduce a new concept, the cross term, to model term proximity, with the aim of boosting retrieval performance. With cross terms, the association of multiple query terms can be modeled in the same way as a simple unigram term. In particular, an occurrence of a query term is assumed to have an impact on its neighboring text. The degree of the query-term impact gradually weakens with increasing distance from the place of occurrence. We use shape functions to characterize such impacts. Based on this assumption, we first propose a bigram CRoss TErm Retrieval (CRTER2) model as the basis model, and then recursively propose a generalized n-gram CRoss TErm Retrieval (CRTERn) model for n query terms, where n > 2. Specifically, a bigram cross term occurs when the corresponding query terms appear close to each other, and its impact can be modeled by the intersection of the respective shape functions of the query terms. For an n-gram cross term, we develop several distance metrics with different properties and employ them in the proposed models for ranking. We also show how to extend the language model using the newly proposed cross terms. Extensive experiments on a number of TREC collections demonstrate the effectiveness of our proposed models.
Social-Sensed Image Search 8
  Peng Cui; Shao-Wei Liu; Wen-Wu Zhu; Huan-Bo Luan; Tat-Seng Chua; Shi-Qiang Yang
Although Web search techniques have greatly facilitated users' information seeking, there are still quite a lot of search sessions that cannot provide satisfactory results, a problem that is more serious in Web image search scenarios. How to understand user intent from observed data is a fundamental issue and of paramount significance in improving image search performance. Previous research efforts mostly focus on discovering user intent either from clickthrough behavior in user search logs (e.g., Google), or from social data to facilitate vertical image search in a few limited social media platforms (e.g., Flickr). This article aims to combine the virtues of these two information sources to complement each other, that is, sensing and understanding users' interests from social media platforms and transferring this knowledge to rerank the image search results in general image search engines. Toward this goal, we first propose a novel social-sensed image search framework, where both social media and search engine are jointly considered. To effectively and efficiently leverage these two kinds of platforms, we propose an example-based user interest representation and modeling method, where we construct a hybrid graph from social media and propose a hybrid random-walk algorithm to derive the user-image interest graph. Moreover, we propose a social-sensed image reranking method to integrate the user-image interest graph from social media and search results from general image search engines to rerank the images by fusing their social relevance and visual relevance. We conducted extensive experiments on real-world data from Flickr and Google image search, and the results demonstrated that the proposed methods can significantly improve the social relevance of image search results while maintaining visual relevance well.
Theoretical, Qualitative, and Quantitative Analyses of Small-Document Approaches to Resource Selection 9
  Ilya Markov; Fabio Crestani
In a distributed retrieval setup, resource selection is the problem of identifying and ranking relevant sources of information for a given user's query. For better usage of existing resource-selection techniques, it is desirable to know what the fundamental differences between them are and in what settings one is superior to others. However, little is understood still about the actual behavior of resource-selection methods. In this work, we focus on small-document approaches to resource selection that rank and select sources based on the ranking of their documents. We pose a number of research questions and approach them by three types of analyses. First, we present existing small-document techniques in a unified framework and analyze them theoretically. Second, we propose using a qualitative analysis to study the behavior of different small-document approaches. Third, we present a novel experimental methodology to evaluate small-document techniques and to validate the results of the qualitative analysis. This way, we answer the posed research questions and provide insights about small-document methods in general and about each technique in particular.

TOIS 2014-06 Volume 32 Issue 3

Topic Modeling for Wikipedia Link Disambiguation 10
  Bradley Skaggs; Lise Getoor
Many articles in the online encyclopedia Wikipedia have hyperlinks to ambiguous article titles; these ambiguous links should be replaced with links to unambiguous articles, a process known as disambiguation. We propose a novel statistical topic model based on link text, which we refer to as the Link Text Topic Model (LTTM), that we use to suggest new link targets for ambiguous links. To evaluate our model, we describe a method for extracting ground truth for this link disambiguation task from edits made to Wikipedia in a specific time period. We use this ground truth to demonstrate the superiority of LTTM over other existing link- and content-based approaches to disambiguating links in Wikipedia. Finally, we build a web service that uses LTTM to make suggestions to human editors wanting to fix ambiguous links in Wikipedia.
LCARS: A Spatial Item Recommender System 11
  Hongzhi Yin; Bin Cui; Yizhou Sun; Zhiting Hu; Ling Chen
Newly emerging location-based and event-based social network services provide us with a new platform to understand users' preferences based on their activity history. A user can only visit a limited number of venues/events and most of them are within a limited distance range, so the user-item matrix is very sparse, which creates a big challenge to the traditional collaborative filtering-based recommender systems. The problem becomes even more challenging when people travel to a new city where they have no activity information.
   In this article, we propose LCARS, a location-content-aware recommender system that offers a particular user a set of venues (e.g., restaurants and shopping malls) or events (e.g., concerts and exhibitions) by giving consideration to both personal interest and local preference. This recommender system can facilitate people's travel not only near the area in which they live, but also in a city that is new to them. Specifically, LCARS consists of two components: offline modeling and online recommendation. The offline modeling part, called LCA-LDA, is designed to learn the interest of each individual user and the local preference of each individual city by capturing item cooccurrence patterns and exploiting item contents. The online recommendation part takes a querying user along with a querying city as input, and automatically combines the learned interest of the querying user and the local preference of the querying city to produce the top-k recommendations. To speed up the online process, a scalable query processing technique is developed by extending both the Threshold Algorithm (TA) and TA-approximation algorithm. We evaluate the performance of our recommender system on two real datasets, that is, DoubanEvent and Foursquare, and one large-scale synthetic dataset. The results show the superiority of LCARS in recommending spatial items for users, especially when traveling to new cities, in terms of both effectiveness and efficiency. Besides, the experimental analysis results also demonstrate the excellent interpretability of LCARS.
Georeferencing Wikipedia Documents Using Data from Social Media Sources 12
  Olivier Van Laere; Steven Schockaert; Vlad Tanasescu; Bart Dhoedt; Christopher B. Jones
Social media sources such as Flickr and Twitter continuously generate large amounts of textual information (tags on Flickr and short messages on Twitter). This textual information is increasingly linked to geographical coordinates, which makes it possible to learn how people refer to places by identifying correlations between the occurrence of terms and the locations of the corresponding social media objects. Recent work has focused on how this potentially rich source of geographic information can be used to estimate geographic coordinates for previously unseen Flickr photos or Twitter messages. In this article, we extend this work by analysing to what extent probabilistic language models trained on Flickr and Twitter can be used to assign coordinates to Wikipedia articles. Our results show that exploiting these language models substantially outperforms both (i) classical gazetteer-based methods (in particular, using Yahoo! Placemaker and Geonames) and (ii) language modelling approaches trained on Wikipedia alone. This supports the hypothesis that social media are important sources of geographic information, which are valuable beyond the scope of individual applications.
XXS: Efficient XPath Evaluation on Compressed XML Documents 13
  Nieves R. Brisaboa; Ana Cerdeira-Pena; Gonzalo Navarro
The eXtensible Markup Language (XML) is acknowledged as the de facto standard for semistructured data representation and data exchange on the Web and in many other scenarios. A well-known shortcoming of XML is its verbosity, which increases manipulation, transmission, and processing costs. Various structure-blind and structure-conscious compression techniques can be applied to XML, and some are even access-friendly, meaning that the documents can be efficiently accessed in compressed form. Direct access is necessary to implement the query languages XPath and XQuery, which are the standard ones to exploit the expressiveness of XML. While a good deal of theoretical and practical proposals exist to solve XPath/XQuery operations on XML, only a few are well integrated with a compression format that supports the required access operations on the XML data. In this work we go one step further and design a compression format for XML collections that boosts the performance of XPath queries on the data. This is done by designing compressed representations of the XML data that support some complex operations apart from just accessing the data, and those are exploited to solve key components of the XPath queries. Our system, called XXS, is aimed at XML collections containing natural language text, which are compressed to within 35%-50% of their original size while supporting a large subset of XPath operations in time competitive with, and many times outperforming, the best state-of-the-art systems that work on uncompressed representations.
Content-Based Video Copy Detection Benchmarking at TRECVID 14
  George Awad; Paul Over; Wessel Kraaij
This article presents an overview of the video copy detection benchmark which was run over a period of 4 years (2008-2011) as part of the TREC Video Retrieval (TRECVID) workshop series. The main contributions of the article include i) an examination of the evolving design of the evaluation framework and its components (system tasks, data, measures); ii) a high-level overview of results and best-performing approaches; and iii) a discussion of lessons learned over the four years. The content-based copy detection (CCD) benchmark worked with a large collection of synthetic queries, which is atypical for TRECVID, as was the use of a normalized detection cost framework. These particular evaluation design choices are motivated and appraised.
Trust Prediction via Belief Propagation 15
  Richong Zhang; Yongyi Mao
The prediction of trust relationships in social networks plays an important role in the analytics of the networks. Although various link prediction algorithms for general networks may be adapted for this purpose, the recent notion of "trust propagation" has been shown to effectively capture the trust-formation mechanisms and resulted in an effective prediction algorithm. This article builds on the concept of trust propagation and presents a probabilistic trust propagation model. Our model exploits the modern framework of probabilistic graphical models, more specifically, factor graphs. Under this model, the trust prediction problem can be formulated as a statistical inference problem and we derive the belief propagation algorithm as a solver for trust prediction. The model and algorithm are tested using datasets from Epinions and Ciao, by which performance advantages over the previous algorithms are demonstrated.

TOIS 2014-10 Volume 32 Issue 4

Patent Query Formulation by Synthesizing Multiple Sources of Relevance Evidence 16
  Parvaz Mahdabi; Fabio Crestani
Patent prior art search is a task in patent retrieval with the goal of finding documents which describe prior art work related to a query patent. A query patent is a full patent application composed of hundreds of terms which does not represent a single focused information need. Fortunately, other relevance evidence sources (i.e., classification tags and bibliographical data) provide additional details about the underlying information need. In this article, we propose a unified framework that integrates multiple relevance evidence components for query formulation. We first build a query model from the textual fields of a query patent. To overcome the term mismatch, we expand this initial query model with the term distribution of documents in the citation graph, modeling old and recent domain terminology. We build an IPC lexicon and perform query expansion using this lexicon incorporating proximity information. We performed an empirical evaluation on two patent datasets. Our results show that employing the temporal features of documents has a precision-enhancing effect, while query expansion using the IPC lexicon improves the recall of the final rank list.
Matrix Factorization with Explicit Trust and Distrust Side Information for Improved Social Recommendation 17
  Rana Forsati; Mehrdad Mahdavi; Mehrnoush Shamsfard; Mohamed Sarwat
With the advent of online social networks, recommender systems have become crucial for the success of many online applications/services due to their significant role in tailoring these applications to user-specific needs or preferences. Despite their increasing popularity, in general, recommender systems suffer from data sparsity and cold-start problems. To alleviate these issues, in recent years, there has been an upsurge of interest in exploiting social information such as trust relations among users along with the rating data to improve the performance of recommender systems. The main motivation for exploiting trust information in the recommendation process stems from the observation that the ideas we are exposed to and the choices we make are significantly influenced by our social context. However, in large user communities, in addition to trust relations, distrust relations also exist between users. For instance, in Epinions, the concepts of personal "web of trust" and personal "block list" allow users to categorize their friends based on the quality of reviews into trusted and distrusted friends, respectively. Hence, it will be interesting to incorporate this new source of information in recommendation as well. In contrast to the incorporation of trust information in recommendation, which is thriving, the potential of explicitly incorporating distrust relations is almost unexplored. In this article, we propose a matrix factorization-based model for recommendation in social rating networks that properly incorporates both trust and distrust relationships, aiming to improve the quality of recommendations and mitigate the data sparsity and cold-start user issues. Through experiments on the Epinions dataset, we show that our new algorithm outperforms its standard trust-enhanced or distrust-enhanced counterparts with respect to accuracy, thereby demonstrating the positive effect that incorporation of explicit distrust information can have on recommender systems.
Browse-to-Search: Interactive Exploratory Search with Visual Entities 18
  Shiyang Lu; Tao Mei; Jingdong Wang; Jian Zhang; Zhiyong Wang; Shipeng Li
With the development of image search technology, users are no longer satisfied with searching for images using just metadata and textual descriptions. Instead, more search demands are focused on retrieving images based on similarities in their contents (textures, colors, shapes, etc.). Nevertheless, one image may deliver rich or complex content and multiple interests. Sometimes users do not sufficiently define or describe their seeking demands for images even when general search interests appear, owing to a lack of specific knowledge to express their intents. A new form of information seeking activity, referred to as exploratory search, is emerging in the research community, which generally combines browsing and searching content together to help users gain additional knowledge and form accurate queries, thereby assisting the users with their seeking and investigation activities. However, there have been few attempts at addressing integrated exploratory search solutions when image browsing is incorporated into the exploring loop. In this work, we investigate the challenges of understanding users' search interests from the images being browsed and infer their actual search intentions. We develop a novel system to explore an effective and efficient way for allowing users to seamlessly switch between browse and search processes, and naturally complete visual-based exploratory search tasks. The system, called Browse-to-Search, enables users to specify their visual search interests by circling any visual objects in the webpages being browsed, and then the system automatically forms the visual entities to represent users' underlying intent. A visual entity is not limited to the original image content, but is also encapsulated by the textual-based browsing context and the associated heterogeneous attributes. We use large-scale image search technology to find the associated textual attributes from the repository.
   Users can then utilize the encapsulated visual entities to complete search tasks. The Browse-to-Search system is one of the first attempts to integrate browse and search activities for a visual-based exploratory search, which is characterized by four unique properties: (1) in session -- searching is performed during the browsing session and search results naturally accompany the browsing content; (2) in context -- the pages being browsed provide text-based contextual cues for searching; (3) in focus -- users can focus on the visual content of interest without worrying about the difficulties of query formulation, and visual entities will be automatically formed; and (4) intuitiveness -- a touch and visual search-based user interface provides a natural user experience. We deploy the Browse-to-Search system on tablet devices and evaluate the system performance using millions of images. We demonstrate that it is effective and efficient in facilitating the user's exploratory search compared to the conventional image search methods and, more importantly, provides users with more robust results to satisfy their exploring experience.
Exploiting Representations from Statistical Machine Translation for Cross-Language Information Retrieval 19
  Ferhan Ture; Jimmy Lin
This work explores how internal representations of modern statistical machine translation systems can be exploited for cross-language information retrieval. We tackle two core issues that are central to query translation: how to exploit context to generate more accurate translations and how to preserve ambiguity that may be present in the original query, thereby retaining a diverse set of translation alternatives. These two considerations are often in tension since ambiguity in natural language is typically resolved by exploiting context, but effective retrieval requires striking the right balance. We propose two novel query translation approaches: the grammar-based approach extracts translation probabilities from translation grammars, while the decoder-based approach takes advantage of n-best translation hypotheses. Both are context-sensitive, in contrast to a baseline context-insensitive approach that uses bilingual dictionaries for word-by-word translation. Experimental results show that by "opening up" modern statistical machine translation systems, we can access intermediate representations that yield high retrieval effectiveness. By combining evidence from multiple sources, we demonstrate significant improvements over competitive baselines on standard cross-language information retrieval test collections. In addition to effectiveness, the efficiency of our techniques is explored as well.
Understanding Intrinsic Diversity in Web Search: Improving Whole-Session Relevance 20
  Karthik Raman; Paul N. Bennett; Kevyn Collins-Thompson
Current research on Web search has focused on optimizing and evaluating single queries. However, a significant fraction of user queries are part of more complex tasks [Jones and Klinkner 2008] which span multiple queries across one or more search sessions [Liu and Belkin 2010; Kotov et al. 2011]. An ideal search engine would not only retrieve relevant results for a user's particular query but also be able to identify when the user is engaged in a more complex task and aid the user in completing that task [Morris et al. 2008; Agichtein et al. 2012]. Toward optimizing whole-session or task relevance, we characterize and address the problem of intrinsic diversity (ID) in retrieval [Radlinski et al. 2009], a type of complex task that requires multiple interactions with current search engines. Unlike existing work on extrinsic diversity [Carbonell and Goldstein 1998; Zhai et al. 2003; Chen and Karger 2006] that deals with ambiguity in intent across multiple users, ID queries often have little ambiguity in intent but seek content covering a variety of aspects on a shared theme. In such scenarios, the underlying needs are typically exploratory, comparative, or breadth-oriented in nature. We identify and address three key problems for ID retrieval: identifying authentic examples of ID tasks from post-hoc analysis of behavioral signals in search logs; learning to identify initiator queries that mark the start of an ID search task; and given an initiator query, predicting which content to prefetch and rank.
Cache Design of SSD-Based Search Engine Architectures: An Experimental Study 21
  Jianguo Wang; Eric Lo; Man Lung Yiu; Jiancong Tong; Gang Wang; Xiaoguang Liu
Caching is an important optimization in search engine architectures. Existing caching techniques for search engine optimization are mostly biased towards the reduction of random accesses to disks, because random accesses are known to be much more expensive than sequential accesses in traditional magnetic hard disk drives (HDDs). Recently, the solid-state drive (SSD) has emerged as a new kind of secondary storage medium, and some search engines like Baidu have already used SSDs to completely replace HDDs in their infrastructure. One notable property of an SSD is that its random access latency is comparable to its sequential access latency. Therefore, the use of SSDs to replace HDDs in a search engine infrastructure may void the cache management of existing search engines. In this article, we carry out a series of empirical experiments to study the impact of SSDs on search engine cache management. Based on the results, we give insights to practitioners and researchers on how to adapt the infrastructure and caching policies for SSD-based search engines.