
Companion Proceedings of the 2015 International Conference on the World Wide Web

Fullname:Companion Proceedings of the 24th International Conference on World Wide Web
Editors:Aldo Gangemi; Stefano Leonardi; Alessandro Panconesi
Location:Florence, Italy
Dates:2015-May-18 to 2015-May-22
Standard No:ISBN: 978-1-4503-3473-0; ACM DL: Table of Contents; hcibib: WWW15-2
  1. WWW 2015-05-18 Volume 2
    1. Posters
    2. Demonstrations
    3. WebSci Track Papers & Posters
    4. Industrial Track
    5. PhD Symposium
    6. AW4CITY 2015
    7. BigScholar 2015
    8. DAEN 2015
    9. KET 2015
    10. LiLE 2015
    11. LIME 2015
    12. LocWeb 2015
    13. MSM 2015
    14. MWA 2015
    15. NewsWWW 2015
    16. OOEW 2015
    17. RDSM 2015
    18. SAVE-SD 2015
    19. SIMPLEX 2015
    20. SocialNLP 2015
    21. SOCM 2015
    22. SWDM 2015
    23. TargetAd 2015
    24. TempWeb 2015
    25. WDS4SC 2015
    26. WebET 2015
    27. WebQuality 2015
    28. WIC 2015
    29. WSREST 2015
    30. Tutorials
    31. Workshop Summaries

WWW 2015-05-18 Volume 2


Ads Keyword Rewriting Using Search Engine Results 3-4
  Javad Azimi; Adnan Alam; Ruofei Zhang
Paid Search (PS) ads are one of the main revenue sources of online advertising companies, where the goal is to return a set of relevant ads for a query issued on a search engine website such as Bing. Typical PS algorithms return the ads whose bid keywords (BKs) are a subset of the searched query or relevant to it. However, there is a huge gap between BKs and searched queries, as a considerable number of BKs are rarely searched by users. This is mostly due to the rare BKs provided by advertisers. In this paper, we propose an approach to rewrite rare BKs into more commonly searched keywords without compromising the intent of the original BKs, which increases the coverage and depth of PS ads and thus delivers higher monetization power. In general, we first find the relevant web documents pertaining to the BKs and then extract common keywords using the web document titles and their summary snippets. Experimental results show the effectiveness of the proposed algorithm in rewriting rare BKs, consequently providing a significant improvement in recall and thereby revenue.
Abstractive Meeting Summarization Using Dependency Graph Fusion 5-6
  Siddhartha Banerjee; Prasenjit Mitra; Kazunari Sugiyama
Automatic summarization techniques on meeting conversations developed so far have been primarily extractive, resulting in poor summaries. To improve this, we propose an approach to generate abstractive summaries by fusing important content from several utterances. Any meeting is generally comprised of several discussion topic segments. For each topic segment within a meeting conversation, we aim to generate a one sentence summary from the most important utterances using an integer linear programming-based sentence fusion approach. Experimental results show that our method can generate more informative summaries than the baselines.
Towards Semantic Retrieval of Hashtags in Microblogs 7-8
  Piyush Bansal; Somay Jain; Vasudeva Varma
On various microblogging platforms like Twitter, users post short text messages ranging from news and information to thoughts and daily chatter. These messages often contain keywords called Hashtags, which are semantico-syntactic constructs that enable topical classification of the microblog posts. In this poster, we propose and evaluate a novel method of semantic enrichment of microblogs for a particular type of entity search -- retrieving a ranked list of the top-k hashtags relevant to a user's query Q. Such a list can help users track posts of general interest to them. We show that our technique significantly improved microblog retrieval as well. We tested our approach on the publicly available Stanford sentiment analysis tweet corpus. We observed an improvement of more than 10% in NDCG for the microblog retrieval task, and around 11% in mean average precision for the hashtag retrieval task.
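For readers unfamiliar with the NDCG metric reported in the abstract above, a minimal sketch follows. It assumes the common formulation with linear gain and a log2 rank discount; the abstract does not say which variant the authors used, and the function names and relevance grades are illustrative only.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    cutoff = len(relevances) if k is None else k
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:cutoff])
    return dcg(relevances[:cutoff]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades of the hashtags returned for one query, in ranked order.
print(ndcg([1, 2, 3]))
```

A ranking that already lists items in decreasing relevance scores 1.0; any other order scores lower.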
Modeling and Predicting Popularity Dynamics of Microblogs using Self-Excited Hawkes Processes 9-10
  Peng Bao; Hua-Wei Shen; Xiaolong Jin; Xue-Qi Cheng
The ability to model and predict the popularity dynamics of individual user-generated items on online media has important implications in a wide range of areas. In this paper, we propose a probabilistic model using a Self-Excited Hawkes Process (SEHP) to characterize the process through which individual microblogs gain their popularity. This model explicitly captures the triggering effect of each forwarding, distinguishing itself from the reinforced Poisson process based model, where all previous forwardings are simply aggregated into a single triggering effect. We validate the proposed model by applying it to Sina Weibo, the most popular microblogging network in China. Experimental results demonstrate that the SEHP model consistently outperforms the model based on the reinforced Poisson process.
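The idea that each forwarding contributes its own triggering effect can be sketched with the textbook conditional intensity of a self-exciting Hawkes process. The exponential decay kernel and the parameter names `mu`, `alpha`, and `beta` are assumptions for illustration; the abstract does not specify the kernel or parameters used in the paper.

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Baseline rate mu plus one exponentially decaying term per earlier event.

    Assumed kernel: alpha * exp(-beta * (t - t_i)) for every event time t_i < t.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t)

# Hypothetical forwarding times (in hours) of one microblog.
forwardings = [0.5, 1.0, 1.2]
print(hawkes_intensity(2.0, forwardings))
```

Every past forwarding adds a separate term to the sum, so recent forwardings raise the expected rate of new ones more than old forwardings do.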
Evaluating User Targeting Policies: Simulation Based on Randomized Experiment Data 11-12
  Joel Barajas; Ram Akella; Marius Holtan
We propose a user targeting simulator for online display advertising. Based on the response of 37 million visiting users (targeted and non-targeted) and their features, we simulate different user targeting policies. We provide evidence that the standard conversion optimization policy shows effectiveness similar to that of random targeting, and is significantly inferior to other causally optimized targeting policies.
A Comparison of Supervised Keyphrase Extraction Models 13-14
  Florin Bulgarov; Cornelia Caragea
Keyphrases for a document provide a high-level topic description of the document. Given the number of documents growing exponentially on the Web in the past years, accurate methods for extracting keyphrases from such documents are greatly needed. In this study, we provide a comparison of existing supervised approaches to this task to determine the current best performing model. We use research articles on the Web as the case study.
ControVol: Let Yesterday's Data Catch Up with Today's Application Code 15-16
  Thomas Cerqueus; Eduardo Cunha de Almeida
In building software-as-a-service applications, a flexible development environment is key to shipping early and often. Therefore, schema-flexible data stores are becoming more and more popular. They can store data with heterogeneous structure, allowing for new releases to be pushed frequently, without having to migrate legacy data first. However, the current application code must continue to work with any legacy data that has already been persisted in production. To let legacy data structurally "catch up" with the latest application code, developers commonly employ object mapper libraries with life-cycle annotations. Yet when used without caution, they can cause runtime errors and even data loss. We present ControVol, an IDE plugin that detects evolutionary changes to the application code that are incompatible with legacy data. ControVol warns developers already at development time, and even suggests automatic fixes for lazily migrating legacy data when it is loaded into the application. Thus, ControVol ensures that the structure of legacy data can catch up with the structure expected by the latest software release.
Dataset Descriptions for Optimizing Federated Querying 17-18
  Angelos Charalambidis; Stasinos Konstantopoulos; Vangelis Karkaletsis
Dataset description vocabularies focus on provenance, versioning, licensing, and similar metadata. VoID is a notable exception, providing some expressivity for describing subsets and their contents and can, to some extent, be used for discovering relevant resources and for optimizing querying. In this poster we describe an extension of VoID that provides the expressivity needed in order to support the query planning methods typically used in federated querying.
Online Learning to Rank: Absolute vs. Relative 19-20
  Yiwei Chen; Katja Hofmann
Online learning to rank holds great promise for learning personalized search result rankings. First algorithms have been proposed, namely absolute feedback approaches, based on contextual bandits learning; and relative feedback approaches, based on gradient methods and inferred preferences between complete result rankings. Both types of approaches have shown promise, but they have not previously been compared to each other. It is therefore unclear which type of approach is the most suitable for which online learning to rank problems. In this work we present the first empirical comparison of absolute and relative online learning to rank approaches.
Mouse Clicks Can Recognize Web Page Visitors! 21-22
  Daniela Chuda; Peter Kratky; Jozef Tvarozek
Behavioral biometrics based on mouse usage can be used to recognize one's identity, with special applications in anonymous Web browsing. Out of the many features that describe browsing behavior, mouse clicks (or touches), as the most basic of navigation actions, provide a stable stream of behavioral data. The paper describes a method to recognize a Web user according to three click features. The distance-based classification, which compares cumulative distribution functions, achieves high recognition accuracy even with hundreds of users.
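A distance-based classifier over cumulative distribution functions can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it uses a single hypothetical feature (click duration in seconds), the Kolmogorov-Smirnov statistic as the distance, and made-up user profiles, whereas the abstract mentions three click features without naming the distance.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function x -> fraction of sample values that are <= x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n

def cdf_distance(a, b):
    """Largest vertical gap between two empirical CDFs (Kolmogorov-Smirnov statistic)."""
    fa, fb = ecdf(a), ecdf(b)
    # For step functions the largest gap is reached at one of the observed values.
    return max(abs(fa(x) - fb(x)) for x in set(a) | set(b))

def recognize(profiles, observed):
    """Return the user whose stored sample is closest to the observed sample."""
    return min(profiles, key=lambda user: cdf_distance(profiles[user], observed))

# Hypothetical stored click durations per known user.
profiles = {"alice": [0.10, 0.12, 0.11, 0.13], "bob": [0.30, 0.35, 0.32, 0.31]}
print(recognize(profiles, [0.11, 0.12, 0.10]))
```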
Geo Data Annotator: a Web Framework for Collaborative Annotation of Geographical Datasets 23-24
  Stefano Cresci; Davide Gazzè; Angelica Lo Duca; Andrea Marchetti; Maurizio Tesconi
In this paper we illustrate the Geo Data Annotator (GDA), a framework that helps a user build a ground-truth dataset starting from two separate geographical datasets. GDA exploits two kinds of indices to ease the task of manual annotation: geographical-based and string-based. GDA also provides a mechanism to evaluate the quality of the built ground-truth dataset. This is achieved through a collaborative platform, which allows many users to work on the same project. The quality evaluation is based on annotator agreement, measured with the Fleiss' kappa statistic.
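The Fleiss' kappa statistic mentioned above has a standard definition, sketched here. The input layout (one row per annotated item, one column per category, each cell holding the number of annotators who chose that category) and the sample counts are assumptions for illustration; how GDA stores annotations is not described in the abstract.

```python
import math

def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of annotators assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return math.nan  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 3 annotators label 2 candidate pairs as "match" / "no match".
print(fleiss_kappa([[3, 0], [0, 3]]))
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean agreement below chance.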
Online View Maintenance for Continuous Query Evaluation 25-26
  Soheila Dehghanzadeh; Alessandra Mileo; Daniele Dell'Aglio; Emanuele Della Valle; Shen Gao; Abraham Bernstein
In Web stream processing, there are queries that integrate Web data of various velocity, categorized broadly as streaming (i.e., fast changing) and background (i.e., slow changing) data. The introduction of local views on the background data speeds up the query answering process, but requires maintenance processes to keep the replicated data up-to-date. In this work, we study the problem of maintaining local views in a Web setting, where background data are usually stored remotely, are exposed through services with constraints on the data access (e.g., invocation rate limits and data access patterns) and, contrary to the database setting, do not provide streams with changes over their content. Then, we propose an initial solution: WBM, a method to maintain the content of the view with regards to query and user-defined constraints on accuracy and responsiveness.
FedWeb Greatest Hits: Presenting the New Test Collection for Federated Web Search 27-28
  Thomas Demeester; Dolf Trieschnigg; Dong Nguyen; Djoerd Hiemstra; Ke Zhou
This paper presents 'FedWeb Greatest Hits', a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.
Hate Speech Detection with Comment Embeddings 29-30
  Nemanja Djuric; Jing Zhou; Robin Morris; Mihajlo Grbovic; Vladan Radosavljevic; Narayan Bhamidipati
We address the problem of hate speech detection in online user comments. Hate speech, defined as an "abusive speech targeting specific group characteristics, such as ethnicity, religion, or gender", is an important problem plaguing websites that allow users to leave feedback, having a negative impact on their online business and overall user experience. We propose to learn distributed low-dimensional representations of comments using recently proposed neural language models, that can then be fed as inputs to a classification algorithm. Our approach addresses issues of high-dimensionality and sparsity that impact the current state-of-the-art, resulting in highly efficient and effective hate speech detectors.
What's Hot in The Theme: Query Dependent Emerging Topic Extraction from Social Streams 31-32
  Yuki Endo; Hiroyuki Toda; Yoshimasa Koike
Analyzing emerging topics from social media enables users to get an overview of social movements and several web services to adopt current trends. Although existing studies mainly focus on extracting global emerging topics, efficient extraction of local ones related to a specific theme is still a challenging and unavoidable problem in social media analysis. We focus on extracting emerging social topics related to user-specified query words, and propose an extraction framework that uses a non-negative matrix factorization (NMF) modified for detecting temporal concentration and reducing noise. We conduct preliminary experiments to verify our method using a Twitter dataset.
Fast Search for Distance Dependent Chinese Restaurant Processes 33-34
  Weiwei Feng; Peng Wang; Chuan Zhou; Li Guo; Peng Zhang
The distance dependent Chinese Restaurant Process (dd-CRP), a nonparametric Bayesian model, can model distance sensitive data. Existing inference algorithms for dd-CRP, such as Markov Chain Monte Carlo (MCMC) and variational algorithms, are inefficient and unable to handle massive online data, because posterior distributions of dd-CRP are not marginal invariant. To solve this problem, we present a fast inference algorithm for dd-CRP based on A-star search. Experimental results show that the new search algorithm is faster than existing dd-CRP inference algorithms with comparable results.
ASIM: A Scalable Algorithm for Influence Maximization under the Independent Cascade Model 35-36
  Sainyam Galhotra; Akhil Arora; Srinivas Virinchi; Shourya Roy
The steady growth of graph data from social networks has resulted in wide-spread research in finding solutions to the influence maximization problem. Although TIM is one of the fastest existing algorithms, it cannot be deemed scalable owing to its exorbitantly high memory footprint. In this paper, we address the scalability aspect -- memory consumption and running time -- of the influence maximization problem. We propose ASIM, a scalable algorithm capable of running within practical compute times on commodity hardware. Empirically, ASIM is 6-8 times faster when compared to CELF++ with similar memory consumption, while its memory footprint is ≈200 times smaller when compared to TIM.
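For context, the independent cascade model under which influence is measured can be simulated in a few lines. This sketch shows only the diffusion model, not ASIM, TIM, or CELF++; the uniform edge probability `prob`, the adjacency-list layout, and the sample graph are assumptions made for illustration.

```python
import random

def simulate_cascade(graph, seeds, prob, rng):
    """One run of the independent cascade model.

    Each newly activated node gets a single chance to activate each inactive
    out-neighbor, succeeding independently with probability prob.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Hypothetical follower graph: an edge u -> v means u can influence v.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
runs = [len(simulate_cascade(graph, ["a"], 0.5, random.Random(i))) for i in range(1000)]
print(sum(runs) / len(runs))  # Monte Carlo estimate of the expected spread of seed "a"
```

Influence maximization asks for the seed set of a given size with the largest expected spread; repeating such simulations for many candidate seed sets is what makes naive approaches expensive.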
Search Retargeting using Directed Query Embeddings 37-38
  Mihajlo Grbovic; Nemanja Djuric; Vladan Radosavljevic; Narayan Bhamidipati
Determining the user audience for online ad campaigns is a critical problem for companies competing in the online advertising space. One of the most popular strategies is search retargeting, which involves targeting users that issued search queries related to an advertiser's core business, commonly specified by advertisers themselves. However, advertisers often fail to include many relevant queries, which results in suboptimal campaigns and negatively impacts revenue for both advertisers and publishers. To address this issue, we use recently proposed neural language models to learn low-dimensional, distributed query embeddings, which can be used to expand query lists with related queries through simple nearest neighbor searches in the embedding space. Experiments on a real-world data set strongly suggest the benefits of the approach.
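The query expansion step described above, a nearest neighbor search in the embedding space, can be sketched as follows. The two-dimensional vectors are hand-made placeholders rather than learned embeddings, and cosine similarity is an assumed choice; the abstract does not name the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_queries(seed, embeddings, k=2):
    """Return the k queries whose embeddings are most similar to the seed query's."""
    target = embeddings[seed]
    others = [q for q in embeddings if q != seed]
    others.sort(key=lambda q: cosine(target, embeddings[q]), reverse=True)
    return others[:k]

# Hypothetical query embeddings (a real model would produce far more dimensions).
embeddings = {
    "cheap flights": [0.90, 0.10],
    "flight deals": [0.85, 0.15],
    "airline tickets": [0.80, 0.20],
    "pizza delivery": [0.10, 0.90],
}
print(expand_queries("cheap flights", embeddings))
```

An advertiser's seed list would be expanded by running this lookup for each seed query and merging the results.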
Identifying Successful Investors in the Startup Ecosystem 39-40
  Srishti Gupta; Robert Pienta; Acar Tamersoy; Duen Horng Chau; Rahul C. Basole
Who can spot the next Google, Facebook, or Twitter? Who can discover the next billion-dollar startups? Measuring investor success is a challenging task, as investment strategies can vary widely. We propose InvestorRank, a novel method for identifying successful investors by analyzing how an investor's collaboration network changes over time. InvestorRank captures the intuition that a successful investor achieves increasing success in spotting great startups, or is able to keep doing so persistently. Our results show potential in discovering relatively unknown investors that may be the success stories of tomorrow.
Towards Serving "Delicious" Information within Its Freshness Date 41-42
  Hao Han; Takashi Nakayama; Junxia Guo; Keizo Oyama
Like the freshness date of food, Web information also has its "shelf life". In this paper, we explore how the shelf life of information is reflected in browsing behaviors. Our analysis shows that the satisfaction of browsing behavior changes if search engines take the shelf life of information into account.
FluTCHA: Using Fluency to Distinguish Humans from Computers 43-44
  Kotaro Hara; Mohammad T. Hajiaghayi; Benjamin B. Bederson
Improvements in image understanding technologies are making it possible for computers to pass traditional CAPTCHA tests with high probability. This suggests the need for new kinds of tasks that are easy to accomplish for humans but remain difficult for computers. In this paper, we introduce Fluency CAPTCHA (FluTCHA), a novel method to distinguish humans from computers using the fact that humans are better than machines at improving the fluency of sentences. We propose a way to let users work on FluTCHA tests and simultaneously complete useful linguistic tasks. Evaluation demonstrates the feasibility of using FluTCHA to distinguish humans from computers.
Online Event Recommendation for Event-based Social Networks 45-46
  Xiancai Ji; Zhi Qiao; Mingze Xu; Peng Zhang; Chuan Zhou; Li Guo
With the rapid growth of event-based social networks, the demand for event recommendation is becoming increasingly important. However, existing event recommendation approaches learn in a batch fashion. Such approaches are impractical for real-world recommender systems, where training data often arrive sequentially. Hence, we present an online event recommendation method. Experimental results on several real-world datasets demonstrate the utility of our method.
Entity-driven Type Hierarchy Construction for Freebase 47-48
  Jyun-Yu Jiang; Pu-Jen Cheng; Chin-Yew Lin
The hierarchical structure of a knowledge base system can lead to various valuable applications; however, many knowledge base systems do not have such a property. In this paper, we propose an entity-driven approach to automatically construct the hierarchical structure of entities for knowledge base systems. By deriving type dependencies from entity information, an initial graph of types is constructed and then modified into a hierarchical structure by several graph algorithms. Experimental results show the effectiveness of our method in terms of constructing a reasonable type hierarchy for knowledge base systems.
Multi-Aspect Collaborative Filtering based on Linked Data for Personalized Recommendation 49-50
  Han-Gyu Ko; Joo-Sik Son; In-Young Ko
Since users often consider more than one aspect when they choose an item, related research introduced multi-criteria recommender systems and showed that multi-criteria ratings add value to existing CF-based recommender systems by providing more accurate recommendation results to users. However, all previous works require multi-criteria ratings given explicitly by users, while most existing datasets, such as Netflix and MovieLens, contain only single-criterion ratings. Therefore, to take advantage of multi-criteria recommendation, there must be a way to extract the necessary aspects and analyze users' preferences on those aspects from a given single-criterion dataset. In this paper, we propose an approach that utilizes semantic information of items to extract essential aspects and performs multi-aspect collaborative filtering to recommend items to users in a personalized manner.
Extracting Taxonomies from Bipartite Graphs 51-52
  Tobias Kötter; Stephan Günnemann; Christos Faloutsos; Michael R. Berthold
Given a large bipartite graph that represents objects and their properties, how can we automatically extract semantic information that provides an overview of the data and -- at the same time -- enables us to drill down to specific parts for an in-depth analysis? In this work-in-progress paper, we propose extracting a taxonomy that models the relation between the properties via an is-a hierarchy. The extracted taxonomy arranges the properties from general to specific, providing different levels of abstraction.
Tweet-Recommender: Finding Relevant Tweets for News Articles 53-54
  Ralf Krestel; Thomas Werkmeister; Timur Pratama Wiradarma; Gjergji Kasneci
Twitter has become a prime source for disseminating news and opinions. However, the length of tweets prohibits detailed descriptions; instead, tweets sometimes contain URLs that link to detailed news articles. In this paper, we devise generic techniques for recommending tweets for any given news article. To evaluate and compare the different techniques, we collected tens of thousands of tweets and news articles and conducted a user study on the relevance of recommendations.
Temporality in Online Food Recipe Consumption and Production 55-56
  Tomasz Kusmierczyk; Christoph Trattner; Kjetil Nørvåg
In this paper, we present work in progress from a recently started research effort that aims at understanding the hidden temporal dynamics in online food communities. In this context, we have mined and analyzed temporal patterns in terms of recipe production and consumption in a large German community platform. As our preliminary results reveal, there is indeed a range of hidden temporal patterns in terms of food preferences and in particular in consumption and production. We believe that this kind of research can be important for future work in personalized Web-based information access and in particular recommender systems.
Finding the Differences between the Perceptions of Experts and the Public in the Field of Diabetes 57-58
  Dahee Lee; Won Chul Kim; Min Song
Automatic information extraction techniques such as named entity recognition and relation extraction have been developed, but they are still rarely applied across different document types. In this paper, we applied them to academic literature and social media content in the field of diabetes to find distinctions between the perceptions of biomedical experts and the public. We analyzed and compared the experts' and the public's networks constituted by the extracted entities and relations. The results confirmed that there are some differences in their views, i.e., in the biomedical entities that interest them and in the relations within their knowledge range.
A Graph-Based Recommendation Framework for Price-Comparison Services 59-60
  Sang-Chul Lee; Sang-Wook Kim; Sunju Park
In this paper, we propose a set of recommendation strategies and develop a graph-based framework for recommendation in online price-comparison services. We verify the superiority of the proposed framework by comparing it with existing methods using real-world data.
A Descriptive Analysis of a Large-Scale Collection of App Management Activities 61-62
  Huoran Li; Xuanzhe Liu; Wei Ai; Qiaozhu Mei; Feng Feng
Smartphone users have adopted an increasing number of mobile applications (a.k.a. apps) in recent years. Investigating how people manage mobile apps in their everyday lives creates a unique opportunity to understand the behaviors and preferences of mobile users. Existing literature provides a very limited understanding of app management activities, due to the lack of user behavioral data at scale. This paper analyzes a very large collection of app management logs from the users of a leading Android app marketplace in China. The data set covers one month of detailed activities of how users download, update, and uninstall the apps on their smart devices, involving 8,306,181 anonymized users and 394,661 apps. We characterize how these users manage the apps on their devices and identify behavioral patterns that correlate with users' online ratings of the apps.
Feature Selection for Sentiment Classification Using Matrix Factorization 63-64
  Jiguang Liang; Xiaofei Zhou; Li Guo; Shuo Bai
Feature selection is a critical task in both sentiment classification and topical text classification. However, most existing feature selection algorithms ignore a significant contextual difference between them: sentiment classification commonly depends more on the words conveying sentiments. Based on this observation, a new feature selection method based on matrix factorization is proposed to identify the words with strong inter-sentiment distinguishability and intra-sentiment similarity. Furthermore, experiments show that our models require fewer features while still maintaining reasonable classification accuracy.
Inferring and Exploiting Categories for Next Location Prediction 65-66
  Ankita Likhyani; Deepak Padmanabhan; Srikanta Bedathur; Sameep Mehta
Predicting the next location of a user based on their previous visiting pattern is one of the primary tasks over data from location based social networks (LBSNs) such as Foursquare. Many different aspects of these so-called "check-in" profiles of a user have been used in this task, including spatial and temporal information of check-ins as well as the social network information of the user. Building more sophisticated prediction models by enriching check-in data with information from other sources is challenging, because LBSNs expose only limited data owing to privacy concerns. In this paper, we propose a framework that combines location data from LBSNs with data from maps to associate a set of venue categories with these locations. For example, if the user is found to be checking in at a mall that has cafes, cinemas and restaurants according to the map, all of this information is associated with that location. This category information is then leveraged to predict the next check-in location of the user. Our experiments with a publicly available check-in dataset show that this approach improves on the state-of-the-art methods for location prediction.
A Word Vector and Matrix Factorization Based Method for Opinion Lexicon Extraction 67-68
  Zheng Lin; Weiping Wang; Xiaolong Jin; Jiguang Liang; Dan Meng
Automatic opinion lexicon extraction has attracted lots of attention and many methods have thus been proposed. However, most existing methods depend on dictionaries (e.g., WordNet), which confines their applicability. For instance, the dictionary based methods are unable to find domain dependent opinion words, because the entries in a dictionary are usually domain-independent. There also exist corpus-based methods that directly extract opinion lexicons from reviews. However, they heavily rely on sentiment seed words that have limited sentiment information and the context information has not been fully considered. To overcome these problems, this paper presents a word vector and matrix factorization based method for automatically extracting opinion lexicons from reviews of different domains and further identifying the sentiment polarities of the words. Experiments on real datasets demonstrate that the proposed method is effective and performs better than the state-of-the-art methods.
Collaborative Datasets Retrieval for Interlinking on Web of Data 69-70
  Haichi Liu; Jintao Tang; Dengping Wei; Peilei Liu; Hong Ning; Ting Wang
Dataset interlinking is a very important problem in Linked Data. In this paper, we consider this problem from the perspective of information retrieval and propose a learning to rank based framework, which combines various similarity measures to retrieve the relevant datasets for a given dataset. Specifically, inspired by the idea of collaborative filtering, an effective similarity measure called collaborative similarity is proposed. Experimental results show that the collaborative similarity measure is effective for dataset interlinking, and that the learning to rank based framework can significantly increase the performance.
Contextual Query Intent Extraction for Paid Search Selection 71-72
  Pengqi Liu; Javad Azimi; Ruofei Zhang
Paid Search algorithms play an important role in online advertising, where a set of related ads is returned based on a searched query. Paid Search algorithms mostly consist of two main steps. First, a given searched query is converted to different sub-queries or similar phrases which preserve the core intent of the query. Second, the generated sub-queries are matched to the ads' bid keywords in the data set, and a set of ads with the highest utility, measuring relevance to the original query, is returned. The focus of this paper is optimizing the first step by proposing a contextual query intent extraction algorithm that generates sub-queries online which best preserve the intent of the original query. Experimental results over a very large real-world data set demonstrate the superb performance of the proposed approach in optimizing both relevance and monetization metrics compared with one of the existing successful algorithms in our system.
Towards Hierarchies of Search Tasks & Subtasks 73-74
  Rishabh Mehrotra; Emine Yilmaz
Current search systems do not provide adequate support for users tackling complex tasks, so the cognitive burden of keeping track of such tasks is placed on the searcher. As opposed to recent approaches to search task extraction, a more naturalistic viewpoint would involve viewing query logs as hierarchies of tasks, with each search task being decomposed into more focussed sub-tasks. In this work, we propose an efficient Bayesian nonparametric model for extracting hierarchies of such tasks & subtasks. The proposed approach makes use of the multi-relational aspect of query associations, which is important in identifying query-task associations. We describe a greedy agglomerative model selection algorithm based on the Gamma-Poisson conjugate mixture that takes just one pass through the data to learn a fully probabilistic, hierarchical model of trees that is capable of learning trees with arbitrary branching structures, as opposed to the more common binary structured trees. We evaluate our method on real-world query log data using query term prediction. To the best of our knowledge, this work is the first to consider hierarchies of search tasks and subtasks.
On Topology of Baidu's Association Graph Based on General Recommendation Engine and Users' Behavior 75-76
  Cong Men; Wanwan Tang; Po Zhang; Junqi Hou
To better meet users' underlying navigational requirements, search engines like Baidu have developed general recommendation engines and provide related entities on the right side of the search engine results page (SERP). However, users' behavior after the association of individual queries in a search engine has not been well investigated. To better understand users' navigational activities, we propose a new method that maps users' behavior to an association graph and performs graph analysis. Interesting properties like clustering and assortativity are found in this association graph. This study provides a new perspective on research into semantic networks and users' navigational behavior on the SERP.
Join Size Estimation on Boolean Tensors of RDF Data BIBAFull-Text 77-78
  Saskia Metzler; Pauli Miettinen
The Resource Description Framework (RDF) represents information as subject -- predicate -- object triples. These triples are commonly interpreted as a directed labelled graph. We instead interpret the data as a 3-way Boolean tensor. Standard SPARQL queries then can be expressed using elementary Boolean algebra operations. We show how this representation helps to estimate the size of joins. Such estimates are valuable for query handling and our approach might yield more efficient implementations of SPARQL query processors.
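As a rough illustration of the tensor view (not the authors' estimator, which the abstract does not describe), the sketch below treats each triple as a nonzero entry of a Boolean tensor indexed by (subject, predicate, object). It computes the exact size of the two-pattern join `?x p1 ?y . ?y p2 ?z` from the marginal counts of the two predicate slices. The toy triples and the helper names `join_size` and `join_results` are hypothetical.

```python
from collections import Counter

# Toy RDF data (hypothetical). Each triple is one nonzero entry
# T[s, p, o] = 1 of a 3-way Boolean tensor.
triples = {
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("dave", "knows", "bob"),
    ("bob", "worksAt", "acme"),
    ("bob", "worksAt", "initech"),
    ("carol", "worksAt", "acme"),
}


def join_size(triples, p1, p2):
    """Exact number of results of the join  ?x p1 ?y . ?y p2 ?z.

    For each join value y the result count is
    (column sum of slice p1 at y) * (row sum of slice p2 at y).
    """
    col_sums = Counter(o for s, p, o in triples if p == p1)  # objects of p1
    row_sums = Counter(s for s, p, o in triples if p == p2)  # subjects of p2
    return sum(col_sums[y] * row_sums[y] for y in col_sums)


def join_results(triples, p1, p2):
    """Materialise the join; used here only to cross-check join_size."""
    return {
        (x, y, z)
        for x, pa, y in triples if pa == p1
        for y2, pb, z in triples if pb == p2 and y2 == y
    }


size = join_size(triples, "knows", "worksAt")
print(size, len(join_results(triples, "knows", "worksAt")))
```

Working from slice marginals avoids materialising the join. An estimator would presumably replace these exact marginals with cheaper summaries, but that step is an assumption here.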
Navigation Leads Selection Considering Navigational Value of Keywords BIBAFull-Text 79-80
  Robert Moro; Maria Bielikova
Searching a vast information space such as the Web is a challenging task, even more so if the domain is unknown and the task is thus exploratory in nature. We have proposed a method of exploratory navigation based on navigation leads, i.e. terms that help users to filter the information space of a digital library. In this paper, we focus on the selection of the leads considering their navigational value. We employ clustering based on topic modeling using LDA (Latent Dirichlet Allocation). We present results of a preliminary evaluation on the Annota dataset containing more than 50,000 research papers.
A Recommender System for Connecting Patients to the Right Doctors in the HealthNet Social Network BIBAFull-Text 81-82
  Fedelucio Narducci; Cataldo Musto; Marco Polignano; Marco de Gemmis; Pasquale Lops; Giovanni Semeraro
In this work we present a semantic recommender system able to suggest doctors and hospitals that best fit a specific patient profile. The recommender system is the core component of the social network named HealthNet (HN). The recommendation algorithm first computes similarities among patients, and then generates a ranked list of doctors and hospitals suitable for a given patient profile, by exploiting health data shared by the community. Accordingly, the HN user can find her most similar patients, see how they treated their diseases, and receive suggestions for solving her problem. Currently, the alpha version of HN is available only for Italian users, but in the near future we want to extend the platform to other languages. We organized three focus groups with patients, practitioners, and health organizations in order to obtain comments and suggestions. All of them proved to be very enthusiastic about using the HN platform.
The Importance of Pronouns to Sentiment Analysis: Online Cancer Survivor Network Case Study BIBAFull-Text 83-84
  Nir Ofek; Lior Rokach; Cornelia Caragea; John Yen
Online health communities are a major source for patients and their informal caregivers in the process of gathering information and seeking social support. The Cancer Survivors Network of the American Cancer Society has many users and presents a large number of user interactions with regard to coping with cancer. Sentiment analysis is an important process in understanding members' needs and concerns and the impact of users' responses on other members. It aims to determine the participants' subjective attitude and reflect their emotions. Analyzing the sentiment of posts in online health communities enables the investigation of various factors, such as what affects sentiment change, and the discovery of sentiment change patterns. Since each writer has his or her own personality and temporal emotional state, behavioral traits can be reflected in the writer's writing style. Pronouns are function words which often convey some unique styling patterns into the texts. Drawing on a lexical approach to emotions, we conduct factor analysis on the use of pronouns in self-description texts. Our analysis shows that the usage of pronouns has an effect on sentiment classification. Moreover, we evaluated the use of pronouns in our domain, and found it different from standard English usage.
A Semantic Hybrid Approach for Sound Recommendation BIBAFull-Text 85-86
  Vito Claudio Ostuni; Tommaso Di Noia; Eugenio Di Sciascio; Sergio Oramas; Xavier Serra
In this work we describe a hybrid recommendation approach for recommending sounds to users by exploiting and semantically enriching textual information such as tags and sound descriptions. As a case study we used Freesound, a popular site for sharing sound samples which counts more than 4 million registered users. Tags and textual sound descriptions are exploited to extract and link entities to external ontologies such as WordNet and DBpedia. The enriched data are eventually merged with a domain-specific tagging ontology to form a knowledge graph. Based on this graph, recommendations are then computed using a semantic version of the feature combination hybrid approach. An evaluation on historical data shows improvements with respect to state-of-the-art collaborative algorithms.
Exploring Communities for Effective Location Prediction BIBAFull-Text 87-88
  Jun Pang; Yang Zhang
Humans are social animals: they interact with different communities to conduct different activities. The literature has shown that human mobility is constrained by social relations. In this work, we investigate the social impact of a user's communities on his mobility in order to conduct location prediction effectively. Through analysis of a real-life dataset, we demonstrate that (1) a user is influenced more by his communities than by all his friends; (2) his mobility is influenced only by a small subset of his communities; (3) influence from communities depends on social contexts. We further exploit an SVM to predict a user's future location based on his community information. Experimental results show that the model based on communities leads to more effective predictions than the one based on friends.
Investigating Factors Affecting Personal Data Disclosure BIBAFull-Text 89-90
  Christos Perentis; Michele Vescovi; Bruno Lepri
Mobile devices, sensors and social networks have dramatically increased the collection and sharing of personal and contextual information of individuals. Hence, users constantly make disclosure decisions on the basis of a difficult trade-off between using services and data protection. Understanding the factors linked to the disclosure behavior of personal information is a step forward to assist users in their decisions. In this paper, we model the disclosure of personal information and investigate its relationships not only with demographic and self-reported individual characteristics, but also with real behavior inferred from mobile phone usage. Preliminary results show that real behavior captured from mobile data relates to actual sharing behavior, providing the basis for future predictive models.
Exact Age Prediction in Social Networks BIBAFull-Text 91-92
  Bryan Perozzi; Steven Skiena
Predicting accurate demographic information about the users of information systems is a problem of interest in personalized search, ad targeting, and other related fields. Despite such broad applications, most existing work only considers age prediction as one of classification, typically into only a few broad categories.
   Here, we consider the problem of exact age prediction in social networks as one of regression. Our proposed method learns social representations which capture community information for use as covariates. In our preliminary experiments on a large real-world social network, it can predict age within 4.15 years on average, strongly outperforming standard network regression techniques when labeled data is sparse.
Aligning Multi-Cultural Knowledge Taxonomies by Combinatorial Optimization BIBAFull-Text 93-94
  Natalia Prytkova; Gerhard Weikum; Marc Spaniol
Large collections of digital knowledge have become valuable assets for search and recommendation applications. The taxonomic type systems of such knowledge bases are often highly heterogeneous, as they reflect different cultures, languages, and intentions of usage. We present a novel approach to the problem of multi-cultural knowledge alignment, which maps each node of a source taxonomy onto a ranked list of most suitable nodes in the target taxonomy. We model this task as combinatorial optimization problems, using integer linear programming and quadratic programming. The quality of the computed alignments is evaluated using large heterogeneous taxonomies of book categories.
Exploring Heterogeneity for Multi-Domain Recommendation with Decisive Factors Selection BIBAFull-Text 95-96
  Shuang Qiu; Jian Cheng; Xi Zhang; Hanqing Lu
To address recommendation problems in multi-domain scenarios, we propose a novel method, HMRec, which models both the consistency and the heterogeneity of users' multiple behaviors in a unified framework. Moreover, our approach also captures the decisive factors of each domain. Experiments on a real multi-domain dataset demonstrate the effectiveness of our model.
Crossing the Boundaries of Communities via Limited Link Injection for Information Diffusion In Social Networks BIBAFull-Text 97-98
  Dimitrios Rafailidis; Alexandros Nanopoulos
We propose a new link-injection method aiming at boosting the overall diffusion of information in social networks. Our approach is based on a diffusion-coverage score of the ability of each user to spread information over the network. Candidate links for injection are identified by a matrix factorization technique and link injection is performed by attaching links to users according to their score. We additionally perform clustering to identify communities in order to inject links that cross the boundaries of such communities. In our experiments with five real-world networks, we demonstrate that our method can significantly increase information diffusion by performing limited link injection, which is essential to real-world applications.
Repeat Consumption Recommendation Based on Users Preference Dynamics and Side Information BIBAFull-Text 99-100
  Dimitrios Rafailidis; Alexandros Nanopoulos
We present a Coupled Tensor Factorization model to recommend items with repeat consumption over time. We introduce a measure that captures the rate at which the preferences of each user shift over time. Repeat consumption recommendations are generated based on factorizing the coupled tensor, by weighting the importance of past user preferences according to the captured rate. We also propose a variant, where the diversity of the side information is taken into account, by assigning higher weights to users that have rarer side information. Our experiments with real-world datasets from last.fm and MovieLens demonstrate that the proposed models outperform several baselines.
Spread it Good, Spread it Fast: Identification of Influential Nodes in Social Networks BIBAFull-Text 101-102
  Maria-Evgenia G. Rossi; Fragkiskos D. Malliaros; Michalis Vazirgiannis
Understanding and controlling spreading dynamics in networks presupposes the identification of those influential nodes that will trigger an efficient information diffusion. It has been shown that the best spreaders are the ones located in the core of the network -- as produced by the k-core decomposition. In this paper we further refine the set of the most influential nodes, showing that the nodes belonging to the best K-truss subgraph, as identified by the K-truss decomposition of the network, perform even better, leading to faster and wider epidemic spreading.
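To make the two decompositions concrete, here is a minimal sketch using the standard definitions: the k-core is the maximal subgraph in which every node has degree at least k, and the K-truss is the maximal subgraph in which every edge lies in at least K - 2 triangles. The toy graph and the function names are hypothetical, and the simple peeling loops favour clarity over the efficient algorithms a real analysis would use.

```python
from itertools import combinations


def make_graph(edges):
    """Undirected graph as {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def core_numbers(adj):
    """Core number of each node, by repeatedly peeling a minimum-degree node."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def k_truss(adj, k):
    """Edges of the k-truss: drop edges in fewer than k - 2 triangles until stable."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # working copy
    changed = True
    while changed:
        changed = False
        for u in list(g):
            for v in list(g[u]):
                # g[u] & g[v] holds the common neighbours, one per triangle.
                if u < v and len(g[u] & g[v]) < k - 2:
                    g[u].discard(v)
                    g[v].discard(u)
                    changed = True
    return {(u, v) for u in g for v in g[u] if u < v}


edges = list(combinations(range(4), 2))                   # clique K4 on nodes 0..3
edges += [(a, b) for a in (4, 5, 6) for b in (7, 8, 9)]   # triangle-free K3,3
edges.append((3, 4))                                      # bridge between the parts
adj = make_graph(edges)

core = core_numbers(adj)
truss_nodes = {v for edge in k_truss(adj, 4) for v in edge}
print(core)
print(sorted(truss_nodes))
```

In this toy graph all ten nodes share the same core number, while the 4-truss keeps only the clique. This is the sense in which a truss can single out a denser subset of the core.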
Probabilistic Deduplication of Anonymous Web Traffic BIBAFull-Text 103-104
  Rishiraj Saha Roy; Ritwik Sinha; Niyati Chhaya; Shiv Saini
Cookies and login-based authentication often provide incomplete data for stitching website visitors across multiple sources, necessitating probabilistic deduplication. We address this challenge by formulating the problem as a binary classification task for pairs of anonymous visitors. We compute visitor proximity vectors by converting categorical variables with very high cardinalities, such as IP addresses, product search keywords and URLs, to continuous numeric variables using the Jaccard coefficient for each attribute. Our method achieves about 90% AUC and F-scores in identifying whether two cookies map to the same visitor, while providing insights on the relative importance of available features in Web analytics towards the deduplication process.
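The proximity-vector step can be sketched as follows: for a pair of visitors, each high-cardinality attribute is reduced to one number, the Jaccard coefficient |A ∩ B| / |A ∪ B| of the two value sets. The attribute names, the sample cookies and the treatment of two empty sets are assumptions made for illustration, not details from the paper.

```python
ATTRIBUTES = ("ips", "keywords", "urls")  # hypothetical attribute names


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets give no evidence of a match
    return len(a & b) / len(a | b)


def proximity_vector(v1, v2, attributes=ATTRIBUTES):
    """One Jaccard score per attribute for a pair of visitor profiles."""
    return [jaccard(v1.get(attr, ()), v2.get(attr, ())) for attr in attributes]


# Made-up profiles for two anonymous cookies.
cookie_a = {
    "ips": {"10.0.0.1", "10.0.0.2"},
    "keywords": {"tent", "stove"},
    "urls": {"/camping", "/cart"},
}
cookie_b = {
    "ips": {"10.0.0.1"},
    "keywords": {"tent", "lantern"},
    "urls": {"/home"},
}

vec = proximity_vector(cookie_a, cookie_b)
print(vec)
```

Such vectors, labelled as same-visitor or different-visitor pairs, could then be passed to any binary classifier.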
Pushing the Limits of Instance Matching Systems: A Semantics-Aware Benchmark for Linked Data BIBAFull-Text 105-106
  Tzanina Saveta; Evangelia Daskalaki; Giorgos Flouris; Irini Fundulaki; Melanie Herschel; Axel-Cyrille Ngonga Ngomo
The architectural choices behind the Data Web have led to the publication of large interrelated data sets that contain different descriptions for the same real-world objects. Due to the mere size of current online datasets, such duplicate instances are most commonly detected (semi-)automatically using instance matching frameworks. Choosing the right framework for this purpose remains tedious, as current instance matching benchmarks fail to provide end users and developers with the necessary insights pertaining to how current frameworks behave when dealing with real data. In this poster, we present the Semantic Publishing Instance Matching Benchmark (SPIMBENCH) which allows the benchmarking of instance matching systems against not only structure-based and value-based test cases, but also against semantics-aware test cases based on OWL axioms. SPIMBENCH features a scalable data generator and a weighted gold standard that can be used for debugging instance matching systems and for reporting how well they perform in various matching tasks.
Propagating Expiration Decisions in a Search Engine Result Cache BIBAFull-Text 107-108
  Fethi Burak Sazoglu; Özgür Ulusoy; Ismail Sengor Altingovde; Rifat Ozcan; Berkant Barla Cambazoglu
Detecting stale queries in a search engine result cache is an important problem. In this work, we propose a mechanism that propagates the expiration decision for a query to similar queries in the cache to re-adjust their time-to-live values.
Semantics-Driven Implicit Aspect Detection in Consumer Reviews BIBAFull-Text 109-110
  Kim Schouten; Nienke de Boer; Tjian Lam; Marijtje van Leeuwen; Ruud van Luijk; Flavius Frasincar
With consumer reviews becoming a mainstream part of e-commerce, a good method of detecting the product or service aspects that are discussed is desirable. This work focuses on detecting aspects that are not literally mentioned in the text, or implicit aspects. To this end, a co-occurrence matrix of synsets from WordNet and implicit aspects is constructed. The semantic relations that exist between synsets in WordNet are exploited to enrich the co-occurrence matrix with more contextual information. Comparing this method with a similar method which is not semantics-driven clearly shows the benefit of the proposed method. Especially corpora of limited size seem to benefit from the added semantic context.
AutoRec: Autoencoders Meet Collaborative Filtering BIBAFull-Text 111-112
  Suvash Sedhain; Aditya Krishna Menon; Scott Sanner; Lexing Xie
This paper proposes AutoRec, a novel autoencoder framework for collaborative filtering (CF). Empirically, AutoRec's compact and efficiently trainable model outperforms state-of-the-art CF techniques (biased matrix factorization, RBM-CF and LLORMA) on the MovieLens and Netflix datasets.
Generating Quiz Questions from Knowledge Graphs BIBAFull-Text 113-114
  Dominic Seyler; Mohamed Yahya; Klaus Berberich
We propose an approach to generate natural language questions from knowledge graphs such as DBpedia and YAGO. We stage this in the setting of a quiz game. Our approach, though, is general enough to be applicable in other settings. Given a topic of interest (e.g., Soccer) and a difficulty (e.g., hard), our approach selects a query answer, generates a SPARQL query having the answer as its sole result, before verbalizing the question.
Measuring and Characterizing Nutritional Information of Food and Ingestion Content in Instagram BIBAFull-Text 115-116
  Sanket S. Sharma; Munmun De Choudhury
Social media sites like Instagram have emerged as popular platforms for sharing ingestion and dining experiences. However, research on characterizing the nutritional information embedded in such content is limited. In this paper, we develop a computational method to extract nutritional information, specifically calorific content, from Instagram food posts. Next, we explore how the community reacts specifically to healthy versus non-healthy food postings. Based on a crowdsourced approach, our method was found to detect calorific content in posts with 89% accuracy. We further show the use of Instagram as a platform where sharing of moderately healthy food content is common, and such content also receives the most support from the community.
Helping Users Understand Their Web Footprints BIBAFull-Text 117-118
  Lisa Singh; Hui Yang; Micah Sherr; Yifang Wei; Andrew Hian-Cheong; Kevin Tian; Janet Zhu; Sicong Zhang; Tavish Vaidya; Elchin Asgarli
To help users better understand the potential risks associated with publishing data publicly, and the types of data that can be inferred by combining data from multiple online sources, we introduce a novel information exposure detection framework that generates and analyzes the web footprints users leave across the social web. We propose to use probabilistic operators, free text attribute extraction, and a population-based inference engine to generate the web footprints. Evaluation over public profiles from multiple sites shows that our framework successfully detects and quantifies information exposure using a small amount of non-sensitive initial knowledge.
Detecting Concept-level Emotion Cause in Microblogging BIBAFull-Text 119-120
  Shuangyong Song; Yao Meng
In this paper, we propose a Concept-level Emotion Cause Model (CECM), instead of mere word-level models, to discover the causes of microblogging users' diversified emotions on specific hot events. A modified topic-supervised biterm topic model is utilized in CECM to detect "emotion topics" in event-related tweets, and then context-sensitive topical PageRank is utilized to detect meaningful multiword expressions as emotion causes. Experimental results on a dataset from Sina Weibo, one of the largest microblogging websites in China, show that CECM can better detect emotion causes than baseline methods.
Topical Word Importance for Fast Keyphrase Extraction BIBAFull-Text 121-122
  Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
We propose an improvement on a state-of-the-art keyphrase extraction algorithm, Topical PageRank (TPR), incorporating topical information from topic models. While the original algorithm requires a random walk for each topic in the topic model being used, ours is independent of the topic model, computing but a single PageRank for each text regardless of the number of topics in the model. This increases the speed drastically and enables its use on large collections of text with vast topic models, while not altering the performance of the original algorithm.
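The single-walk idea can be sketched as one PageRank run over a word co-occurrence graph whose teleport distribution is biased by a per-word importance weight. The abstract does not say how that weight is derived from the topic model, so the `importance` values below are hypothetical, as are the toy graph and the function name.

```python
def personalized_pagerank(graph, weights, damping=0.85, iterations=100):
    """Power iteration for PageRank with a non-uniform teleport distribution.

    graph:   {word: {neighbour: edge_weight}}; assumes every word has
             at least one neighbour (no dangling nodes)
    weights: {word: non-negative importance}; normalised here to sum to 1
    """
    total = sum(weights[w] for w in graph)
    teleport = {w: weights[w] / total for w in graph}
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {w: (1.0 - damping) * teleport[w] for w in graph}
        for w, nbrs in graph.items():
            for n, edge_weight in nbrs.items():
                new_rank[n] += damping * rank[w] * edge_weight / out_weight[w]
        rank = new_rank
    return rank


# Toy co-occurrence graph: a ring of four words with equal edge weights,
# so any difference in rank comes from the teleport weights alone.
graph = {
    "neural": {"network": 1.0, "training": 1.0},
    "network": {"neural": 1.0, "model": 1.0},
    "model": {"network": 1.0, "training": 1.0},
    "training": {"model": 1.0, "neural": 1.0},
}
importance = {"neural": 0.7, "network": 0.1, "model": 0.1, "training": 0.1}

rank = personalized_pagerank(graph, importance)
for word in sorted(rank, key=rank.get, reverse=True):
    print(word, round(rank[word], 4))
```

One run per document replaces one run per topic, which is where the speed-up described in the abstract would come from.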
When Topic Models Disagree: Keyphrase Extraction with Multiple Topic Models BIBAFull-Text 123-124
  Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
We explore how the unsupervised extraction of topic-related keywords benefits from combining multiple topic models. We show that averaging multiple topic models, inferred from different corpora, leads to more accurate keyphrases than when using a single topic model and other state-of-the-art techniques. The experiments confirm the intuitive idea that a prerequisite for the significant benefit of combining multiple models is that the models should be sufficiently different, i.e., they should provide distinct contexts in terms of topical word importance.
Modeling User Activities on the Web using Paragraph Vector BIBAFull-Text 125-126
  Yukihiro Tagami; Hayato Kobayashi; Shingo Ono; Akira Tajima
Modeling user activities on the Web is a key problem for various Web services, such as news article recommendation and ad click prediction. In this paper, we propose an approach that summarizes each sequence of user activities using the Paragraph Vector, considering users and activities as paragraphs and words, respectively. The learned user representations are shared across the user-related prediction tasks. We evaluate this approach on two data sets based on logs from Web services of Yahoo! JAPAN. Experimental results demonstrate the effectiveness of our proposed methods.
Lights, Camera, Action: Knowledge Extraction from Movie Scripts BIBAFull-Text 127-128
  Niket Tandon; Gerhard Weikum; Gerard de Melo; Abir De
With the success of large knowledge graphs, research on automatically acquiring commonsense knowledge is revived. One kind of knowledge that has not received attention is that of human activities. This paper presents an information extraction pipeline for systematically distilling activity knowledge from a corpus of movie scripts. Our semantic frames capture activities together with their participating agents and their typical spatial, temporal and sequential contexts. The resulting knowledge base comprises about 250,000 activities with links to specific movie scenes where they occur.
Assessing the Reliability of Facebook User Profiling BIBAFull-Text 129-130
  Thomas Theodoridis; Symeon Papadopoulos; Yiannis Kompatsiaris
User profiling is an essential component of most modern online services offered upon user registration. Profiling typically involves the tracking and processing of users' online traces (e.g., page views/clicks) with the goal of inferring attributes of interest for them. The primary motivation behind profiling is to improve the effectiveness of advertising by targeting users with appropriately selected ads based on their profile attributes, e.g., interests, demographics, etc. Yet, there has been an increasing number of cases, where the advertising content users are exposed to is either irrelevant or not possible to explain based on their online activities. More disturbingly, automatically inferred user attributes are often used to make real-world decisions (e.g., job candidate selection) without the knowledge of users. We argue that many of these errors are inherent in the underlying user profiling process. To this end, we attempt to quantify the extent of such errors, focusing on a dataset of Facebook users and their likes, and conclude that profiling-based targeting is highly unreliable for a sizeable subset of users.
Modelling Time-aware Search Tasks for Search Personalisation BIBAFull-Text 131-132
  Thanh Tien Vu; Alistair Willis; Dawei Song
Recent research has shown that mining and modelling search tasks helps improve the performance of search personalisation. Some approaches have been proposed to model a search task using topics discussed in relevant documents, where the topics are usually obtained from a human-generated online ontology such as the Open Directory Project. A limitation of these approaches is that many documents may not contain the topics covered in the ontology. Moreover, the previous studies largely ignored the dynamic nature of the search task; over time, the search intent and user interests may also change.
   This paper addresses these problems by modelling search tasks with time-awareness using latent topics, which are automatically extracted from the task's relevant documents by an unsupervised topic modelling method (i.e., Latent Dirichlet Allocation). In the experiments, we utilise the time-aware search task to re-rank the result list returned by a commercial search engine and demonstrate a significant improvement in the ranking quality.
Rethink Targeting: Detect 'Smart Cheating' in Online Advertising through Causal Inference BIBAFull-Text 133-134
  Pengyuan Wang; Dawei Yin; Jian Yang; Yi Chang; Marsha Meytlis
In online advertising, one of the central questions of ad campaign assessment is whether the ad truly adds value for the advertisers. To measure the incremental effect of ads, the ratio of the success rates of the users who were and who were not exposed to ads is usually calculated to represent ad effectiveness. Many existing campaigns simply target the users with a high predicted success (e.g. purchases, searches) rate. This neglects the fact that even without ad exposure, the targeted group of users might still perform the success actions, and hence show a higher ratio than the true ad effectiveness. We call this phenomenon 'smart cheating'. Failure to discount smart cheating when assessing ad campaigns may favor the targeting plan that cheats hard, but such targeting does not lead to the maximal incremental success actions and results in wasted budget. In this paper we define and quantify smart cheating with a smart cheating ratio (SCR) through causal inference. We apply our approach to multiple real ad campaigns, and find that smart cheating exists extensively and can be rather severe in the current advertising industry.
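A small worked example shows why the naive ratio overstates ad effectiveness when exposure is concentrated on users who would often succeed anyway. It is a generic confounding illustration with made-up counts and hypothetical segment names. It does not reproduce the paper's SCR, whose definition the abstract does not give.

```python
# (users, successes) per (segment, exposed?) cell; all numbers are made up.
# Exposure is concentrated on the high-intent segment.
counts = {
    ("high_intent", True): (9000, 1080),
    ("high_intent", False): (1000, 100),
    ("low_intent", True): (1000, 12),
    ("low_intent", False): (9000, 90),
}


def success_rate(cells):
    """Pooled success rate over a list of (users, successes) cells."""
    users = sum(u for u, _ in cells)
    successes = sum(s for _, s in cells)
    return successes / users


def naive_ratio(counts):
    """Exposed vs. unexposed success-rate ratio, ignoring who was targeted."""
    exposed = [cell for (_, is_exposed), cell in counts.items() if is_exposed]
    control = [cell for (_, is_exposed), cell in counts.items() if not is_exposed]
    return success_rate(exposed) / success_rate(control)


def segment_ratio(counts, segment):
    """The same ratio computed inside a single segment."""
    return (success_rate([counts[(segment, True)]])
            / success_rate([counts[(segment, False)]]))


naive = naive_ratio(counts)
high = segment_ratio(counts, "high_intent")
low = segment_ratio(counts, "low_intent")
print(round(naive, 2), round(high, 2), round(low, 2))
```

Within each segment the ad raises the success rate by the same modest factor, but the pooled ratio is several times larger because the exposed group is dominated by high-intent users.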
Questions vs. Queries in Informational Search Tasks BIBAFull-Text 135-136
  Ryen W. White; Matthew Richardson; Wen-tau Yih
Search systems traditionally require searchers to formulate information needs as keywords rather than in a more natural form, such as questions. Recent studies have found that Web search engines are observing an increase in the fraction of queries phrased as natural language. As part of building better search engines, it is important to understand the nature and prevalence of these intentions, and the impact of this increase on search engine performance. In this work, we show that while 10.3% of queries issued to a search engine have direct question intent, only 3.2% of them are formulated as natural language questions. We investigate whether search engines perform better when search intent is stated as queries or questions, and we find that they perform equally well for both.
Why Do You Follow Him?: Multilinear Analysis on Twitter BIBAFull-Text 137-138
  Yuto Yamaguchi; Mitsuo Yoshida; Christos Faloutsos; Hiroyuki Kitagawa
Why does Smith follow Johnson on Twitter? In most cases, the reason why users follow other users is unavailable. In this work, we answer this question by proposing TagF, which analyzes the who-follows-whom network (matrix) and the who-tags-whom network (tensor) simultaneously. Concretely, our method decomposes a coupled tensor constructed from this matrix and tensor. The experimental results on million-scale Twitter networks show that TagF uncovers different, but explainable, reasons why users follow other users.
Topic-aware Social Influence Minimization BIBAFull-Text 139-140
  Qipeng Yao; Ruisheng Shi; Chuan Zhou; Peng Wang; Li Guo
In this paper, we address the problem of minimizing the negative influence of undesirable things in a network by blocking a limited number of nodes, from a topic modeling perspective. When an undesirable thing such as a rumor or an infection emerges in a social network and some of the users have already been infected, our goal is to minimize the number of ultimately infected users by blocking k nodes outside the infected set. We first employ HDP-LDA and KL divergence to analyze the influence and relevance from a topic modeling perspective. We then propose two topic-aware heuristics, based on betweenness and out-degree, for finding approximate solutions to this problem. Using two real networks, we demonstrate experimentally the high performance of the proposed models and learning schemes.
Topic-aware Source Locating in Social Networks BIBAFull-Text 141-142
  Wenyu Zang; Chuan Zhou; Li Guo; Peng Zhang
In this paper we address the problem of source locating in social networks from a topic modeling perspective. From the observation that the topic factor can help infer the propagation paths, we propose a topic-aware source locating method based on topic analysis of propagation items and participants. We evaluate our algorithm on both generated and real-world datasets. The experimental results show significant improvement over existing popular methods.
Towards Entity Correctness, Completeness and Emergence for Entity Recognition BIBAFull-Text 143-144
  Lei Zhang; Yunpeng Dong; Achim Rettinger
Linking words or phrases in unstructured text to entities in knowledge bases is the problem of entity recognition and disambiguation. In this paper, we focus on the task of entity recognition in Web text to address the challenges of entity correctness, completeness and emergence that existing approaches mainly suffer from. Experimental results show that our approach significantly outperforms the state-of-the-art approaches in terms of precision, F-measure, micro-accuracy and macro-accuracy, while still preserving high recall.
Identifying Regrettable Messages from Tweets BIBAFull-Text 145-146
  Lu Zhou; Wenbo Wang; Keke Chen
Inappropriate tweets may cause severe damage to the authors' reputation or privacy. However, many users do not realize the potential damage when publishing such tweets. Published tweets have lasting effects that may not be completely eliminated by simple deletion, because other users may have read them or third-party tweet analysis platforms may have cached them. In this paper, we study the problem of identifying regrettable tweets from normal individual users, with the ultimate goal of reducing the occurrences of regrettable tweets. We explore the contents of a set of tweets deleted by sample normal users to understand the regrettable tweets. With a set of features describing the identifiable reasons, we can develop classifiers to effectively distinguish such regrettable tweets from normal tweets.


Who are the American Vegans related to Brad Pitt?: Exploring Related Entities BIBAFull-Text 151-154
  Nitish Aggarwal; Kartik Asooja; Housam Ziad; Paul Buitelaar
In this demo, we present Entity Relatedness Graph (EnRG), a focused related entities explorer, which provides users with a dynamic set of filters and facets. It gives a ranked list of entities related to a given entity, and clusters them using the different filters. For instance, using EnRG, one can easily find the American vegans related to Brad Pitt or Irish universities related to Semantic Web. Moreover, EnRG helps a user in discovering the provenance for implicit relations between two entities. EnRG uses distributional semantics to obtain the relatedness scores between two entities.
TeMex: The Web Template Extractor BIBAFull-Text 155-158
  Julián Alarte; David Insa; Josep Silva; Salvador Tamarit
This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic, and it can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed) and, more importantly, it does not need a predefined set of webpages to perform the analysis. TeMex only needs a URL. Contrary to previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.
Roomba: Automatic Validation, Correction and Generation of Dataset Metadata BIBAFull-Text 159-162
  Ahmad Assaf; Aline Senart; Raphaël Troncy
Data is being published by both the public and private sectors and covers a diverse set of domains ranging from life sciences to media or government data. An example is the Linked Open Data (LOD) cloud which is potentially a gold mine for organizations and individuals who are trying to leverage external data sources in order to produce more informed business decisions. Considering the significant variation in size, the languages used and the freshness of the data, one realizes that spotting spam datasets or simply finding useful datasets without prior knowledge is increasingly complicated. In this paper, we propose Roomba, a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. While Roomba is generic, we target CKAN-based data portals and we validate our approach against a set of open data portals including the Linked Open Data (LOD) cloud as viewed on the DataHub. The results demonstrate that the general state of various datasets and groups, including the LOD cloud group, needs more attention as most of the datasets suffer from bad quality metadata and lack some informative metrics that are required to facilitate dataset search.
AutoTag 'n Search My Photos: Leveraging the Social Graph for Photo Tagging BIBAFull-Text 163-166
  Shobana Balakrishnan; Surajit Chaudhuri; Vivek Narasayya
Personal photo collections are large and growing rapidly. Today, it is difficult to search such a photo collection for the people who appear in it, since manually associating face tags with photos is tedious. The key idea is to learn face models for friends and family of the user using tagged photos in a social graph such as Facebook as training examples. These face models are then used to automatically tag photos in the collection, thereby making it more searchable and easier to organize. To illustrate this idea we have developed a Windows app called AutoTag 'n Search My Photos. In this demo paper we describe the architecture, user interaction and controls, and our initial learnings from deploying the app.
Two New Gestures to Zoom: Enhancing Online Maps Services BIBAFull-Text 167-170
  Alessio Bellino
Online services such as Google Maps or OpenStreetMap allow the exploration of maps on smartphones and tablets. The gestures used are the pinch to adjust the zoom level and the drag to move the map. In this paper, two new gestures to adjust the zoom level of maps are presented. Both gestures, with slight differences, allow the identification of a target area to zoom, which is automatically enlarged to cover the whole map container. The proposed gestures are added to the traditional ones (drag, pinch and flick) without any overlap. Therefore, users do not need to change their regular practices. They have just two more options to control the zoom level. One of the most relevant and appreciated advantages has to do with the gesture for smartphones (Tap&Tap): this allows users to control the zoom level with just one hand. The traditional pinch gesture, instead, needs two hands. According to test results comparing the new gestures with the traditional pinch, 30% of time is saved on tablets (Two-Finger-Tap gesture) and 14% on smartphones (Tap&Tap gesture).
Champagne: A Web Tool for the Execution of Crowdsourcing Campaigns BIBAFull-Text 171-174
  Carlo Bernaschina; Ilio Catallo; Piero Fraternali; Davide Martinenghi; Marco Tagliasacchi
We present Champagne, a web tool for the execution of crowdsourcing campaigns. Through Champagne, task requesters can model crowdsourcing campaigns as a sequence of choices regarding different, independent crowdsourcing design decisions. Such decisions include, e.g., the possibility of qualifying some workers as expert reviewers, or of combining different quality assurance techniques to be used during campaign execution. In this regard, a walkthrough example showcasing the capabilities of the platform is reported. Moreover, we show that our modular approach in the design of campaigns overcomes many of the limitations exposed by the major platforms available in the market.
Social Glass: A Platform for Urban Analytics and Decision-making Through Heterogeneous Social Data BIBAFull-Text 175-178
  Stefano Bocconi; Alessandro Bozzon; Achilleas Psyllidis; Christiaan Titos Bolivar; Geert-Jan Houben
This demo presents Social Glass, a novel web-based platform that supports the analysis, valorisation, integration, and visualisation of large-scale and heterogeneous urban data in the domains of city planning and decision-making. The platform systematically combines publicly available social datasets from municipalities together with social media streams (e.g. Twitter, Instagram and Foursquare) and resources from knowledge repositories. It further enables the mapping of demographic information, human movement patterns, place popularity, traffic conditions, as well as citizens' and visitors' opinions and preferences with regard to specific venues in the city. Social Glass will be demonstrated through several real-world case studies that exemplify the framework's conceptual properties and its potential value as a solution for urban analytics and city-scale event monitoring and assessment.
Browser Record and Replay as a Building Block for End-User Web Automation Tools BIBAFull-Text 179-182
  Sarah Chasins; Shaon Barman; Rastislav Bodik; Sumit Gulwani
To build a programming by demonstration (PBD) web scraping tool for end users, one needs two central components: a list finder, and a record and replay tool. A list finder extracts logical tables from a webpage. A record and replay (R+R) system records a user's interactions with a webpage, and replays them programmatically. The research community has invested substantial work in list finding -- variously called wrapper induction, structured data extraction, and template detection. In contrast, researchers largely considered the browser R+R problem solved until recently, when webpage complexity and interactivity began to rise. We argue that the increase in interactivity necessitates the use of new, more robust R+R approaches, which will facilitate the PBD web tools of the future. Because robust R+R is difficult to build and understand, we argue that tool developers need an R+R layer that they can treat as a black box. We have designed an easy-to-use API that allows programmers to use and even customize R+R, without having to understand R+R internals. We have instantiated our API in Ringer, our robust R+R tool. We use the API to implement WebCombine, a PBD scraping tool. A WebCombine user demonstrates how to collect the first row of a relational dataset, and the tool collects all remaining rows. WebCombine uses the Ringer API to handle navigation between pages, enabling users to scrape from modern, interaction-heavy pages. We demonstrate WebCombine by collecting a 3,787,146 row dataset from Google Scholar that allows us to explore the relationship between researchers' years of experience and their papers' citation counts.
DiagramFlyer: A Search Engine for Data-Driven Diagrams BIBAFull-Text 183-186
  Zhe Chen; Michael Cafarella; Eytan Adar
A large amount of data is available only through data-driven diagrams such as bar charts and scatterplots. These diagrams are stylized mixtures of graphics and text and are the result of complicated data-centric production pipelines. Unfortunately, neither text nor image search engines exploit these diagram-specific properties, making it difficult for users to find relevant diagrams in a large corpus. In response, we propose DiagramFlyer, a search engine for finding data-driven diagrams on the web. By recovering the semantic roles of diagram components (e.g., axes and labels), we provide faceted indexing and retrieval for various statistical diagrams. A unique feature of DiagramFlyer is that it is able to "expand" queries to include not only exactly matching diagrams, but also diagrams that are likely to be related in terms of their production pipelines. We demonstrate the resulting search system by indexing over 300k images pulled from over 150k PDF documents.
Smith Search: Opinion-Based Restaurant Search Engine BIBAFull-Text 187-190
  Jaehoon Choi; Donghyeon Kim; Donghee Choi; Sangrak Lim; Seongsoon Kim; Jaewoo Kang; Youngjae Choi
Search engines have become an important decision-making tool today. Unfortunately, they still need to improve in answering complex queries. The answers to complex decision-making queries such as "best burgers and fries" and "good restaurants for anniversary dinner" are often subjective. The most relevant answer to such a query can be obtained only by collecting people's opinions about the query, which are expressed in various venues on the Web. Collected opinions are converted into a "consensus" list. All of this should be processed at query time, which is impossible under the current search paradigm. To address this problem, we introduce Smith, a novel opinion-based restaurant search engine. Smith actively processes opinions on the Web, blogs, review boards, and other forms of social media at index time, and produces consensus answers from opinions at query time. The Smith search app (iOS) is available for download at http://www.smithsearches.com/introduction/.
whoVIS: Visualizing Editor Interactions and Dynamics in Collaborative Writing Over Time BIBAFull-Text 191-194
  Fabian Flöck; Maribel Acosta
The visualization of editor interaction dynamics and provenance of content in revisioned, collaboratively written documents has the potential to allow for more transparency and intuitive understanding of the intricate mechanisms inherent to collective content production. Although approaches exist to build editor interactions from individual word changes in Wikipedia articles, they do not allow inquiry into individual interactions, and have yet to be implemented as usable end-user tools.
   We thus present whoVIS, a web tool to mine and visualize editor interactions in Wikipedia over time. whoVIS integrates novel features with existing methods, tailoring them to the use case of understanding intra-article disagreement between editors. Using real Wikipedia examples, our system demonstrates the combination of various visualization techniques to identify different social dynamics and explore the evolution of an article that would be particularly hard for end-users to investigate otherwise.
VizCurator: A Visual Tool for Curating Open Data BIBAFull-Text 195-198
  Bahar Ghadiri Bashardoost; Christina Christodoulakis; Soheil Hassas Yeganeh; Renée J. Miller; Kelly Lyons; Oktie Hassanzadeh
VizCurator permits the exploration, understanding and curation of open RDF data, its schema, and how it has been linked to other sources. We provide visualizations that enable one to seamlessly navigate through RDFS and RDF layers and quickly understand the open data, how it has been mapped or linked, how it has been structured (and could be restructured), and how deeply it has been related to other open data sources. More importantly, VizCurator provides a rich set of tools for data curation. It suggests possible improvements to the structure of the data and enables curators to make informed decisions about enhancements to the exploration and exploitation of the data. Moreover, VizCurator facilitates the mining of temporal resources and the definition of temporal constraints through which the curator can identify conflicting facts. Finally, VizCurator can be used to create new binary temporal relations by reifying base facts and linking them to temporal resources. We will demonstrate VizCurator using LinkedCT.org, a five-star open data set mapped from the XML NIH clinical trials data (clinicaltrials.gov) that we have been maintaining and curating for several years.
queryCategorizr: A Large-Scale Semi-Supervised System for Categorization of Web Search Queries BIBAFull-Text 199-202
  Mihajlo Grbovic; Nemanja Djuric; Vladan Radosavljevic; Narayan Bhamidipati; Jordan Hawker; Caleb Johnson
Understanding the interests expressed through a user's search query is a task of critical importance for many internet applications. To help identify user interests, web engines commonly classify queries into one or more pre-defined interest categories. However, the majority of queries are noisy short texts, making accurate classification a challenging task. In this demonstration, we present queryCategorizr, a novel semi-supervised learning system that embeds queries into a low-dimensional vector space using a neural language model applied to search log sessions, and classifies them into general interest categories while relying on a small set of labeled queries. Empirical results on large-scale data show that queryCategorizr outperforms the current state-of-the-art approaches. In addition, we describe a Graphical User Interface (GUI) that allows users to query the system and explore classification results in an interactive manner.
DIVINA: Discovering Vulnerabilities of Internet Accounts BIBAFull-Text 203-206
  Ziad Ismail; Danai Symeonidou; Fabian Suchanek
Internet users typically have several online accounts -- such as mail accounts, cloud storage accounts, or social media accounts. The security of these accounts is often intricately linked: The password of one account can be reset by sending an email to another account; the data of one account can be backed up on another account; one account can only be accessed by two-factor authentication through a second account; and so forth. This poses three challenges: First, if a user loses one or several of his passwords, can he still access his data? Second, how many passwords does an attacker need in order to access the data? And finally, how many passwords does an attacker need in order to irreversibly delete the user's data? In this paper, we model the dependencies of online accounts in order to help the user discover security weaknesses. We have implemented our system and invite users to try it out on their real accounts.
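The dependency model described in this abstract can be illustrated with a minimal sketch. It is not the DIVINA implementation: the `ACCESS` table, the `pw:` naming convention and both helper functions are assumptions made for illustration. Each account lists alternative ways in, and a brute-force search finds a smallest set of passwords that lets an attacker reach a target account.

```python
from itertools import combinations

# Hypothetical dependency table: account -> alternative ways to get in.
# Each way is a set of requirements; "pw:<name>" is a password, any other
# string means "already has access to that account".
ACCESS = {
    "mail": [{"pw:mail"}],
    "cloud": [{"pw:cloud"}, {"mail"}],  # password can be reset via mail
    "bank": [{"pw:bank", "mail"}],      # two-factor through the mail account
}

def reachable(known_passwords, access):
    """Return everything obtainable from the given passwords (fixpoint)."""
    have = set(known_passwords)
    changed = True
    while changed:
        changed = False
        for account, ways in access.items():
            if account not in have and any(way <= have for way in ways):
                have.add(account)
                changed = True
    return have

def min_passwords(target, access):
    """Smallest password set that reaches `target`, or None if unreachable.

    Exhaustive search over subsets: fine for a handful of accounts only.
    """
    passwords = sorted(
        req
        for ways in access.values()
        for way in ways
        for req in way
        if req.startswith("pw:")
    )
    passwords = sorted(set(passwords))
    for k in range(len(passwords) + 1):
        for subset in combinations(passwords, k):
            if target in reachable(subset, access):
                return set(subset)
    return None

print(min_passwords("cloud", ACCESS))
print(min_passwords("bank", ACCESS))
```

In this toy table a single password suffices for the cloud account, while the bank account needs both its own password and the mail password, which is the kind of weakness the abstract's second question asks about.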
SmartComposition: Enhanced Web Components for a Better Future of Web Development BIBAFull-Text 207-210
  Michael Krug; Martin Gaedke
In this paper, we introduce the usage of enhanced Web Components to create web applications with multi-device capabilities by composition. By using the latest developments of the family of W3C standards called "Web Components", which we extend with dedicated communication and synchronization functionality, web developers are enabled to create web applications with ease. We enhance Web Components with an event-based communication channel, which is not limited to a single browser window. With our approach, applications using the extended SmartComponents and an additional synchronization service also support multi-device scenarios. In contrast to other widget-based approaches (W3C Widgets, OpenSocial containers), the usage of SmartComponents does not require a dedicated platform, like Apache Rave. SmartComponents are based on standard web technologies, are natively supported by recent web browsers and are loosely coupled using our extension. This ensures a high level of reuse. We show how SmartComponents are structured and how they can be created and used. Furthermore, we explain how the communication aspect is integrated and how multi-device communication is achieved. Finally, we describe our demonstration by outlining two example applications.
Ajax API Self-adaptive Framework for End-to-end User BIBAFull-Text 211-214
  Xiang Li; Zhiyong Feng; Keman Huang; Shizhan Chen
Web developers often use Ajax APIs to build rich Internet applications (RIAs). Due to the uncertainty of the environment, automatically switching among different Ajax APIs with similar functionality is important to guarantee end-to-end performance. However, switching is challenging and time-consuming because developers must manually modify code based on the API documentation. In this paper, we propose a framework to address self-adaptation and the difficulty of invoking Ajax APIs. An Ajax API wrapping model, consisting of specific and abstract components, is proposed to automatically construct the grammatical and functional semantic relations between Ajax APIs. A switching module is then introduced to support automatic switching among different Ajax APIs according to user preference and the QoS of the Ajax APIs. Taking map APIs (Google Map, Baidu Map, Gaode Map, 51 Map and Tencent Map) as an example, the demo shows that the framework can facilitate the construction of RIAs and improve the adaptability of the application. The process of selecting and switching among the different Ajax APIs is automatic and transparent to users.
GalaxyExplorer: Influence-Driven Visual Exploration of Context-Specific Social Media Interactions BIBAFull-Text 215-218
  Xiaotong Liu; Srinivasan Parthasarathy; Han-Wei Shen; Yifan Hu
The ever-increasing size and complexity of social networks pose a fundamental challenge to visual exploration and analysis tasks. In this paper, we present GalaxyExplorer, an influence-driven visual analysis system for exploring users of various influence and analyzing how they influence others in a social network. GalaxyExplorer reduces the size and complexity of a social network by dynamically retrieving theme-based graphs, and analyzing users' influence and passivity regarding specific themes and dynamics in response to disaster events. In GalaxyExplorer, a galaxy-based visual metaphor is introduced to simplify the visual complexity of a large graph with a focus+context view. Various interactions are supported for visual exploration. We present experimental results on real-world datasets that show the effectiveness of GalaxyExplorer in theme-aware influence analysis.
CubeViz: Exploration and Visualization of Statistical Linked Data BIBAFull-Text 219-222
  Michael Martin; Konrad Abicht; Claus Stadler; Axel-Cyrille Ngonga Ngomo; Tommaso Soru; Sören Auer
CubeViz is a flexible exploration and visualization platform for statistical data represented according to the RDF Data Cube vocabulary. If statistical data is provided in the Data Cube vocabulary, CubeViz exhibits a faceted browsing widget that allows users to interactively filter the observations to be visualized in charts. Based on the selected structural part, CubeViz offers suitable chart types and options with which users can configure the visualization. In this demo we present the CubeViz visualization architecture and components, and sketch its underlying API and the libraries used to generate the desired output. By employing advanced introspection, analysis and visualization bootstrapping techniques, CubeViz hides the schema complexity of the encoded data in order to support a user-friendly exploration experience.
EXPOSÉ: EXploring Past news fOr Seminal Events BIBAFull-Text 223-226
  Arunav Mishra; Klaus Berberich
Recent increases in digitization and archiving efforts on news data have led to overwhelming amounts of online information for general users, thus making it difficult for them to retrospect on past events. One dimension along which past events can be effectively organized is time. Motivated by this idea, we introduce EXPOSÉ, an exploratory search system that explicitly uses temporal information associated with events to link different kinds of information sources for effective exploration of past events. In this demonstration, we use Wikipedia and news articles as two orthogonal sources. Wikipedia is viewed as an event directory that systematically lists seminal events in a year; news articles are viewed as a source of detailed information on each of these events. To this end, our demo includes several time-aware retrieval approaches that a user can employ for retrieving relevant news articles, as well as a timeline tool for temporal analysis and entity-based facets for filtering results.
A Serious Game Powered by Semantic Web technologies BIBAFull-Text 227-230
  Bernardo Pereira Nunes; Terhi Nurmikko-Fuller; Giseli Rabello Lopes; Chiara Renso
ISCOOL is an interactive educational platform that helps users develop their skills for objective text analysis and interpretation. The tools incorporated into ISCOOL bridge various disparate sources, including reference datasets for people, and organizations, as well as gazetteers, dictionaries and collections of historical facts. This data serves as the basis for educating learners about the processes of evaluating the implicit and implied content of written material, whilst also providing a wider context in which this information is accessed, interpreted and understood. In the course of gameplay, the user is prompted to choose images that best capture content of a read passage. The interactive features of the game simultaneously test the user's existing knowledge, and ability to critically analyse the text. Results can be saved and shared, allowing the players to continue to interact with the data through conversations with their peers, friends, and family members, and to disseminate information throughout their communities. Users will be able to draw connections between the information they encounter in ISCOOL, and their daily realities -- participants are empowered, informed and educated.
Geosocial Search: Finding Places based on Geotagged Social-Media Posts BIBAFull-Text 231-234
  Barak Pat; Yaron Kanza; Mor Naaman
Geographic search -- where the user provides keywords and receives relevant locations depicted on a map -- is a popular web application. Typically, such search is based on static geographic data. However, the abundant geotagged posts in microblogs such as Twitter and in social networks like Instagram provide contemporary information that can be used to support geosocial search -- geographic search based on user activities in social media. Such search can point out where people talk (or tweet) about different topics. For example, the search results may show where people refer to "jogging", to indicate popular jogging places. The difficulty in implementing such search is that there is no natural partition of the space into "documents" as in ordinary web search. Thus, it is not always clear how to present results and how to rank and filter results effectively. In this paper, we demonstrate a two-step process of first quickly finding the relevant areas by using an arbitrary indexed partition of the space, and second applying clustering to the discovered areas to present more accurate results. We introduce a system that utilizes geotagged posts in geographic search and illustrate how different ranking methods can be used, based on the proposed two-step search process. The system demonstrates the effectiveness and usefulness of the approach.
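The two-step process in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' system: the uniform grid, the greedy centroid clustering, the planar treatment of latitude/longitude, and all function and parameter names (`build_index`, `search`, `min_hits`, `radius`) are hypothetical choices.

```python
from collections import defaultdict

def grid_cell(lat, lon, size):
    """Map a coordinate to a cell of an arbitrary uniform grid."""
    return (int(lat // size), int(lon // size))

def build_index(posts, size):
    """Index (lat, lon, text) posts by grid cell."""
    index = defaultdict(list)
    for lat, lon, text in posts:
        index[grid_cell(lat, lon, size)].append((lat, lon, text))
    return index

def search(index, keyword, min_hits, radius):
    # Step 1: coarse filter, keep cells with enough posts mentioning the keyword.
    hits = []
    for cell_posts in index.values():
        matching = [(lat, lon) for lat, lon, text in cell_posts
                    if keyword in text.lower()]
        if len(matching) >= min_hits:
            hits.extend(matching)
    # Step 2: greedy clustering of the surviving posts around running centroids
    # (coordinates are treated as planar, which is only reasonable locally).
    clusters = []
    for lat, lon in hits:
        for cluster in clusters:
            c_lat = sum(p[0] for p in cluster) / len(cluster)
            c_lon = sum(p[1] for p in cluster) / len(cluster)
            if (lat - c_lat) ** 2 + (lon - c_lon) ** 2 <= radius ** 2:
                cluster.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])
    # Rank areas by the number of matching posts they contain.
    return sorted(clusters, key=len, reverse=True)

# Made-up posts for illustration only.
posts = [
    (40.001, -73.001, "Morning jogging"),
    (40.002, -73.002, "jogging by the river"),
    (40.003, -73.001, "coffee break"),
    (41.503, -72.507, "jogging alone"),
]
index = build_index(posts, size=0.01)
results = search(index, "jogging", min_hits=2, radius=0.005)
print(len(results), [len(c) for c in results])
```

Ranking clusters by post count is just one possible ranking method; the abstract mentions that several can be plugged into the same two-step process.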
Extracting knowledge from text using SHELDON, a Semantic Holistic framEwork for LinkeD ONtology data BIBAFull-Text 235-238
  Diego Reforgiato Recupero; Andrea Giovanni Nuzzolese; Sergio Consoli; Valentina Presutti; Misael Mongiovì; Silvio Peroni
SHELDON is the first true hybridization of NLP machine reading and the Semantic Web. It extracts RDF data from text using a machine reader: the extracted RDF graphs are compliant with Semantic Web and Linked Data. It goes further and applies Semantic Web practices and technologies to extend the current human-readable web. The input is represented by a sentence in any language. SHELDON includes different capabilities in order to extend machine reading to Semantic Web data: frame detection, topic extraction, named entity recognition, resolution and coreference, terminology extraction, sense tagging and disambiguation, taxonomy induction, semantic role labeling, type induction, sentiment analysis, citation inference, relation and event extraction, and visualization tools which make use of the JavaScript infoVis Toolkit and RelFinder. A demo of SHELDON can be seen and used at http://wit.istc.cnr.it/stlab-tools/sheldon.
Cloud WorkBench: Benchmarking IaaS Providers based on Infrastructure-as-Code BIBAFull-Text 239-242
  Joel Scheuner; Jürgen Cito; Philipp Leitner; Harald Gall
Optimizing the deployment of applications in Infrastructure-as-a-Service clouds requires evaluating the costs and performance of different combinations of cloud configurations, which is unfortunately a cumbersome and error-prone process. In this paper, we present Cloud WorkBench (CWB), a concrete implementation of a cloud benchmarking Web service, which fosters the definition of reusable and representative benchmarks. We demonstrate the complete cycle of benchmarking an IaaS service with the sample benchmark SysBench. In contrast to existing work, our system is based on the notion of Infrastructure-as-Code, which is a state-of-the-art concept for defining IT infrastructure in a reproducible, well-defined, and testable way.
An Overview of Microsoft Academic Service (MAS) and Applications BIBAFull-Text 243-246
  Arnab Sinha; Zhihong Shen; Yang Song; Hao Ma; Darrin Eide; Bo-June (Paul) Hsu; Kuansan Wang
In this paper we describe a new release of a Web scale entity graph that serves as the backbone of Microsoft Academic Service (MAS), a major production effort with a broadened scope to the namesake vertical search engine that has been publicly available since 2008 as a research prototype. At the core of MAS is a heterogeneous entity graph comprising six types of entities that model scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, this version includes data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine. As a result of the Bing integration, the new MAS graph sees a significant increase in size, with fresh information streaming in automatically following its discovery by the search engine. In addition, the rich entity relations included in the knowledge base provide additional signals to disambiguate and enrich the entities within and beyond the academic domain. The number of papers indexed by MAS, for instance, has grown from low tens of millions to 83 million while maintaining an above 95% accuracy based on test data sets derived from academic activities at Microsoft Research. Based on the data set, we demonstrate two scenarios in this work: a knowledge driven, highly interactive dialog that seamlessly combines reactive search and proactive suggestion experience, and a proactive heterogeneous entity recommendation.
Time-travel Translator: Automatically Contextualizing News Articles BIBAFull-Text 247-250
  Nam Khanh Tran; Andrea Ceroni; Nattiya Kanhabua; Claudia Niederée
Fully understanding an older news article requires context knowledge from the time of article creation. Finding information about such context is a tedious and time-consuming task, which distracts the reader. Simple contextualization via Wikification is not sufficient here. The retrieved context information has to be time-aware, concise (not full Wikipages) and focused on the coherence of the article topic. In this paper, we present Contextualizer, a web-based system that acquires additional information to support the interpretation of a news article of interest that requires a mapping, in this case a kind of time-travel translation, between present context knowledge and context knowledge at the time of text creation. For a given article, the system provides a GUI that allows users to highlight keywords of interest, which are then used to construct appropriate queries for retrieving contextualization candidates. Contextualizer exploits different kinds of information such as temporal similarity and textual complementarity to re-rank the candidates and presents them to users in a friendly and interactive web-based interface.
Kvasir: Seamless Integration of Latent Semantic Analysis-Based Content Provision into Web Browsing BIBAFull-Text 251-254
  Liang Wang; Sotiris Tasoulis; Teemu Roos; Jussi Kangasharju
The Internet is overloading its users with excessive information flows, so that effective content-based filtering becomes crucial in improving user experience and work efficiency. We build Kvasir, a semantic recommendation system, atop latent semantic analysis and other state-of-the-art technologies to seamlessly integrate an automated and proactive content provision service into web browsing. We utilize the power of Apache Spark to scale up Kvasir to a practical Internet service. Herein we present the architecture of Kvasir, along with our solutions to the technical challenges in the actual system implementation.
SemMobi: A Semantic Annotation System for Mobility Data BIBAFull-Text 255-258
  Fei Wu; Hongjian Wang; Zhenhui Li; Wang-Chien Lee; Zhuojie Huang
The wide adoption of mobile devices embedded with modern positioning technology enables the collection of valuable mobility data from users. At the same time, the large-scale user-generated data from social media, such as geo-tagged tweets, provide rich semantic information about events and locations. The combination of the mobility data and social media data brings opportunities for us to study the semantics behind people's movement, i.e., understand why a person travels to a location at a particular time. Previous work has used maps or POI (point of interest) databases as the source for semantics. However, those semantics are static, and thus miss important dynamic event information. To provide dynamic semantic annotation, we propose to use contextual social media. More specifically, the semantics could be landmark information (e.g., a museum or an arena) or event information (e.g., sports games or concerts). The SemMobi system implements our recently developed annotation method, which has been accepted to the WWW 2015 conference. The annotation method assigns words to each mobility record based on the local density of words, estimated by a Kernel Density Estimation model. The annotated mobility data contain rich and interpretable information, and can therefore benefit applications such as personalized recommendation, targeted advertisement, and movement prediction. Our system is built upon large-scale tweet datasets. A user-friendly interface is designed to support interactive exploration of the result.
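The idea of annotating a mobility record with the words whose local density is highest can be sketched as below. This is only an illustration of kernel density estimation, not the SemMobi method: the Gaussian kernel, the fixed `bandwidth`, the planar coordinates, the per-word normalisation, and the `annotate` helper are all assumptions.

```python
import math

def kde_density(point, samples, bandwidth):
    """2-D Gaussian kernel density estimate at `point` from `samples`."""
    if not samples:
        return 0.0
    px, py = point
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(samples))
    total = 0.0
    for sx, sy in samples:
        d2 = (px - sx) ** 2 + (py - sy) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total

def annotate(record, word_locations, bandwidth, top_k=2):
    """Return the `top_k` words with the highest estimated density at `record`.

    `word_locations` maps each word to the locations of posts containing it.
    """
    scores = {
        word: kde_density(record, locations, bandwidth)
        for word, locations in word_locations.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up word locations in arbitrary planar units, for illustration only.
word_locations = {
    "concert": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "museum": [(5.0, 5.0), (5.1, 5.0)],
}
print(annotate((0.0, 0.0), word_locations, bandwidth=1.0, top_k=1))
```

A real system would also need a time dimension to capture events, since the abstract stresses that the semantics are dynamic; the sketch covers only the spatial part.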
Towards An Interactive Keyword Search over Relational Databases BIBAFull-Text 259-262
  Zhong Zeng; Zhifeng Bao; Mong Li Lee; Tok Wang Ling
Keyword search over relational databases has been widely studied for the exploration of structured data in a user-friendly way. However, users typically have limited domain knowledge or are unable to precisely specify their search intention. Existing methods find the minimal units that contain all the query keywords, and largely ignore the interpretation of users' possible search intentions. As a result, users are often overwhelmed with many irrelevant answers. Moreover, without a visually pleasing way to present the answers, users often have difficulty understanding the answers because of their complex structures. Therefore, we design an interactive yet visually pleasing search paradigm called ExpressQ. ExpressQ extends the keyword query language to include keywords that match meta-data, e.g., names of relations and attributes. These keywords are utilized to infer users' search intention. Each possible search intention is represented as a query pattern, whose meaning is described in natural language. Through a series of user interactions, ExpressQ can determine the search intention of the user, and translate the corresponding query patterns into SQL queries to retrieve answers to the query. The ExpressQ prototype is available at http://expressq.comp.nus.edu.sg.

WebSci Track Papers & Posters

A Study of Distinctiveness in Web Results of Two Search Engines BIBAFull-Text 267-273
  Rakesh Agrawal; Behzad Golshan; Evangelos Papalexakis
Google and Bing have emerged as the diarchy that arbitrates what documents are seen by Web searchers, particularly those desiring English language documents. We seek to study how distinctive the top results presented to users by the two search engines are. A recent eye-tracking study has shown that web searchers decide whether to look at a document primarily based on the snippet and secondarily on the title of the document on the web search result page, and rarely based on the URL of the document. Given that the snippet and title generated by different search engines for the same document are often syntactically different, we first develop tools appropriate for conducting this study. Our empirical evaluation using these tools shows a surprising agreement in the results produced by the two engines for a wide variety of queries used in our study. Thus, this study raises the open question of whether it is feasible to design a search engine that would produce results that are distinct from those produced by Google and Bing and that users will find helpful.
Correlation of Node Importance Measures: An Empirical Study through Graph Robustness BIBAFull-Text 275-281
  Mirza Basim Baig; Leman Akoglu
Graph robustness is a measure of resilience to failures and targeted attacks. A large body of research on robustness focuses on how to attack a given network by deleting a few nodes so as to maximally disrupt its connectedness. As a result, the literature contains a myriad of attack strategies that rank nodes by their relative importance for this task. How different are these strategies? Do they pick similar sets of target nodes, or do they differ significantly in their choices? In this paper, we perform the first large scale empirical correlation analysis of attack strategies, i.e., the node importance measures that they employ, for graph robustness. We approach this task in three ways: by analyzing similarities based on (i) their overall ranking of the nodes, (ii) the characteristics of top nodes that they pick, and (iii) the dynamics of disruption that they cause on the network. Our study of 15 different (randomized, local, distance-based, and spectral) strategies on 68 real-world networks reveals surprisingly high correlations among node-attack strategies, consistent across all three types of analysis, and identifies groups of comparable strategies. These findings suggest that some computationally complex strategies can be closely approximated by simpler ones, and a few strategies can be used as a close proxy of the consensus among all of them.
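Comparing two node importance measures by their overall ranking of the nodes, as in analysis (i), can be illustrated with a rank correlation. The abstract does not say which coefficient the authors use; the sketch below assumes Spearman's rank correlation, and the two score lists are made-up values for four nodes.

```python
def ranks(scores):
    """1-based ranks; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical importance scores of the same four nodes under two measures.
degree = [5, 3, 3, 1]
betweenness = [0.9, 0.4, 0.5, 0.0]
print(round(spearman(degree, betweenness), 3))
```

A value close to 1 would mean the two measures order the nodes almost identically, which is the sense in which a cheap measure could stand in for an expensive one.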
Investigating Similarity Between Privacy Policies of Social Networking Sites as a Precursor for Standardization BIBAFull-Text 283-289
  Emma Cradock; David Millard; Sophie Stalla-Bourdillon
The current execution of privacy policies, as a mode of communicating information to users, is unsatisfactory. Social networking sites (SNS) exemplify this issue, attracting growing concerns regarding their use of personal data and its effect on user privacy. This demonstrates the need for more informative policies. However, SNS lack the incentives required to improve policies, which is exacerbated by the difficulties of creating a policy that is both concise and compliant. Standardization addresses many of these issues, providing benefits for users and SNS, although it is only possible if policies share attributes which can be standardized. This investigation used thematic analysis and cross-document structure theory to assess the similarity of attributes between the privacy policies (as available in August 2014) of the six most frequently visited SNS globally. Using the Jaccard similarity coefficient, two types of attribute were measured: the clauses used by SNS and the coverage of forty recommendations made by the UK Information Commissioner's Office. Analysis showed that whilst similarity in the clauses used was low, similarity in the recommendations covered was high, indicating that SNS use different clauses, but to convey similar information. The analysis also showed that low similarity in the clauses was largely due to differences in semantics, elaboration and functionality between SNS. Therefore, this paper proposes that the policies of SNS already share attributes, indicating the feasibility of standardization, and five recommendations are made to begin facilitating this, based on the findings of the investigation.
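The Jaccard similarity coefficient mentioned above is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows, assuming each policy is reduced to a set of attribute labels; the labels and policy names are hypothetical and are not taken from the study.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical attribute sets extracted from two privacy policies.
policy_a = {"data_collection", "third_party_sharing", "cookies"}
policy_b = {"data_collection", "cookies", "retention", "deletion"}
similarity = jaccard(policy_a, policy_b)
print(similarity)
```

Here the two policies share 2 of 5 distinct attributes, so the coefficient is 2/5.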
TopChurn: Maximum Entropy Churn Prediction Using Topic Models Over Heterogeneous Signals BIBAFull-Text 291-297
  Manirupa Das; Micha Elsner; Arnab Nandi; Rajiv Ramnath
With the onset of social media and news aggregators on the Web, the newspaper industry is faced with a declining subscriber base. In order to retain customers both on-line and in print, it is therefore critical to predict and mitigate customer churn. Newspapers typically have heterogeneous sources of valuable data: circulation data, customer subscription information, news content, and search click log data. An ensemble of predictive models over multiple sources faces unique challenges -- ascertaining short-term versus long-term effects of features on churn, and determining mutual information properties across multiple data sources. We present TopChurn, a novel system that uses topic models as a means of extracting dominant features from user complaints and Web data for churn prediction. TopChurn uses a maximum entropy-based approach to identify features that are most indicative of subscribers likely to drop subscription within a specified period of time. We conduct temporal analyses to determine long-term versus short-term effects of status changes on subscriber accounts, included in our temporal models of churn; and topic and sentiment analyses on news and clicklogs, included in our Web models of churn. We then validate our insights via experiments over real data from The Columbus Dispatch, a mainstream daily newspaper, and demonstrate that our churn models significantly outperform baselines for various prediction windows.
Deep Feelings: A Massive Cross-Lingual Study on the Relation between Emotions and Virality BIBAFull-Text 299-305
  Marco Guerini; Jacopo Staiano
This article provides a comprehensive investigation on the relations between virality of news articles and the emotions they are found to evoke. Virality, in our view, is a phenomenon with many facets, i.e. under this generic term several different effects of persuasive communication are comprised. By exploiting a high-coverage and bilingual corpus of documents containing metrics of their spread on social networks as well as a massive affective annotation provided by readers, we present a thorough analysis of the interplay between evoked emotions and viral facets. We highlight and discuss our findings in light of a cross-lingual approach: while we discover differences in evoked emotions and corresponding viral effects, we provide preliminary evidence of a generalized explanatory model rooted in the deep structure of emotions: the Valence-Arousal-Dominance (VAD) circumplex. We find that viral facets appear to be consistently affected by particular VAD configurations, and these configurations indicate a clear connection with distinct phenomena underlying persuasive communication.
User Behavior Characterization of a Large-scale Mobile Live Streaming System BIBAFull-Text 307-313
  Zhenyu Li; Gaogang Xie; Mohamed Ali Kaafar; Kave Salamatian
Streaming live content to mobile terminals has become prevalent. While there are extensive measurement studies of non-mobile live streaming (and in particular P2P live streaming) and video-on-demand (both mobile and non-mobile), user behavior in mobile live streaming systems is yet to be explored. This paper relies on over 4 million access logs collected from the PPTV live streaming system to study the viewing behavior and user activity pattern, with emphasis on the discrepancies that might exist when users access the live streaming system catalog from mobile and non-mobile terminals. We observe high rates of abandoned viewing sessions for mobile users and identify different reasons for that behavior for 3G- and WiFi-based views. We further examine the structure of abandoned sessions due to connection performance issues from the perspectives of time of day and mobile device types. To understand the user pattern, we analyze user activity distribution, user geographical distribution as well as user arrival/departure rates.
Identity Management and Mental Health Discourse in Social Media BIBAFull-Text 315-321
  Umashanthi Pavalanathan; Munmun De Choudhury
Social media is increasingly being adopted in health discourse. We examine the role played by identity in supporting discourse on socially stigmatized conditions. Specifically, we focus on mental health communities on reddit. We investigate the characteristics of mental health discourse manifested through reddit's characteristic 'throwaway' accounts, which are used as proxies of anonymity. For this purpose, we propose affective, cognitive, social, and linguistic style measures, drawing from literature in psychology. We observe that mental health discourse from throwaways is considerably disinhibiting and exhibits increased negativity, cognitive bias and self-attentional focus, and lowered self-esteem. Throwaways also seem to be six times more prevalent as an identity choice on mental health forums, compared to other reddit communities. We discuss the implications of our work in guiding mental health interventions, and in the design of online communities that can better cater to the needs of vulnerable populations. We conclude with thoughts on the role of identity manifestation on social media in behavioral therapy.
"Roles for the Boys?": Mining Cast Lists for Gender and Role Distributions over Time BIBAFull-Text 323-329
  Will Radford; Matthias Gallé
Film and television play an important role in popular culture. However, studies that require watching and annotating video are time-consuming and expensive to run at scale. We use information mined from media database cast lists to explore the evolution of different roles over time. We focus on the gender distribution of those roles and how this changes over time. Finally, we compare real-life census gender distributions to our web-mediated onscreen gender data. We propose that these methodologies are a useful adjunct to traditional analysis, allowing researchers to explore the relationship between online and onscreen gender depictions.
Improving Productivity in Citizen Science through Controlled Intervention BIBAFull-Text 331-337
  Avi Segal; Ya'akov (Kobi) Gal; Robert J. Simpson; Victoria Homsy; Mark Hartswood; Kevin R. Page; Marina Jirotka
The majority of volunteers participating in citizen science projects perform only a few tasks each before leaving the system. We designed an intervention strategy to reduce disengagement in 16 different citizen science projects. Targeted users who had left the system received emails that directly addressed motivational factors that affect their engagement. Results show that participants receiving the emails were significantly more likely to return to productive activity when compared to a control group.
Attention Please! A Hybrid Resource Recommender Mimicking Attention-Interpretation Dynamics BIBAFull-Text 339-345
  Paul Seitlinger; Dominik Kowald; Simone Kopeinik; Ilire Hasani-Mavriqi; Elisabeth Lex; Tobias Ley
Classic resource recommenders like Collaborative Filtering (CF) treat users as being just another entity, neglecting non-linear user-resource dynamics shaping attention and interpretation. In this paper, we propose a novel hybrid recommendation strategy that refines CF by capturing these dynamics. The evaluation results reveal that our approach substantially improves CF and, depending on the dataset, successfully competes with a computationally much more expensive Matrix Factorization variant.
Crowdsourcing the Annotation of Rumourous Conversations in Social Media BIBAFull-Text 347-353
  Arkaitz Zubiaga; Maria Liakata; Rob Procter; Kalina Bontcheva; Peter Tolmie
Social media are frequently rife with rumours, and the study of rumour conversational aspects can provide valuable knowledge about how rumours evolve over time and are discussed by others who support or deny them. In this work, we present a new annotation scheme for capturing rumour-bearing conversational threads, as well as the crowdsourcing methodology used to create high quality, human annotated datasets of rumourous conversations from social media. The rumour annotation scheme is validated through comparison between crowdsourced and reference annotations. We also found that only a third of the tweets in rumourous conversations contribute towards determining the veracity of rumours, which reinforces the need for developing methods to extract the relevant pieces of information automatically.
Viral Misinformation: The Role of Homophily and Polarization BIBFull-Text 355-356
  Alessandro Bessi; Fabio Petroni; Michela Del Vicario; Fabiana Zollo; Aris Anagnostopoulos; Antonio Scala; Guido Caldarelli; Walter Quattrociocchi
Modelling Question Selection Behaviour in Online Communities BIBAFull-Text 357-358
  Grégoire Burel; Paul Mulholland; Yulan He; Harith Alani
The value of online Question Answering (Q&A) communities is driven by the question-answering behaviour of their members. Finding the questions that members are willing to answer is therefore vital to the efficient operation of such communities. In this paper, we aim to identify the parameters that correlate with such behaviours. We train different models and construct effective predictions using various user, question and thread feature sets. We show that answering behaviour can be predicted with a high level of success.
Linked Ethnographic Data: From Theory to Practice BIBAFull-Text 359-360
  Dominic DiFranzo; Marie Joan Kristine Gloria; James Hendler
As Web Science continues to mix methods from the many disciplines that study the web, we must begin to seriously look at mixing and linking data across the Qualitative and Quantitative divide. A large difficulty in this is in modeling and archiving Qualitative data. In this paper, we outline what these difficulties are in detail with a focus on the data practices of Ethnography. We describe how linked data technologies can address these issues. We demonstrate this with a case study in modeling data from audio interviews that were taken in an ethnographic study conducted in our lab. We conclude with a discussion on future work that needs to be done to better equip researchers with these tools and methods.
Social Networking by Proxy: Analysis of Dogster, Catster and Hamsterster BIBAFull-Text 361-362
  Daniel Dünker; Jérôme Kunegis
Online pet social networks provide a unique opportunity to study an online social network in which a single user manages multiple user profiles, i.e. one for each pet they own. These types of multi-profile networks allow us to investigate two questions: (1) What is the relationship between the pet-level and human-level network, and (2) what is the relationship between friendship links and family ties? Concretely, we study the online pet social networks Catster, Dogster and Hamsterster, and show how the networks on the two levels interact, and perform experiments to find out whether knowledge about friendships on a profile-level alone can be used to predict which users are behind which profile.
Web as Corpus Supporting Natural Language Generation for Online River Information Communication BIBAFull-Text 363-364
  Xiwu Han; Antonio A. R. Ioris; Chenghua Lin
Using the web as a corpus has been popular in NLP; we employ the web as a corpus for NLG to make the online communication of tailored river information more effective and efficient. Evaluation and analysis show that our generated texts were comparable to those written by domain experts and experienced users.
Modeling the Evolution of User-generated Content on a Large Video Sharing Platform BIBAFull-Text 365-366
  Rishabh Mehrotra; Prasanta Bhattacharya
Video sharing and entertainment websites have rapidly grown in popularity and now constitute some of the most visited websites on the Internet. Despite the high usage and user engagement, most recent research on online media platforms has restricted itself to networking-based social media sites like Facebook or Twitter. The current study is among the first to perform a large-scale empirical study using longitudinal video upload data from one of the largest online video sites. Unlike previous studies in the online media space that have focused exclusively on demand-side research questions, we model the supply side of the crowd-contributed video ecosystem on this platform. The modeling and subsequent prediction of video uploads is complicated by the heterogeneity of video types (e.g. popular vs. niche video genres) and the inherent time trend effects. We identify distinct genre-clusters from our dataset and employ a self-exciting Hawkes point-process model on each of these clusters to fully specify and estimate the video upload process. Our findings show that, using a relatively parsimonious point-process model, we are able to achieve higher model fit and predict video uploads to the platform with higher accuracy than competing models.
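For readers unfamiliar with self-exciting point processes, the sketch below evaluates the conditional intensity of a univariate Hawkes process, in which each past event temporarily raises the rate of future events. The abstract does not specify the kernel, so the exponential kernel and the parameter names mu (baseline rate), alpha (jump size) and beta (decay rate) are assumptions made for illustration only.

```python
import math


def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity at time t, assuming an exponential kernel.

    lambda(t) = mu + sum over past events t_i < t of
                alpha * exp(-beta * (t - t_i))
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t
    )
    return mu + excitation


# Hypothetical upload times (in days) and parameters.
uploads = [1.0, 1.5, 4.0]
rate_before = hawkes_intensity(0.5, uploads, mu=0.5, alpha=0.8, beta=1.0)
rate_after = hawkes_intensity(1.6, uploads, mu=0.5, alpha=0.8, beta=1.0)
print(rate_before, rate_after)
```

Before any upload the intensity equals the baseline mu; shortly after a burst of uploads it is higher, which is what makes the process self-exciting.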
Remix in 3D Printing: What your Sources say About You BIBAFull-Text 367-368
  Spiros Papadimitriou; Evangelos Papalexakis; Bin Liu; Hui Xiong
Concurrently with the recent, rapid adoption of 3D printing technologies, online sharing of 3D-printable designs is growing equally rapidly, even though it has received far less attention. We study remix relationships on Thingiverse, the dominant online repository and social network for 3D printing. We collected data of designs published over five years, and we find that remix ties exhibit both homophily and inverse-homophily across numerous key metrics, which is stronger compared to other kinds of social and content links. This may have implications on graph prediction tasks, as well as on the design of 3D-printable content repositories.
Using WikiProjects to Measure the Health of Wikipedia BIBAFull-Text 369-370
  Ramine Tinati; Markus Luczak-Roesch; Nigel Shadbolt; Wendy Hall
In this paper we examine WikiProjects, an emergent, community-driven feature of Wikipedia. We analysed 3.2 million Wikipedia articles associated with 618 active Wikipedia projects. The dataset contained the logs of over 115 million article revisions and 15 million talk entries, together representing the activity of 15 million unique Wikipedians. Our analysis revealed that, per WikiProject, the number of article and talk contributions is increasing, as is the number of new Wikipedians contributing to individual WikiProjects. Based on these findings we consider how studying Wikipedia from a sub-community level may provide a means to measure Wikipedia activity.
Self Curation, Social Partitioning, Escaping from Prejudice and Harassment: The Many Dimensions of Lying Online BIBAFull-Text 371-372
  Max Van Kleek; Daniel Smith; Nigel R. Shadbolt; Dave Murray-Rust; Amy Guy
Portraying matters as other than they truly are is an important part of everyday human communication. In this paper, we use a survey to examine ways in which people fabricate, omit or alter the truth online. Many reasons are found, including creative expression, hiding sensitive information, role-playing, and avoiding harassment or discrimination. The results suggest lying is often used for benign purposes, and we conclude that its use may be essential to maintaining a humane online society.

Industrial Track

Constrained Optimization for Homepage Relevance BIBAFull-Text 375-384
  Deepak Agarwal; Shaunak Chatterjee; Yang Yang; Liang Zhang
This paper considers an application of showing promotional widgets to web users on the homepage of a major professional social network site. The types of widgets include address book invitation, group join, friends' skill endorsement and so forth. The objective is to optimize user engagement under certain business constraints. User actions on each widget may have very different downstream utilities, and quantification of such utilities can sometimes be quite difficult. Since there are multiple widgets to rank when a user visits, launching a personalized model to simply optimize user engagement such as clicks is often inappropriate. In this paper we propose a scalable constrained optimization framework to solve this problem. We consider several different types of constraints according to the business needs for this application. We show through both offline experiments and online A/B tests that our optimization framework can lead to significant improvement in user engagement while satisfying the desired set of business objectives.
The World Conversation: Web Page Metadata Generation From Social Sources BIBAFull-Text 385-395
  Omar Alonso; Sushma Bannur; Kartikay Khandelwal; Shankar Kalyanaraman
Over the past couple of years, social networks such as Twitter and Facebook have become the primary source for consuming information on the Internet. One of the main differentiators of this content from traditional information sources available on the Web is the fact that these social networks surface individuals' perspectives. When social media users post and share updates with friends and followers, some of those short fragments of text contain a link and a personal comment about the web page, image or video. We are interested in mining the text around those links for a better understanding of what people are saying about the object they are referring to. The salient keywords captured from the crowd are rich metadata that we can use to augment a web page. This metadata can be used for many applications like ranking signals, query augmentation, indexing, and for organizing and categorizing content. In this paper, we present a technique called social signatures that, given a link to a web page, pulls the most important keywords from the social chatter around it, that is, a high-level representation of the web page from a social media perspective. Our findings indicate that the content of social signatures differs from that of a web page and therefore provides new insights. This difference is more prominent as the number of link shares increases. To showcase our work, we present the results of processing a dataset that contains around 1 billion unique URLs shared on Twitter and Facebook over a two-month period. We also provide data points that shed some light on the dynamics of content sharing in social media.
Got Many Labels?: Deriving Topic Labels from Multiple Sources for Social Media Posts using Crowdsourcing and Ensemble Learning BIBAFull-Text 397-406
  Shuo Chang; Peng Dai; Jilin Chen; Ed H. Chi
Online search and item recommendation systems are often based on being able to correctly label items with topical keywords. Typically, topical labelers analyze the main text associated with the item, but social media posts are often multimedia in nature and contain contents beyond the main text. Topic labeling for social media posts is therefore an important open problem for supporting effective social media search and recommendation. In this work, we present a novel solution to this problem for Google+ posts, in which we integrated a number of different entity extractors and annotators, each responsible for a part of the post (e.g. text body, embedded picture, video, or web link). To account for the varying quality of different annotator outputs, we first utilized crowdsourcing to measure the accuracy of individual entity annotators, and then used supervised machine learning to combine different entity annotators based on their relative accuracy. Evaluating using a ground truth data set, we found that our approach substantially outperforms topic labels obtained from the main text, as well as naive combinations of the individual annotators. By accurately applying topic labels according to their relevance to social media posts, the results enable better search and item recommendation.
Question Classification by Approximating Semantics BIBAFull-Text 407-417
  Guangyu Feng; Kun Xiong; Yang Tang; Anqi Cui; Jing Bai; Hang Li; Qiang Yang; Ming Li
A central task of computational linguistics is to decide if two pieces of text have similar meanings. Ideally, this depends on an intuitive notion of semantic distance. While this semantic distance is most likely undefinable and uncomputable, in practice it is approximated heuristically, consciously or unconsciously. In this paper, we present a theory, and its implementation, to approximate the elusive semantic distance by the well-defined information distance. It is mathematically proven that any computable approximation of the intuitive concept of semantic distance is "covered" by our theory. We have applied our theory to question answering (QA) and performed experiments based on data extracted from over 35 million question-answer pairs. Experiments demonstrate that our initial implementation of the theory produces convincingly fewer errors in classification compared to other academic models and commercial systems.
Leveraging Careful Microblog Users for Spammer Detection BIBAFull-Text 419-429
  Hao Fu; Xing Xie; Yong Rui
Microblogging websites, e.g. Twitter and Sina Weibo, have become a popular platform for socializing and sharing information in recent years. Spammers have also discovered this new opportunity to unfairly overpower normal users with unsolicited content, namely social spam. While it is intuitive for everyone to follow legitimate users, recent studies show that both legitimate users and spammers follow spammers for different reasons. Evidence of users seeking out spammers on purpose is also observed. We regard this behavior as useful information for spammer detection. In this paper, we approach the problem of spammer detection by leveraging the "carefulness" of users, which indicates how careful a user is when she is about to follow a potential spammer. We propose a framework to measure the carefulness, and develop a supervised learning algorithm to estimate it based on known spammers and legitimate users. We then illustrate how spammer detection can be improved with the aid of the proposed measure. An evaluation on a real dataset with millions of users and an online test are performed on Sina Weibo. The results show that our approach indeed captures the carefulness, and that it is effective at detecting spammers. In addition, we find that the proposed measure is also beneficial for other applications, e.g. link prediction.
Topological Properties and Temporal Dynamics of Place Networks in Urban Environments BIBAFull-Text 431-441
  Anastasios Noulas; Blake Shaw; Renaud Lambiotte; Cecilia Mascolo
Understanding the spatial networks formed by the trajectories of mobile users can be beneficial to applications ranging from epidemiology to local search. Despite the potential for impact in a number of fields, several aspects of human mobility networks remain largely unexplored due to the lack of large-scale data at a fine spatiotemporal resolution. Using a longitudinal dataset from the location-based service Foursquare, we perform an empirical analysis of the topological properties of place networks and note their resemblance to online social networks in terms of heavy-tailed degree distributions, triadic closure mechanisms and the small world property. Unlike social networks however, place networks present a mixture of connectivity trends in terms of assortativity that are surprisingly similar to those of the web graph. We take advantage of additional semantic information to interpret how nodes that take on functional roles such as 'travel hub', or 'food spot' behave in these networks. Finally, motivated by the large volume of new links appearing in place networks over time, we formulate the classic link prediction problem in this new domain. We propose a novel variant of gravity models that brings together three essential elements of inter-place connectivity in urban environments: network-level interactions, human mobility dynamics, and geographic distance. We evaluate this model and find it outperforms a number of baseline predictors and supervised learning algorithms on a task of predicting new links in a sample of one hundred popular cities.
Synonym Discovery for Structured Entities on Heterogeneous Graphs BIBAFull-Text 443-453
  Xiang Ren; Tao Cheng
With the increasing use of entities in serving people's daily information needs, recognizing synonyms -- different ways people refer to the same entity -- has become a crucial task for many entity-leveraging applications. Previous works often take a "literal" view of the entity, i.e., its string name. In this work, we propose adopting a "structured" view of each entity by considering not only its string name, but also other important structured attributes. Unlike existing query log-based methods, we delve deeper to explore sub-queries, and exploit tailed synonyms and tailed web pages for harvesting more synonyms. A general, heterogeneous graph-based data model which encodes our problem insights is designed by capturing three key concepts (synonym candidate, web page and keyword) and different types of interactions between them. We cast the synonym discovery problem into a graph-based ranking problem and demonstrate the existence of a closed-form optimal solution for outputting entity synonym scores. Experiments on several real-life domains demonstrate the effectiveness of our proposed method.
Temporal Multi-View Inconsistency Detection for Network Traffic Analysis BIBAFull-Text 455-465
  Houping Xiao; Jing Gao; Deepak S. Turaga; Long H. Vu; Alain Biem
In this paper, we investigate the problem of identifying inconsistent hosts in large-scale enterprise networks by mining multiple views of temporal data collected from the networks. The time-varying behavior of hosts is typically consistent across multiple views, and thus hosts that exhibit inconsistent behavior are possible anomalous points to be further investigated. To achieve this goal, we develop an effective approach that extracts common patterns hidden in multiple views and detects inconsistency by measuring the deviation from these common patterns. Specifically, we first apply various anomaly detectors on the raw data and form a three-way tensor (host, time, detector) for each view. We then develop a joint probabilistic tensor factorization method to derive the latent tensor subspace, which captures common time-varying behavior across views. Based on the extracted tensor subspace, an inconsistency score is calculated for each host that measures the deviation from common behavior. We demonstrate the effectiveness of the proposed approach on two enterprise-wide network-based anomaly detection tasks. An enterprise network consists of multiple hosts (servers, desktops, laptops), and each host sends/receives a time-varying number of bytes across network protocols (e.g., TCP, UDP, ICMP) or sends URL requests to DNS under various categories. The inconsistent behavior of a host is often a leading indicator of potential issues (e.g., instability, malicious behavior, or hardware malfunction). We perform experiments on real-world data collected from IBM enterprise networks, and demonstrate that the proposed method can find hosts with inconsistent behavior that are important to cybersecurity applications.

PhD Symposium

A Hybrid Approach to Perform Efficient and Effective Query Execution Against Public SPARQL Endpoints BIBAFull-Text 469-473
  Maribel Acosta
Linked Open Data initiatives have fostered the publication of Linked Data sets, as well as the deployment of publicly available SPARQL endpoints as client-server querying infrastructures to access these data sets. However, recent studies reveal that SPARQL endpoints may exhibit significant limitations in supporting real-world applications, and public linked data sets can suffer from quality issues, e.g., data can be incomplete or incorrect. We tackle these problems and propose a novel hybrid architecture that relies on shipping policies to improve the performance of SPARQL endpoints, and exploits human and machine query processing computation to enhance the quality of Linked Data sets. We report on initial empirical results that suggest that the proposed techniques overcome current drawbacks, and may provide a novel solution to make these promising infrastructures available for real-world applications.
A Taxonomy of Crowdsourcing Campaigns BIBAFull-Text 475-479
  Majid Ali AlShehry; Bruce Walker Ferguson
Crowdsourcing serves different needs of different sets of users. Most existing definitions and taxonomies of crowdsourcing address platform purpose while paying little attention to other parameters of this novel social phenomenon. In this paper, we analyze 41 crowdsourcing campaigns on 21 crowdsourcing platforms to derive 9 key parameters of successful crowdsourcing campaigns and introduce a comprehensive taxonomy of crowdsourcing. Using this taxonomy, we identify crowdsourcing trends in two parameters, platform purpose and contributor motivation. The paper highlights important advantages of using this conceptual model in planning crowdsourcing campaigns and concludes with a discussion of emerging challenges to such campaigns.
Discovering Credible Events in Near Real Time from Social Media Streams BIBAFull-Text 481-485
  Cody Buntain
My proposed research addresses fundamental deficiencies in social media-based event detection by discovering high-impact moments and evaluating their credibility rapidly. Results from my preliminary work demonstrate one can discover compelling moments by leveraging machine learning to characterize and detect bursts in keyword usage. Though this early work focused primarily on language-agnostic discovery in sporting events, it also showed promising results in adapting this work to earthquake detection. My dissertation will extend this research by adapting models to other types of high-impact events, exploring events with different temporal granularities, and finding methods to connect contextually related events into timelines. To ensure applicability of this research, I will also port these event discovery algorithms to stream processing platforms and evaluate their performance in the real-time context. To address issues of trust, my dissertation will also include developing algorithms that integrate the vast array of social media features to evaluate information credibility in near real time. Such features include structural signatures of information dissemination, the location from which a social media message was posted relative to the location of the event it describes, and metadata from related multimedia (e.g., pictures and video) shared about the event. My preliminary work also suggests methods that could be applied to social networks for stimulating trustworthy behavior and enhancing information quality. Contributions from my dissertation will primarily be practical algorithms for discovering events from various social media streams and algorithms for evaluating and enhancing the credibility of these events in near real time.
Ontology Search: Finding the Right Ontologies on the Web 487-491
  Anila Sahar Butt
With the recent growth of Linked Data on the Web, there is an increased need for knowledge engineers to find ontologies to describe their data. Only limited work exists that addresses the problem of searching and ranking ontologies based on keyword queries. In this proposal we introduce the main challenges in finding appropriate ontologies, and preliminary solutions to address these challenges. Our evaluation shows that the proposed solution performs significantly better than existing solutions on a benchmark ontology collection for the majority of the sample queries defined in the benchmark.
Make Hay While the Crowd Shines: Towards Efficient Crowdsourcing on the Web 493-497
  Ujwal Gadiraju
Within the scope of this PhD proposal, we set out to investigate two pivotal aspects that influence the effectiveness of crowdsourcing: (i) microtask design, and (ii) worker behavior. Leveraging the dynamics of tasks that are crowdsourced on the one hand, and accounting for the behavior of workers on the other hand, can help in designing tasks efficiently. To help understand the intricacies of microtasks, we identify the need for a taxonomy of typically crowdsourced tasks. Based on an extensive study of 1000 workers on CrowdFlower, we propose a two-level categorization scheme for tasks. We present insights into the task affinity of workers, effort exerted by workers to complete tasks of various types, and their satisfaction with the monetary incentives. We also analyze the prevalent behavior of trustworthy and untrustworthy workers. Next, we propose behavioral metrics that can be used to measure and counter malicious activity in crowdsourced tasks. Finally, we present guidelines for the effective design of crowdsourced surveys and set important precedents for future work.
Mining Scholarly Communication and Interaction on the Social Web 499-503
  Asmelash Teka Hadgu
The explosion of Web 2.0 platforms, including social networking sites such as Twitter, blogs and wikis, affects all web users, scholars included. As a result, there is a need for a comprehensive approach to gain a broader understanding and timely signals of scientific communication as well as how researchers interact on the social web. Most current work in this area either deals with a small number of researchers and relies heavily on manual annotation, or performs large-scale analysis without a deep understanding of the underlying researcher population. In this proposal, we present a holistic approach to solve these problems. This research proposes novel methods to collect, filter, analyze and make sense of scholars and scholarly communication by integrating heterogeneous data sources from fast social media streams as well as the academic web. Applying reproducible research, contributing applications and data sets, the thesis proposal strives to add value by mining the social web for social good.
Modeling Cognitive Processes in Social Tagging to Improve Tag Recommendations 505-509
  Dominik Kowald
With the emergence of Web 2.0, tag recommenders have become important tools, which aim to support users in finding descriptive tags for their bookmarked resources. Although current algorithms provide good results in terms of tag prediction accuracy, they are often designed in a data-driven way and thus, lack a thorough understanding of the cognitive processes that play a role when people assign tags to resources. This thesis aims at modeling these cognitive dynamics in social tagging in order to improve tag recommendations and to better understand the underlying processes. As a first attempt in this direction, we have implemented an interplay between individual micro-level (e.g., categorizing resources or temporal dynamics) and collective macro-level (e.g., imitating other users' tags) processes in the form of a novel tag recommender algorithm. The preliminary results for datasets gathered from BibSonomy, CiteULike and Delicious show that our proposed approach can outperform current state-of-the-art algorithms, such as Collaborative Filtering, FolkRank or Pairwise Interaction Tensor Factorization. We conclude that recommender systems can be improved by incorporating related principles of human cognition.
Social Media as Firm's Network and Its Influence on the Corporate Performance 511-514
  Jeongwoo Oh
Social media, or social network services, have attracted a great deal of interest in applied studies, because they let us see at a glance how people connect, behave, and interact with each other. Individual usage of social media can also be viewed at the corporate level in the market. This paper starts from that interest, trying to verify the applicability of social media in the study of corporate finance. The basic question is whether social media interaction by a firm or a firm's executive can affect the performance of the corresponding firm. In the study of economics and finance, firm-level networks have been studied in different contexts, mainly in connection with economic benefit. However, the online network has not been studied enough regarding its effect on corporate performance. In general, a firm's decision-making process has been regarded as exclusive and confidential, rather than publicly observable, which resulted in a focus on closed networks in person or between related firms. But we observe that many top executives are already active, or even stars, on social media. Therefore, we take a close look at this to determine whether networking on social media is personal activity or corporate behavior. In other words, we are interested in whether the internet-based lifestyle with social media can influence corporate performance in the market or the firm-level decision-making process. We investigate this question by using both social media and market data with firm information. First, we identify the determinants of the social network behavior of the firms' executives. Next, we estimate the value of the social media network for corporate performance, calculating abnormal returns and analyzing their dynamics in the long term. Finally, we also verify the value of the social media network in the context of executives' personal compensation.
We expect that this research will provide new insight into the effect of social networks on corporate performance by adopting online social media as a network variable. It is also expected to broaden the applicability of online network data to academic questions in finance research.
A Hybrid Framework for Online Execution of Linked Data Queries 515-519
  Mohamed M. Sabri
Linked Data has been widely adopted over the last few years, with the size of the Linked Data cloud almost doubling every year. However, there is still no well-defined, efficient mechanism for querying such a Web of Data. We propose a framework that incorporates a set of optimizations to tackle various limitations in the state-of-the-art. The framework aims at combining the centralized query optimization capabilities of the data warehouse-based approaches with the result freshness and explorative data source discovery capabilities of link-traversal approaches. This is achieved by augmenting base-line link-traversal query execution with a set of optimization techniques. The proposed optimizations fall under two categories: metadata-based optimizations and semantics-based optimizations.

AW4CITY 2015

Comparing Smart Cities with different modeling approaches 525-528
  Leonidas G. Anthopoulos; Marijn Janssen; Vishanth Weerakkody
Smart cities have attracted an extensive and increasing interest from both science and industry with an increasing number of international examples emerging from across the world. However, despite the significant role that smart cities can play to deal with recent urban challenges, the concept has been criticized for being influenced by vendor hype. There are various attempts to conceptualize smart cities and various benchmarking methods have been developed to evaluate their impact. In this paper the modelling and benchmarking approaches are systematically compared. There are six common dimensions among the approaches, namely people, government, economy, mobility, environment and living. This paper utilizes existing smart city analysis models in order to review three representative smart city cases and useful outcomes are extrapolated from this comparison.
Understanding Smart City Business Models: A Comparison 529-534
  Leonidas G. Anthopoulos; Panos Fitsilis
Smart cities have attracted international scientific and business attention, and a niche market is evolving that engages almost all business sectors. In their attempt to empower and promote urban competitive advantages, local governments have approached the smart city context and they target inhabitants, visitors and investments. However, engaging the smart city context is not free of charge, and corresponding investments are extensive and of high risk without the appropriate management. Moreover, investing in the smart city domain does not secure corresponding mission success, and both governments and vendors require more effective instruments. This paper performs an investigation on the smart city business models and is a work in progress. Modeling can illustrate where corresponding profit comes from and how it flows, while a significant business model portfolio is eligible for smart city stakeholders.
An Urban Fault Reporting and Management Platform for Smart Cities 535-540
  Sergio Consoli; Diego Reforgiato Recupero; Misael Mongiovi; Valentina Presutti; Gianni Cataldi; Wladimiro Patatu
A good interaction between public administrations and citizens is imperative in modern smart cities. Semantic web technologies can aid in achieving such a goal. We present a smart urban fault reporting web platform to help citizens in reporting common urban problems, such as street faults, potholes or broken street lights, and to support the local public administration in responding and fixing those problems quickly. The tool is based on a semantic data model designed for the city, which integrates several distinct data sources, opportunely re-engineered to meet the principles of the Semantic Web and linked open data. The platform supports the whole process of road maintenance, from the fault reporting to the management of maintenance activities. The integration of multiple data sources enables increasing interoperability and heterogeneous information retrieval, thus favoring the development of effective smart urban fault reporting services. Our platform was evaluated in a real case study: a complete urban reporting and road maintenance system has been developed for the municipality of Catania. Our approach is completely generalizable and can be adopted by and customized for other cities. The final goal is to stimulate smart maintenance services in the "cities of the future".
Supporting the Development of Smart Cities using a Use Case Methodology 541-545
  Marion Gottschalk; Mathias Uslar
Urbanization grows steadily, i.e., more people live in one place and rural areas become less popular. Urbanization poses challenges for city planning and development. Cities have to deal with large crowds, high energy consumption, large quantities of garbage, etc. Thus, smart cities have to meet many requirements from different areas. Hence, realizing smart cities can be supported by linking different smart areas, such as smart grids and smart homes, into one large area. The linking is done by information and communication technologies, which are supported through a clear definition of functionalities and interfaces. Smart cities and further smart areas are still under development, so it is difficult to give an overview of their functionalities yet. Therefore, two approaches, the use case methodology and integration profiles, are introduced in this work; both are also realized in a web-based application.
Innovative IoT-aware Services for a Smart Museum 547-550
  Vincenzo Mighali; Giuseppe Del Fiore; Luigi Patrono; Luca Mainetti; Stefano Alletto; Giuseppe Serra; Rita Cucchiara
Smart cities are a trending topic in both the academic literature and the industrial world. The capability to provide users with added-value services through low-power and low-cost smart objects is very attractive in many fields. Among these, art and culture represent very interesting examples, as tourism is one of the main driving engines of modern society. In this paper, we propose an IoT-aware architecture to improve the cultural experience of the user by involving the most important recent innovations in the ICT field. The main components of the proposed architecture are: (i) an indoor localization service based on the Bluetooth Low Energy technology, (ii) a wearable device able to capture and process images related to the user's point of view, (iii) the user's mobile device, used to display customized cultural contents and to share multimedia data in the Cloud, and (iv) a processing center that manages the core of the whole business logic. In particular, it interacts with both wearable and mobile devices, and communicates with the outside world to retrieve contents from the Cloud and to provide services also to external users. The proposal is currently under development and it will be validated in the MUST museum in Lecce.
Design of Interactional End-to-End Web Applications for Smart Cities 551-556
  Erich Ortner; Marco Mevius; Peter Wiedmann; Florian Kurz
Nowadays, the number of flexible and fast human-to-application-system interactions is dramatically increasing. For instance, citizens interact with the help of the internet to organize surveys or meetings (in real time) spontaneously. These interactions are supported by technologies and application systems such as free wireless networks and web or mobile apps. Smart Cities aim at enabling their citizens to use these digital services, e.g., by providing enhanced networks and application infrastructures maintained by the public administration. However, looking beyond technology, there is still a significant lack of interaction and support between "normal" citizens and the public administration. For instance, democratic decision processes (e.g., how to allocate public disposable budgets) are often discussed by the public administration without citizen involvement. This paper introduces an approach that describes the design of enhanced interactional web applications for Smart Cities based on dialogical logic process patterns. We demonstrate the approach with the help of a budgeting scenario and close with a summary and an outlook on further research.
Smart Cities Governance Informatability?: Let's First Understand the Atoms 557-562
  Alois Paulin
In this paper we search for and analyze the atomic components of general governance systems and discuss whether or not they can be informated, i.e. tangibly represented within the digital realm of information systems. We draw a framework based on the theories of Downs, Jellinek, and Hohfeld and find that the therein identified atomic components cannot be informated directly, but only indirectly, due to the inherent complexity of governance. We outline pending research questions to be addressed in the future.
Towards Personalized Smart City Guide Services in Future Internet Environments 563-568
  Robert Seeliger; Christopher Krauss; Annette Wilson; Miggi Zwicklbauer; Stefan Arbanowski
The FI-CONTENT project aims at establishing the foundation of a European infrastructure for developing and testing novel smart city services. The Smart City Services Platform will develop enabling technology for SMEs and developers to create services that offer residents and visitors smart services that enhance their city visit or daily life. We have made use of generic, specific and common enablers to develop a reference implementation, the Smart City Guide web app. The basic information is provided by the Open City Database, an open source specific enabler that can be used for any city in Europe. Recommendation as a Service is an enabler that can be applied to many use cases; here we describe how we integrated it into the Smart City Guide. The use cases will be improved and upgraded in regular iterative cycles based on feedback gained in lab and field trials at the experimentation sites. As the app is transferable to any city, it will be tested at a number of experimentation sites.
A Universal Design Infrastructure for Multimodal Presentation of Materials in STEM Programs: Universal Design 569-574
  Leyla Zhuhadar; Bryan Carson; Jerry Daday; Olfa Nasraoui
We describe a proposed universal design infrastructure that aims at promoting better opportunities for students with disabilities in STEM programs to understand multimedia teaching material. The Accessible Educational STEM Videos Project aims to transform learning and teaching for students with disabilities through integrating synchronized captioned educational videos into undergraduate and graduate STEM disciplines. This Universal Video Captioning (UVC) platform will serve as a repository for uploading videos and scripts. The proposed infrastructure is a web-based platform that uses the latest WebDAV technology (Web-based Distributed Authoring and Versioning) to identify resources, users, and content. It consists of three layers: (i) an administrative management system; (ii) a faculty/staff user interface; and (iii) a transcriber user interface. We anticipate that by enriching it with captions or transcripts, the multimodal presentation of materials promises to help students with disabilities in STEM programs master the subject better and increase retention.

BigScholar 2015

The Knowledge Web Meets Big Scholars 577-578
  Kuansan Wang
Humans are the only species on earth that has mastered the technologies of writing and printing to capture ephemeral thoughts and scientific discoveries. The capabilities to pass along knowledge, not only geographically but also generationally, have formed the bedrock of our civilizations. We are in the midst of a silent revolution driven by technological advancements: no longer are computers just a fixture of our physical world; they have been so deeply woven into our daily routines that they now occupy the center of our lives. Nowhere is the phenomenon more prominent than in our reliance on the World Wide Web. More and more often, the web has become the primary source of fresh information and knowledge. In addition to general consumption, the availability of large amounts of content and behavioral data has also instigated new interdisciplinary research activities in the areas of information retrieval, natural language processing, machine learning, behavioral studies, social computing and data mining. This talk will use web search as an example to demonstrate how these new research activities and technologies have helped the web evolve from a collection of documents into the largest knowledge base in our history. During this evolution, the web is transformed from merely reacting to our needs to a living entity that can anticipate and push timely information to wherever and whenever we need it. How scholarly activities and communications can be impacted will also be illustrated and elaborated, and some observations derived from a web-scale data set, newly released to the public, will also be shared.
AVER: Random Walk Based Academic Venue Recommendation 579-584
  Zhen Chen; Feng Xia; Huizhen Jiang; Haifeng Liu; Jun Zhang
Academic venues act as the main platform of communities in academia and as a bridge connecting researchers, and they have developed rapidly in recent years. However, information overload in big scholarly data creates tremendous challenges for mining useful and effective information in order to recommend high-quality and fruitful academic venues to researchers, thereby enabling them to participate in relevant academic conferences as well as contribute to important and influential journals. In this work, we propose AVER, a novel random walk based Academic VEnue Recommendation model. AVER runs a random walk with restart model on a co-publication network which contains two kinds of associations, coauthor relations and author-venue relations. Moreover, we define a transfer matrix with bias to drive the random walk by exploiting three academic factors: co-publication frequency, weight of relations and researchers' academic level. AVER is inspired by the fact that researchers are more likely to contact those who have high co-publication frequency and similar academic levels. Additionally, in AVER, we consider the difference of weights between the two kinds of associations. We conduct extensive experiments on the DBLP data set in order to evaluate the performance of AVER. The results demonstrate that, in comparison to relevant baseline approaches, AVER performs better in terms of precision, recall and F1.
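As an illustration of the random walk with restart that the abstract names, the following is a minimal sketch on a toy co-publication network. It is not the AVER model itself: the node names, edge weights, restart probability and the plain weight-normalized transition matrix are all assumptions made for the example, whereas AVER biases its transfer matrix with co-publication frequency, relation weights and academic level.

```python
def build_transition(nodes, weights):
    """Row-normalize symmetric edge weights into a transition matrix."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        matrix[index[u]][index[v]] += w
        matrix[index[v]][index[u]] += w
    for row in matrix:
        total = sum(row)
        if total > 0:
            for j in range(n):
                row[j] /= total
    return matrix, index


def random_walk_with_restart(transition, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * p * T + restart * e_seed until it stabilizes."""
    n = len(transition)
    scores = [0.0] * n
    scores[seed] = 1.0
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * scores[i] * transition[i][j]
        nxt[seed] += restart
        if sum(abs(a - b) for a, b in zip(nxt, scores)) < tol:
            return nxt
        scores = nxt
    return scores


# Hypothetical toy network: two authors (A, B) and two venues (V1, V2).
nodes = ["A", "B", "V1", "V2"]
weights = {
    ("A", "B"): 2.0,   # coauthor relation
    ("A", "V1"): 3.0,  # author-venue relations
    ("B", "V1"): 1.0,
    ("B", "V2"): 4.0,
}
transition, index = build_transition(nodes, weights)
scores = random_walk_with_restart(transition, seed=index["A"])

# Rank venues by their proximity to researcher A.
ranked_venues = sorted(["V1", "V2"], key=lambda v: scores[index[v]], reverse=True)
print(ranked_venues)
```

In this sketch the walk restarts at the target researcher, so the scores measure proximity to that researcher; venues can then be ranked by score.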
Discovering the Rise and Fall of Software Engineering Ideas from Scholarly Publication Data 585-590
  Subhajit Datta; Santonu Sarkar; A. S. M. Sajeev; Nishant Kumar
For researchers and practitioners of a relatively young discipline like software engineering, an enduring concern is to identify the acorns that will grow into oaks -- ideas remaining most current in the long run. Additionally, it is interesting to know how the ideas have risen in importance, and fallen, perhaps to rise again. We analyzed a corpus of 19,000+ papers written by 21,000+ authors across 16 software engineering publication venues from 1975 to 2010, to empirically determine the half-life of software engineering research topics. We adapted existing measures of half-life as well as defined a specific measure based on publication and citation counts. The results from this empirical study are presented in this paper.
Science Navigation Map: an Interactive Data Mining Tool for Literature Analysis 591-596
  Yu Liu; Zhen Huang; Yizhou Yan; Yufeng Chen
With the advances of all research fields and web 2.0, scientific literature has become widely available in digital libraries, citation databases, and social media. Its new properties, such as large volume, wide exhibition, and the complicated citation relationships among papers, bring challenges to the management, analysis and exploration of knowledge in scientific literature. In addition, although data mining techniques have been applied to scientific literature analysis tasks, they typically require expert input and guidance, and return static results to users after processing, which makes them inflexible and not smart. Therefore, there is a need for a tool that reflects article-level metrics and combines human users and computer systems for analyzing and exploring knowledge in scientific literature, as well as discovering and visualizing underlying interesting research topics. We design an online tool for literature navigation, filtering, and interactive data mining, named Science Navigation Map (SNM), which integrates information from online paper repositories, citation databases, etc. SNM provides visualization of article-level metrics and interactive data mining, which takes advantage of effective interaction between human users and computer systems to explore and extract knowledge from scientific literature and discover underlying interesting research topics. We also propose a multi-view non-negative matrix factorization and apply it to SNM as an interactive data mining tool, which can make better use of complicated multi-wise relationships among papers. In experiments, we visualize all the papers published in the journal PLOS Biology from 2003 to 2012 in the navigation map and explore six relationships among papers for data mining. From this map, one can easily filter, analyze and explore knowledge of the papers in an interactive way.
Big Scholarly Data in CiteSeerX: Information Extraction from the Web 597-602
  Alexander G. Ororbia II; Jian Wu; Madian Khabsa; Kyle Williams; Clyde Lee Giles
We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the world wide web. From the perspective of automatic information extraction and modes of alternative search, we examine various functional aspects of this complex system with an eye towards ongoing and future research developments.
Using Reference Groups to Assess Academic Productivity in Computer Science 603-608
  Sabir Ribas; Berthier Ribeiro-Neto; Edmundo de Souza e Silva; Alberto Hideki Ueda; Nivio Ziviani
In this paper we discuss the problem of how to assess academic productivity based on publication outputs. We are interested in knowing how well a research group in an area of knowledge is doing relative to a pre-selected set of reference groups, where each group is composed of academics or researchers. To assess academic productivity we propose a new metric, which we call P-score. We use P-score, citation counts and H-Index to obtain rankings of researchers in Brazil. Experimental results using data from the area of Computer Science show that P-score outperforms citation counts and H-Index when assessed against the official ranking produced by the Brazilian National Research Council (CNPq). This is of interest for two reasons. First, it suggests that citation-based metrics, despite wide adoption, can be improved upon. Second, contrary to citation-based metrics, the P-score metric does not require access to the content of publications to be computed.
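For reference, the two citation-based baselines named in the abstract can be computed as in the short sketch below. It covers only citation counts and the standard H-Index; the P-score is not defined in the abstract and is not reproduced here. The researcher names and citation lists are hypothetical.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-paper citation counts for three researchers.
researchers = {
    "r1": [10, 8, 5, 4, 3],
    "r2": [25, 8, 5, 3, 3],
    "r3": [2, 2, 2],
}

# Two rankings: one by H-Index, one by total citation count.
by_h_index = sorted(researchers, key=lambda r: h_index(researchers[r]), reverse=True)
by_citations = sorted(researchers, key=lambda r: sum(researchers[r]), reverse=True)
print(by_h_index, by_citations)
```

The toy data shows that the two baselines can disagree: the researcher with one highly cited paper leads on total citations but not on H-Index.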
Modeling and Analysis of Scholar Mobility on Scientific Landscape 609-614
  Qiu Fang Ying; Srinivasan Venkatramanan; Dah Ming Chiu
Scientific literature to date can be thought of as a partially revealed landscape, where scholars continue to unveil hidden knowledge by exploring novel research topics. How do scholars explore the scientific landscape, i.e., choose research topics to work on? We propose an agent-based model of topic mobility behavior where scholars migrate across research topics on the space of science following different strategies, seeking different utilities. We use this model to study whether strategies widely used in the current scientific community can provide a balance between individual scientific success and the efficiency and diversity of the whole academic society. Through extensive simulations, we provide insights into the roles of different strategies, such as choosing topics according to research potential or popularity. Our model provides a conceptual framework and a computational approach to analyze scholars' behavior and its impact on scientific production. We also discuss how such an agent-based modeling approach can be integrated with big real-world scholarly data.

DAEN 2015

The Dynamics of Micro-Task Crowdsourcing: The Case of Amazon MTurk 617
  Djellel Eddine Difallah; Michele Catasta; Gianluca Demartini; Panagiotis G. Ipeirotis; Philippe Cudré-Mauroux
Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.
   In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, tasks, and platform). (B) We leverage the main findings of our five-year log analysis to propose features used in a predictive model aiming at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and how recent the batch is are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by the requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.
Co-evolutionary Dynamics of Information Diffusion and Network Structure 619-620
  Mehrdad Farajtabar; Manuel Gomez-Rodriguez; Yichen Wang; Shuang Li; Hongyuan Zha; Le Song
Information diffusion in online social networks is obviously affected by the underlying network topology, but it also has the power to change that topology. Online users are constantly creating new links when exposed to new information sources, and in turn these links are altering the route of information spread. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics. In this project, we propose a probabilistic generative model, COEVOLVE, for the joint dynamics of these two processes, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate diffusion and network events from the co-evolutionary dynamics, and generate traces obeying common diffusion and network patterns observed in real-world networks. Furthermore, we also develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.
Microscopic Description and Prediction of Information Diffusion in Social Media: Quantifying the Impact of Topical Interests 621-622
  Przemyslaw Grabowicz; Niloy Ganguly; Krishna Gummadi
A number of recent studies of information diffusion in social media, both empirical and theoretical, have been inspired by viral propagation models derived from epidemiology. These studies model propagation of memes, i.e., pieces of information, between users in a social network similarly to the way diseases spread in human society. Naturally, many of these studies emphasize social exposure, i.e., the number of friends or acquaintances of a user that have exposed a meme to her, as the primary metric for understanding, predicting, and controlling information diffusion. Intuitively, one would expect a meme to spread in a social network selectively, i.e., amongst the people who are interested in the meme. However, the importance of the alignment between the topicality of a meme and the topical interests of the potential adopters and influencers in the network has been less explored in the literature. In this paper, we quantify the impact of the topical alignment between memes and users on their adoption. Our analysis, using empirical data about two different types of memes, i.e., hashtags and URLs spreading through the Twitter social media platform, finds that topical alignment between memes and users is as crucial as the social exposure in understanding and predicting meme adoptions. Our results emphasize the need to look beyond social network-based viral propagation models and develop microscopic models of information diffusion that account for interests of users and topicality of information.
Scalable Methods for Adaptively Seeding a Social Network BIBAFull-Text 623-624
  Thibaut Horel; Yaron Singer
In many applications of influence maximization, one is restricted to selecting influencers from a set of users who engaged with the topic being promoted, and due to the structure of social networks, these users often rank low in terms of their influence potential. To alleviate this issue, one can consider an adaptive method which selects users in a manner which targets their influential neighbors. The advantage of such an approach is that it leverages the friendship paradox in social networks: while users are often not influential, they often know someone who is. Despite the various complexities in such optimization problems, we show that scalable adaptive seeding is achievable. To show the effectiveness of our methods, we collected data from various verticals that social network users follow and applied our methods to it. Our experiments show that adaptive seeding is scalable, and that it obtains dramatic improvements over standard approaches of information dissemination.
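The friendship paradox mentioned in the abstract above can be illustrated with a minimal sketch. The toy network and the function name below are hypothetical and are not taken from the paper; the code only compares the average degree of users with the average degree of their neighbors, and it is not the authors' adaptive seeding algorithm.

```python
from statistics import mean

def friendship_paradox(adjacency):
    """Return (mean degree, mean over users of their neighbors' mean degree)."""
    degree = {user: len(nbrs) for user, nbrs in adjacency.items()}
    mean_degree = mean(degree.values())
    # For each user with at least one neighbor, average the neighbors' degrees.
    mean_neighbor_degree = mean(
        mean(degree[v] for v in nbrs)
        for nbrs in adjacency.values() if nbrs
    )
    return mean_degree, mean_neighbor_degree

# Hypothetical undirected toy network: one hub and four low-degree users.
toy = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"hub", "c"},
}

avg_degree, avg_neighbor_degree = friendship_paradox(toy)
print(avg_degree, avg_neighbor_degree)
```

In this toy network the neighbors of a typical user have a higher average degree than the users themselves, which is the intuition behind targeting the neighbors of an initial user set.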
Inferring Graphs from Cascades: A Sparse Recovery Framework BIBAFull-Text 625-626
  Jean Pouget-Abadie; Thibaut Horel
In the Graph Inference problem, one seeks to recover the edges of an unknown graph from the observations of cascades propagating over this graph. We approach this problem from the sparse recovery perspective. We introduce a general model of cascades, including the voter model and the independent cascade model, for which we provide the first algorithm which recovers the graph's edges with high probability and O(s log m) measurements where s is the maximum degree of the graph and m is the number of nodes. Furthermore, we show that our algorithm also recovers the edge weights (the parameters of the diffusion process) and is robust in the context of approximate sparsity. Finally we validate our approach empirically on synthetic graphs.

KET 2015

A Hitchhiker's Guide to Ontology BIBAFull-Text 629
  Fabian M. Suchanek
In this talk, I will present our recent work in the area of knowledge bases. It covers 4 areas of research around ontologies and knowledge bases: The first area is the construction of the YAGO knowledge base. YAGO is now multilingual, and has grown into a larger project at the Max Planck Institute for Informatics and Télécom ParisTech. The second area is the alignment of knowledge bases. This includes the alignment of classes, instances, and relations across knowledge bases. The third area is rule mining. Our project finds semantic correlations in the form of Horn rules in the knowledge base. I will also talk about watermarking approaches to trace the provenance of ontological data. Finally, I will show applications of the knowledge base for mining news corpora.
Isaac Bloomberg Meets Michael Bloomberg: Better Entity Disambiguation for the News BIBAFull-Text 631-635
  Luka Bradesko; Janez Starc; Stefano Pacifico
This paper shows the implementation and evaluation of the Entity Linking or Named Entity Disambiguation system used and developed at Bloomberg. In particular, we present and evaluate a methodology and a system that do not require the use of Wikipedia as a knowledge base or training corpus. We present how we built features for disambiguation algorithms from the Bloomberg News corpus, and how we employed them for both single-entity and joint-entity disambiguation into a Bloomberg proprietary knowledge base of people and companies. Experimental results show high quality in the disambiguation of the available annotated corpus.
A Two-Iteration Clustering Method to Reveal Unique and Hidden Characteristics of Items Based on Text Reviews BIBAFull-Text 637-642
  Alon Dayan; Osnat Mokryn; Tsvi Kuflik
This paper presents a new method for extracting unique features of items based on their textual reviews. The method is built of two similar iterations of applying a weighting scheme and then clustering the resultant set of vectors. In the first iteration, restaurants of similar food genres are grouped together into clusters. The second iteration reduces the importance of common terms in each such cluster, and highlights those that are unique to each specific restaurant. Clustering the restaurants again, now according to their unique features, reveals very interesting connections between the restaurants.
Knowledge Obtention Combining Information Extraction Techniques with Linked Data BIBAFull-Text 643-648
  Angel Luis Garrido; Pilar Blazquez; Maria G. Buey; Sergio Ilarri
Today, we can find a vast amount of textual information stored in proprietary data stores. The experience of searching information in these systems could be improved in a remarkable manner if we combine these private data stores with the information supplied by the Internet, merging both data sources to get new knowledge. In this paper, we propose an architecture with the goal of automatically obtaining knowledge about entities (e.g., persons, places, organizations, etc.) from a set of natural text documents, building smart data from raw data. We have tested the system in the context of the news archive of a real Media Group.
Topic and Sentiment Unification Maximum Entropy Model for Online Review Analysis BIBAFull-Text 649-654
  Changlin Ma; Meng Wang; Xuewen Chen
Opinion mining is an important research topic in data mining. Many current methods are coarse-grained, which makes them problematic in practice due to insufficient feedback information and limited reference value. To address these problems, a novel topic and sentiment unification maximum entropy LDA model is proposed in this paper for fine-grained opinion mining of online reviews. In this model, a maximum entropy component is first added to the traditional LDA model to distinguish background words, aspect words and opinion words and further realize both the local and global extraction of these words. A sentiment layer is then inserted between a topic layer and a word layer to extend the proposed model to four layers. Sentiment polarity analysis is done based on the extraction of aspect words and opinion words to simultaneously acquire the sentiment polarity of the whole review and each topic, which leads to a fine-grained topic-sentiment abstract. Experimental results demonstrate the validity of the proposed model and theory.
Tree Kernel-based Protein-Protein Interaction Extraction Considering both Modal Verb Phrases and Appositive Dependency Features BIBAFull-Text 655-660
  Changlin Ma; Yong Zhang; Maoyuan Zhang
Protein-protein interaction plays an important role in understanding biological processes. To resolve the parsing errors resulting from modal verb phrases and the noise introduced by appositive dependency, an improved tree kernel-based PPI extraction method is proposed in this paper. Both modal verbs and appositive dependency features are considered to define relevant processing rules which can effectively optimize and expand the shortest dependency path between two proteins in the new method. On the basis of these rules, the optimized and expanded path is used to direct the cutting of the constituent parse tree, which makes the constituent parse tree for protein-protein interaction extraction more precise and concise. The experimental results show that the new method achieves better results on five commonly used corpora.
A Rule-Based Approach to Extracting Relations from Music Tidbits BIBAFull-Text 661-666
  Sergio Oramas; Mohamed Sordo; Luis Espinosa-Anke
This paper presents a rule-based approach to extracting relations from unstructured music text sources. The proposed approach identifies and disambiguates musical entities in text, such as songs, bands, persons, albums and music genres. Candidate relations are then obtained by traversing the dependency parsing tree of each sentence in the text with at least two identified entities. A set of syntactic rules based on part-of-speech tags is defined to filter out spurious and irrelevant relations. The extracted entities and relations are finally represented as a knowledge graph. We test our method on texts from songfacts.com, a website that provides tidbits with facts and stories about songs. The extracted relations are evaluated intrinsically by assessing their linguistic quality, as well as extrinsically by assessing the extent to which they map onto an existing music knowledge base. Our system produces a high percentage of linguistically correct relations between entities, and is able to replicate a significant part of the knowledge base.
An Architecture for Information Extraction from Figures in Digital Libraries BIBAFull-Text 667-672
  Sagnik Ray Choudhury; Clyde Lee Giles
Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data which is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: (1) an extractor for figures and associated metadata (figure captions and mentions) from PDF documents; (2) a search engine over the extracted figures and metadata; (3) an image processing module for automated data extraction from the figures; and (4) a natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, and report an extractor algorithm to extract vector graphics from scholarly documents and a classification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10% and the classification process is very scalable, yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures which is integrated with our search engine to improve user experience.
Semantic Construction Grammar: Bridging the NL / Logic Divide BIBAFull-Text 673-678
  Dave Schneider; Michael J. Witbrock
In this paper, we discuss Semantic Construction Grammar (SCG), a system developed over the past several years to facilitate translation between natural language and logical representations. Crucially, SCG is designed to support a variety of different methods of representation, ranging from those that are fairly close to the NL structure (e.g. so-called "logical forms"), to those that are quite different from the NL structure, with higher-order and high-arity relations. Semantic constraints and checks on representations are integral to the process of NL understanding with SCG, and are easily carried out due to the SCG's integration with Cyc's Knowledge Base and inference engine [1], [2].
Linking Stanford Typed Dependencies to Support Text Analytics BIBAFull-Text 679-684
  Fouad Zablith; Ibrahim H. Osman
With the daily increase of the amount of published information, research in the area of text analytics is gaining more visibility. Text processing for improving analytics is being studied from different angles. In the literature, text dependencies have been employed to perform various tasks. This includes for example the identification of semantic relations and sentiment analysis. We observe that while text dependencies can boost text analytics, managing and preserving such dependencies in text documents that spread across various corpora and contexts is a challenging task. We present in this paper our work on linking text dependencies using the Resource Description Framework (RDF) specification, following the Stanford typed dependencies representation. We contribute to the field by providing analysts the means to query, extract, and reuse text dependencies for analytical purposes. We highlight how this additional layer can be used in the context of feedback analysis by applying a selection of queries passed to a triple-store containing the generated text dependencies graphs.

LiLE 2015

LRMI, Learning Resource Metadata on the Web BIBAFull-Text 687
  Phil Barker; Lorna M. Campbell
The Learning Resource Metadata Initiative (LRMI) is a collaborative initiative that aims to make it easier for teachers and learners to find educational materials through major search engines and specialized resource discovery services. The approach taken by LRMI is to extend the schema.org ontology so that educationally significant characteristics and relationships can be expressed. This, of course, builds on a long history developing metadata standards for learning resources. The context for LRMI, however, is different to these in several respects. LRMI builds on schema.org, and schema.org is designed as a means for marking up web pages to make them more intelligible to search engines; the aim is for it to be present in a significant proportion of pages on the web, that is, implemented at scale not just by metadata professionals. LRMI may have applications that go beyond the core aims of schema.org: it is possible to create LRMI metadata that is independent of a web page for example as JSON-LD records or as EPUB3 metadata.
   The approach of extending schema.org has several advantages, starting with the ability to focus on how best to describe the educational characteristics of resources while others focus on other specialist aspects of the resource description. It also means that LRMI benefits from all the effort that goes into developing tools and community resources for schema.org. There are still some challenges for LRMI; a particularly pertinent one is that of describing educational frameworks (e.g. common curricula or educational levels) to which the learning resources align. LRMI has developed the means for expressing an alignment statement such as "this resource is useful for teaching subject X" but we need more work on how to refer to the subject in that statement. This is a challenge that conventional linked data for education could address.
TinCan2PROV: Exposing Interoperable Provenance of Learning Processes through Experience API Logs BIBAFull-Text 689-694
  Tom De Nies; Frank Salliau; Ruben Verborgh; Erik Mannens; Rik Van de Walle
A popular way to log learning processes is by using the Experience API (abbreviated as xAPI), also referred to as Tin Can. While Tin Can is great for developers who need to log learning experiences in their applications, it is more challenging for data processors to interconnect and analyze the resulting data. An interoperable data model is missing to raise Tin Can to its full potential. We argue that in essence, these learning process logs are provenance. Therefore, the W3C PROV model can provide the much-needed interoperability. In this paper, we introduce a method to expose PROV using Tin Can statements. To achieve this, we made the following contributions: (1) a formal ontology of the xAPI vocabulary, (2) a context document to interpret xAPI statements as JSON-LD, (3) a mapping to convert xAPI JSON-LD statements into PROV, and (4) a tool implementing this mapping.
   We preliminarily evaluate the approach by converting 20 xAPI statements taken from the public Tin Can Learning Record Store to valid PROV. Where the conversion succeeded, it did so without loss of valid information, therefore suggesting that the conversion process is reversible, as long as the original JSON is valid.
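As a rough illustration of the idea of reading an xAPI statement as provenance, the sketch below turns the actor, verb and object of a statement into PROV-style triples. The function name, the `urn:example:` identifier scheme, the `ex:verb` property and the sample statement are all hypothetical, and the mapping shown is an assumption made for illustration; the actual TinCan2PROV mapping, ontology and JSON-LD context described in the paper may differ.

```python
def xapi_to_prov(statement):
    """Map one xAPI statement (as a dict) to a list of PROV-style triples.

    Illustrative assumption: the actor becomes a prov:Agent, the object a
    prov:Entity, and the statement itself a prov:Activity linking the two.
    """
    actor = statement["actor"]["mbox"]
    verb = statement["verb"]["id"]
    obj = statement["object"]["id"]
    # Hypothetical identifier scheme for the activity node.
    activity = "urn:example:activity:" + statement["id"]
    triples = [
        (actor, "rdf:type", "prov:Agent"),
        (obj, "rdf:type", "prov:Entity"),
        (activity, "rdf:type", "prov:Activity"),
        (activity, "prov:wasAssociatedWith", actor),
        (activity, "prov:used", obj),
        (activity, "ex:verb", verb),  # ex:verb is a made-up property
    ]
    if "timestamp" in statement:
        triples.append((activity, "prov:endedAtTime", statement["timestamp"]))
    return triples

# Hypothetical sample statement using the core xAPI fields.
sample = {
    "id": "12345678-1234-5678-1234-567812345678",
    "actor": {"mbox": "mailto:learner@example.org"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/completed"},
    "object": {"id": "http://example.org/activities/quiz-1"},
    "timestamp": "2015-05-18T10:00:00Z",
}

for triple in xapi_to_prov(sample):
    print(triple)
```

Keeping every field of the statement reachable from the activity node is what would make such a conversion reversible in principle, which is the property the evaluation above points to.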
ECOLE: Student Knowledge Assessment in the Education Process BIBAFull-Text 695-700
  Dmitry Mouromtsev; Fedor Kozlov; Liubov Kovriguina; Olga Parkhimovich
The paper concerns the estimation of students' knowledge based on their learning results in the ECOLE system. ECOLE is an online eLearning system whose functionality is based on several ontologies. The system allows terms from different courses and domains to be interlinked and calculates several educational rates: term knowledge rate, total knowledge rate, domain knowledge rate and term significance rate. All of these rates are used to give the student recommendations about the activities he or she has to undertake to pass a course successfully.
Linking a Community Platform to the Linked Open Data Cloud BIBAFull-Text 701-703
  Enayat Rajabi; Ivana Marenzi
Linked Data promises access to a vast amount of resources for learners and teachers. Various research projects have focused on providing educational resources as Linked Data. In many of these projects the focus has been on interoperability of metadata and on linking them into the linked data cloud. In this paper we focus on the community aspect. We start from the observation that sharing data is most valuable within communities of practice with common interests and goals, and community members are interested in suitable resources to be used in specific learning scenarios. The community of practice we are focusing on is an English language teaching and learning community, which we have been supporting through the LearnWeb2.0 platform for the last two years. We analyse the requirements of this specific community as a basis to enrich the current collected materials with open educational resources taken from the Linked Data Cloud. To this aim, we performed an interlinking approach in order to enrich the learning resources exposed as RDF (Resource Description Framework) in the LearnWeb2.0 platform with additional information taken from the Web.
Towards Analysing the Scope and Coverage of Educational Linked Data on the Web BIBAFull-Text 705-710
  Davide Taibi; Giovanni Fulantelli; Stefan Dietze; Besnik Fetahu
The diversity of datasets published according to Linked Data (LD) principles has increased in the last few years and also led to the emergence of a wide range of data suitable in educational settings. However, sufficient insights into the state, coverage and scope of available educational Linked Data seem to be missing, for instance, about represented resource types or domains and topics. In this work, we analyse the scope and coverage of educational linked data on the Web, identifying the most popular resource types and topics, apparent gaps and underlining the strong correlation of resource types and topics. Our results indicate a prevalent bias towards data in areas such as the life sciences as well as computing-related topics.
Interconnecting and Enriching Higher Education Programs using Linked Data BIBAFull-Text 711-716
  Fouad Zablith
Online environments are increasingly used as platforms to support and enhance learning experiences. In higher education, students enroll in programs that are usually formed of a set of courses and modules. Such courses are designed to cover a set of concepts and achieve specific learning objectives that count towards the related degree. However, we observe that connections among courses and the way they conceptually interlink are hard to exploit. This is to be expected, as courses are traditionally described using text in the form of documents such as syllabi and course catalogs. We believe that linked data can be used to create a conceptual layer around higher education programs to interlink courses in a granular and reusable manner. We present in this paper our work on creating a semantic linked data layer to conceptually connect courses taught in a higher education program. We highlight the linked data model we created to be collaboratively extended by course instructors and students using a Semantic MediaWiki platform. We also present two applications that we built on top of the data to (1) showcase how learning material can now float around courses through their interlinked concepts in eLearning environments (we use Moodle as a proof of concept); and (2) support the process of higher education program reviews.

LIME 2015

From Script Idea to TV Rerun: The Idea of Linked Production Data in the Media Value Chain BIBAFull-Text 719-720
  Harald Sack
During the production of a film or TV program, a significant amount of metadata is created and, most of the time, lost again. As a consequence, most of this valuable information has to be recreated at considerable cost in subsequent steps of media production, distribution, and archival. Moreover, there is no commonly used metadata exchange format throughout all steps of the media value chain. Furthermore, technical systems and software applications used in the media production process often have proprietary interfaces for data exchange. In the course of the D-Werft project funded by the German government, metadata exchange through all steps of the media value chain is to be fostered by the application of Linked Data principles. Starting with the idea for a script, metadata from existing systems and applications will be mapped to ontologies to be reused in subsequent production steps. Also for distribution and archival, metadata collected during the production process is a valuable asset to be reused for semantic and exploratory search as well as for intelligent movie recommendation and customized advertising.
Enabling access to Linked Media with SPARQL-MM BIBAFull-Text 721-726
  Thomas Kurz; Kai Schlegel; Harald Kosch
The amount of audio, video and image data on the web is growing immensely, which leads to data management problems based on the hidden character of multimedia. Therefore the interlinking of semantic concepts and media data with the aim to bridge the gap between the document web and the Web of Data has become a common practice and is known as Linked Media. However, the value of connecting media to its semantic metadata is limited by the lack of access methods specialized for media assets and fragments, as well as by the variety of description models in use. With SPARQL-MM we extend SPARQL, the standard query language for the Semantic Web, with media-specific concepts and functions to unify the access to Linked Media. In this paper we describe the motivation for SPARQL-MM, present the state of the art of Linked Media description formats and multimedia query languages, and outline the specification and implementation of the SPARQL-MM function set.
Defining and Evaluating Video Hyperlinking for Navigating Multimedia Archives BIBAFull-Text 727-732
  Roeland J. F. Ordelman; Maria Eskevich; Robin Aly; Benoit Huet; Gareth Jones
Multimedia hyperlinking is an emerging research topic in the context of digital libraries and (cultural heritage) archives. We have been studying the concept of video-to-video hyperlinking from a video search perspective in the context of the MediaEval evaluation benchmark for several years. Our task considers a use case of exploring large quantities of video content via an automatically created hyperlink structure at the media fragment level. In this paper we report on our findings, examine the features of the definition of video hyperlinking based on results, and discuss lessons learned with respect to evaluation of hyperlinking in real-life use scenarios.
The TIB|AV Portal as a Future Linked Media Ecosystem BIBAFull-Text 733-734
  Paloma Marín Arraiza; Sven Strobel
Various techniques for video analysis, concept mapping, semantic search and metadata management are part of the current features of the TIB|AV Portal as described in this demo. The segment identification and ontology annotation make the portal a good platform to support the Linked Data and Media. Weaving into a machine-readable metadata format will complete this task.
MICO: Towards Contextual Media Analysis BIBAFull-Text 735-736
  Sergio Fernández; Sebastian Schaffert; Thomas Kurz
With the tremendous increase in multimedia content on the Web and in corporate intranets, discovering hidden meaning in raw multimedia is becoming one of the biggest challenges. Analysing multimedia content is still in its infancy and requires expert knowledge, and the few available products are associated with excessive price tags, while still not delivering sufficient quality for many tasks. This makes it hard, especially for small and medium-size enterprises, to make use of this technology. In addition, analysis components typically operate in isolation and do not consider the context (e.g. embedding text) of a media resource. This paper presents how MICO tries to address these problems by providing an open source service platform that allows media to be analysed in context and includes various analysis engines for video, images, audio, text, link structure and metadata.
Automating Annotation of Media with Linked Data Workflows BIBAFull-Text 737-738
  Thomas Wilmering; Kevin Page; György Fazekas; Simon Dixon; Sean Bechhofer
Computational feature extraction provides one means of gathering structured analytic metadata for large media collections. We demonstrate a suite of tools we have developed that automate the process of feature extraction from audio in the Internet Archive. The system constructs an RDF description of the analysis workflow and results which is then reconciled and combined with Linked Data about the recorded performance. This Linked Data and provenance information provides the bridging information necessary to employ analytic output in the generation of structured metadata for the underlying media files, with all data published within the same description framework.

LocWeb 2015

Chatty, Happy, and Smelly Maps BIBAFull-Text 741
  Daniele Quercia
Mapping apps are the greatest game-changer for encouraging people to explore the city. You take your phone out and you know immediately where to go. However, the app also assumes there are only a handful of directions to the destination. It has the power to make that handful of directions the definitive directions to that destination. A few years ago, my research started to focus on understanding how people psychologically experience the city. I used computer science tools to replicate social science experiments at scale, at web scale [4,5]. I became captivated by the beauty and genius of traditional social science experiments done by Jane Jacobs, Stanley Milgram, Kevin Lynch [1,2,3]. The result of that research has been the creation of new maps, maps where one does not only find the shortest path but also the most enjoyable path [6,9].
   We did so by building a new city map weighted for human emotions. On this cartography, one is not only able to see and connect from point A to point B the shortest segments, but one is also able to see the happy path, the beautiful path, the quiet path. In tests, participants found the happy, the beautiful, the quiet paths far more enjoyable than the shortest one, and that just by adding a few minutes to travel time. Participants also recalled how some paths smelled and sounded. So what if we had a mapping tool that would return the most enjoyable routes based not only on aesthetics but also based on smell and sound? That is the research question this talk will start to address [7,8].
Verification of POI and Location Pairs via Weakly Labeled Web Data BIBAFull-Text 743-748
  Hsiu-Min Chuang; Chia-Hui Chang
With the increased popularity of mobile devices and smart phones, location-based services (LBS) have become a common need in our daily life. Therefore, maintaining the correctness of POI (Points of Interest) data has become an important issue for many location-based services such as Google Maps and Garmin navigation systems. The simplest form of POI contains a location (e.g., represented by an address) and an identifier (e.g., an organization name) that describes the location. As time goes by, the POI relationship of a location and organization pair may change due to the opening, moving, or closing of a business. Thus, effectively identifying outdated or emerging POI relations is an important issue for improving the quality of POI data. In this paper, we examine the possibility of using location-related pages on the Web to verify existing POI relations via weakly labeled data, e.g., the co-occurrence of an organization and an address in Web pages, the published date of such pages, and the pairing diversity of an address or an organization, etc. The preliminary result shows a promising direction for discovering emerging POI and mandates more research for outdated POI.
Reconnecting Digital Publications to the Web using their Spatial Information BIBAFull-Text 749-754
  Ben De Meester; Tom De Nies; Ruben Verborgh; Erik Mannens; Rik Van de Walle
Digital publications can be packaged and viewed via the Open Web Platform using the EPUB 3 format. Meanwhile, the increased number of mobile clients and the advent of HTML5's Geolocation have opened a whole range of possibilities for digital publications to interact with their readers. However, EPUB 3 files often remain closed silos of information, no longer linked with the rest of the Web. In this paper, we propose a solution to reconnect digital publications with the (Semantic) Web. We will also show how we can use that connection to improve contextualization for a user, specifically via spatial information. We enrich digital publications by connecting the detected concepts to their URIs on, e.g., DBpedia, and by devising an algorithm to approximate the location of any detected concept, we can provide a user with the spatial center of gravity of his or her reading position. The evaluation of the location approximation algorithm showed a high recall, and the high correlation between estimation error and standard deviation can provide the user with a sense of correctness (or spread) of an approximation. This means relevant locations (and their possible radius) can be shown for a user, based on the content he or she is reading, and based on his or her location. This methodology can be used to reconnect digital publications with the online world, to entice readers, and ultimately, as a novel location-based recommendation technique.
The Role of Geographic Information in News Consumption BIBAFull-Text 755-760
  Gebrekirstos G. Gebremeskel; Arjen P. de Vries
We investigate the role of geographic proximity in news consumption. Using a month-long log of user interactions with news items of ten information portals, we study the relationship between users' geographic locations and the geographic foci of information portals and local news categories. We find that the location of news consumers correlates with the geographical information of the information portals at two levels: the portal and the local news category. At the portal level, traditional mainstream news portals have a more geographically focused readership than special interest portals, such as sports and technology. At a finer level, the mainstream news portals have local news sections that have even more geographically focused readerships.

MSM 2015

Are We Really Friends?: Link Assessment in Social Networks Using Multiple Associated Interaction Networks BIBAFull-Text 771-776
  Mohammed Abufouda; Katharina A. Zweig
Many complex network systems suffer from noise that disguises the structure of the network and hinders an accurate analysis of these systems. Link assessment is the process of identifying and eliminating the noise from network systems in order to better understand these systems. In this paper, we address the link assessment problem in social networks that may suffer from noisy relationships. We employed a machine learning classifier for assessing the links in the social network of interest using the data from the associated interaction networks around it. The method was tested with two different data sets: each contains the social network of interest, with ground truth, along with the associated interaction networks. The results showed that it is possible to effectively assess the links of a social network using only the structure of a single network of the associated interaction networks and also using the structure of the whole set of the associated interaction networks. The experiment also revealed that the assessment performance using only the structure of the social network of interest is relatively less accurate than using the associated interaction networks. This indicates that link formation in the social network of interest is not only driven by the internal structure of the social network, but also influenced by the external factors provided in the associated interaction networks.
This is your Twitter on drugs: Any questions? BIBAFull-Text 777-782
  Cody Buntain; Jennifer Golbeck
Twitter can be a rich source of information when one wants to monitor trends related to a given topic. In this paper, we look at how tweets can augment a public health program that studies emerging patterns of illicit drug use. We describe the architecture necessary to collect vast numbers of tweets over time based on a large number of search terms and the challenges that come with finding relevant information in the collected tweets. We then show several examples of early analysis we have done on this data, examining temporal and geospatial trends.
Using Context to Get Novel Recommendation in Internet Message Streams BIBAFull-Text 783-786
  Doina Alexandra Dumitrescu; Simone Santini
Novelty detection algorithms usually employ similarity measures with the previously seen and relevant documents to decide whether a document is of interest to the user. The problem that arises with this approach is that the system might recommend redundant documents. Thus, it has become extremely important to be able to distinguish between "redundant" and "novel" information. To address this limitation, we apply a contextual and semantic approach by building the user profile using self-organizing maps, which have the advantage of easily following changes in the users' interests.
Determining Influential Users with Supervised Random Walks BIBAFull-Text 787-792
  Georgios Katsimpras; Dimitrios Vogiatzis; Georgios Paliouras
The emergence of social media and the enormous growth of social networks have initiated a great amount of research in social influence analysis. In this regard, many approaches take into account only structural information while a few have also incorporated content. In this study we propose a new method to rank users according to their topic-sensitive influence which utilizes a priori information by employing supervised random walks. We explore the use of supervision in a PageRank-like random walk while also exploiting textual information from the available content. We perform a set of experiments on Twitter datasets and evaluate our findings.
Community Change Detection in Dynamic Networks in Noisy Environment BIBAFull-Text 793-798
  Sadamori Koujaku; Mineichi Kudo; Ichigaku Takigawa; Hideyuki Imai
Detection of anomalous changes in social networks has been studied in various applications such as change detection of social interests and virus infections. Among several kinds of network changes, we concentrate on the structural changes of relatively small stationary communities. Such a change is important because it implies that some crucial changes have happened in a special group, such as the dismissal of a board of directors. One difficulty is that we have to do this in a noisy environment. This paper, therefore, proposes an algorithm that finds stationary communities in a noisy environment. Experiments on two real networks showed the advantages of our proposed algorithm.
Locally Adaptive Density Ratio for Detecting Novelty in Twitter Streams BIBAFull-Text 799-804
  Yun-Qian Miao; Ahmed K. Farahat; Mohamed S. Kamel
With the massive growth of social data, considerable attention has been given to the task of detecting key topics in the Twitter stream. In this paper, we propose the use of novelty detection techniques for identifying both emerging and evolving topics in new tweets. Specifically, we propose a locally adaptive approach for density-ratio estimation in which the density ratio between new and reference data is used to capture evolving novelties, and at the same time a locally adaptive kernel is incorporated into the density-ratio objective function to capture emerging novelties based on the local neighborhood structure. In order to address the challenges associated with short text, we adopt an efficient approach for calculating semantic kernels with the proposed density-ratio method. A comparison to different methods shows the superiority of the proposed algorithm.
Short-Text Clustering using Statistical Semantics BIBAFull-Text 805-810
  Sepideh Seifzadeh; Ahmed K. Farahat; Mohamed S. Kamel; Fakhri Karray
Short documents are typically represented by very sparse vectors in the space of terms. In this case, traditional techniques for calculating text similarity result in measures which are very close to zero, since documents, even very similar ones, have very few or no terms in common. In order to alleviate this limitation, the representation of short-text segments should be enriched by incorporating information about correlation between terms. In other words, if two short segments do not have any common words, but terms from the first segment appear frequently with terms from the second segment in other documents, this means that these segments are semantically related, and their similarity measure should be high. Towards achieving this goal, we employ a method for enhancing document clustering using statistical semantics. However, the problem of high computation time arises when calculating correlation between all terms. In this work, we propose the selection of a few terms, and using these terms with the Nyström method to approximate the term-term correlation matrix. The selection of the terms for the Nyström method is performed by randomly sampling terms with probabilities proportional to the lengths of their vectors in the document space. This allows more important terms to have more influence on the approximation of the term-term correlation matrix and accordingly achieves better accuracy.
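The sampling step described in this abstract can be illustrated with a minimal sketch. It covers only the selection of landmark terms, not the Nyström approximation itself, and it is not the authors' implementation. The function name `sample_terms_by_length`, the toy counts, and the choice to sample with replacement are assumptions made for illustration.

```python
import math
import random

def sample_terms_by_length(term_vectors, k, seed=0):
    """Pick k landmark terms with probability proportional to vector length."""
    rng = random.Random(seed)
    terms = sorted(term_vectors)
    # Euclidean length of each term's vector in document space.
    lengths = [math.sqrt(sum(v * v for v in term_vectors[t])) for t in terms]
    # random.choices draws with replacement, weighted by `lengths`
    # (sampling with replacement is an assumption of this sketch).
    return rng.choices(terms, weights=lengths, k=k)

# Hypothetical toy term-document counts (one list of document counts per term).
term_vectors = {
    "network": [3, 0, 4],  # length 5.0
    "graph": [0, 0, 0],    # length 0.0, so it is never sampled
    "tweet": [1, 0, 0],    # length 1.0
}
landmarks = sample_terms_by_length(term_vectors, k=4)
```

Terms that occur often across documents have longer vectors and are therefore more likely to be chosen as landmarks, which matches the abstract's remark that more important terms have more influence on the approximation.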
A Novel Agent-Based Rumor Spreading Model in Twitter BIBAFull-Text 811-814
  Emilio Serrano; Carlos Ángel Iglesias; Mercedes Garijo
Viral marketing, the set of marketing techniques that use pre-existing social networks, has received significant attention in recent years. In this scope, Twitter is the most studied social network in viral marketing and rumor spreading is a widely researched problem. This paper contributes (1) a novel agent-based social simulation model for rumor spreading in Twitter. This model relies on the hypothesis that (2) when a user is recovered, this user will not influence his or her neighbors in the social network to recover. To support this hypothesis: (3) two Twitter rumor datasets are studied; (4) a baseline model which does not include the hypothesis is revised, reproduced, and implemented; and (5) a number of experiments are conducted comparing the real data with the two models' results.
Popularity and Quality in Social News Aggregators: A Study of Reddit and Hacker News BIBFull-Text 815-818
  Greg Stoddard
Modeling Information Diffusion in Social Media as Provenance with W3C PROV BIBAFull-Text 819-824
  Io Taxidou; Tom De Nies; Ruben Verborgh; Peter M. Fischer; Erik Mannens; Rik Van de Walle
In recent years, research in information diffusion in social media has attracted a lot of attention, since the produced data is fast, massive and viral. Additionally, the provenance of such data is equally important because it helps to judge the relevance and trustworthiness of the information enclosed in the data. However, social media currently provide insufficient mechanisms for provenance, while models of information diffusion use their own concepts and notations, targeted to specific use cases. In this paper, we propose a model for information diffusion and provenance, based on the W3C PROV Data Model. The advantage is that PROV is a Web-native and interoperable format that allows easy publication of provenance data, and minimizes the integration effort among different systems making use of PROV.
Study on the Relationship between Profile Images and User Behaviors on Twitter BIBAFull-Text 825-828
  Tomu Tominaga; Yoshinori Hijikata
In recent years, many researchers have studied the characteristics of Twitter, which is a microblogging service used by a large number of people worldwide. However, to the best of our knowledge, no study has yet examined the relationship between profile images and user behaviors on Twitter. We assume that the profile images and behaviors of users are influenced by their internal properties, because users consider their profile images as symbolic representations of themselves on Twitter. We empirically categorized profile images into 13 types, and investigated the relationships between each category of profile images and users' behaviors on Twitter.
PTHMM: Beyond Single Specific Behavior Prediction BIBAFull-Text 829-832
  Suncong Zheng; Hongyun Bao; Guanhua Tian; Yufang Wu; Bo Xu; Hongwei Hao
Existing works on user behavior analysis mainly focus on modeling a single behavior and predicting whether a user will take an action or not. However, users' behaviors do not always happen in isolation; sometimes, different behaviors happen simultaneously. Therefore, in this paper, we analyze the combination of basic behaviors, called the behavioral state here, which can describe users' complex behaviors comprehensively. We propose a model, called Personal Timed Hidden Markov Model (PTHMM), to address the problem by considering time-interval information of users' behaviors and users' personalization. The experimental result on Sina Weibo demonstrates the effectiveness of the model. It also shows that users' behavioral state is affected by their historical behaviors, and that the influence of historical behaviors declines as more time elapses.

MWA 2015

Multilingual Word Sense Induction to Improve Web Search Result Clustering BIBAFull-Text 835-839
  Lorenzo Albano; Domenico Beneventano; Sonia Bergamaschi
In [Marco2013] a novel approach to Web search result clustering based on Word Sense Induction, i.e. the automatic discovery of word senses from raw text, was presented; key to the proposed approach is the idea of, first, automatically inducing senses for the target query and, second, clustering the search results based on their semantic similarity to the word senses induced. In [1] we proposed an innovative Word Sense Induction method based on multilingual data; key to our approach was the idea that a multilingual context representation, where the context of the words is expanded by considering its translations in different languages, may improve the WSI results; the experiments showed a clear performance gain. In this paper we give some preliminary ideas on applying our multilingual Word Sense Induction method to Web search result clustering.
Document Categorization using Multilingual Associative Networks based on Wikipedia BIBAFull-Text 841-846
  Niels Bloom; Mariët Theune; Franciska De Jong
Associative networks are a connectionist language model with the ability to categorize large sets of documents. In this research we combine monolingual associative networks based on Wikipedia to create a larger, multilingual associative network, using the cross-lingual connections between Wikipedia articles. We prove that such multilingual associative networks perform better than monolingual associative networks in tasks related to document categorization by comparing the results of both types of associative network on a multilingual dataset.
Exceptional Texts On The Multilingual Web BIBAFull-Text 847-851
  Gavin Brelstaff; Francesca Chessa
Great writers help keep a language efficient for discourse of all kinds. In doing so they produce exceptional texts which may defy Statistical Machine Translation by employing uncommon idiom. Such "turns of phrase" can enter into a Nation's collective memory and form the basis from which compassion and conviction are conveyed during important national discourse. Communities that unite across language barriers have no such robust basis for discourse. Here we describe a Multilingual Web prototype application that promotes appreciation of exceptional texts by non-native readers. The application allows dual column original/translation texts (in Open Office format) to be imported into the translator's browser, to be manually aligned for semantic correspondence, to be aligned with an audio reading, and then saved as HTML5 for subsequent presentation to non-native readers. We hope to provide a new way of experiencing exceptional texts (poetry, here) that transmits their significance without incurring extraneous distraction. We motivate, outline and illustrate our application in action.
"Answer ka type kya he?": Learning to Classify Questions in Code-Mixed Language BIBAFull-Text 853-858
  Khyathi Chandu Raghavi; Manoj Kumar Chinnakotla; Manish Shrivastava
Code-Mixing (CM) is defined as the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language. CM is a natural phenomenon observed in many multilingual societies. It helps in speeding up communication and allows a wider variety of expression, due to which it has become a popular mode of communication in social media forums like Facebook and Twitter. However, current Question Answering (QA) research and systems only support expressing a question in a single language, which is an unrealistic and hard proposition, especially for certain domains like health and technology. In this paper, we take the first step towards the development of a full-fledged QA system in CM language, which is building a Question Classification (QC) system. The QC system analyzes the user question and infers the expected Answer Type (AType). The AType helps in locating and verifying the answer as it imposes certain type-specific constraints. In this paper, we present our initial efforts towards building a full-fledged QA system for CM language. We learn a basic Support Vector Machine (SVM) based QC system for English-Hindi CM questions. Due to the inherent complexities involved in processing CM language and also the unavailability of language processing resources such as POS taggers, chunkers, and parsers, we design our current system using only word-level resources such as language identification, transliteration and lexical translation. To reduce data sparsity and leverage resources available in a resource-rich language, instead of extracting features directly from the original CM words, we translate them into English and then perform featurization. We created an evaluation dataset for this task and our system achieves an accuracy of 63% and 45% in the coarse-grained and fine-grained categories of the question taxonomy, respectively. The idea of translating features into English indeed helps in improving accuracy over the unigram baseline.
A Comparative Study of Online Translation Services for Cross Language Information Retrieval BIBAFull-Text 859-864
  Ali Hosseinzadeh Vahid; Piyush Arora; Qun Liu; Gareth J. F. Jones
Technical advances in Machine Translation (MT) and its increasing availability mean that it is now widely used for the translation of search queries in multilingual search tasks. A number of free-to-use high-quality online MT systems are now available and, although imperfect in their translation behavior, are found to produce good performance in Cross-Language Information Retrieval (CLIR) applications. Users of these MT systems in CLIR tasks generally assume that they all behave similarly in CLIR applications, and the choice of MT system is often made on the basis of convenience. We present a set of experiments which compare the impact of applying two of the best known online systems, Google and Bing translation, for query translation across multiple language pairs and for two very different CLIR tasks. Our experiments show that the MT systems perform differently on average for different tasks and language pairs, but more significantly for different individual queries. We examine the differing translation behavior of these tools and seek to draw conclusions in terms of their suitability for use in different settings.
Understanding Multilingual Social Networks in Online Immigrant Communities BIBAFull-Text 865-870
  Evangelos Papalexakis; A. Seza Dogruöz
There are more multilingual speakers in the world than monolingual ones. Immigration is one of the key factors bringing speakers of different languages into contact with each other. In order to develop relevant policies and recommendations tailored to the needs of immigrant communities, it is essential to understand the interactions between the users within and across sub-communities. Using a novel method (tensor analysis), we reveal the social network structure of an online multilingual discussion forum which hosts an immigrant community in the Netherlands. In addition to the network structure, we automatically discover and categorize monolingual and bilingual sub-communities and track their formation, evolution and dissolution over a long period of time.
Exploring Current Accessibility Challenges in the Multilingual Web for Visually-Impaired Users BIBAFull-Text 871-873
  Silvia Rodríguez Vázquez
The Web is an open network accessed by people across countries, languages and cultures, irrespective of their functional diversity. Over the last two decades, interest in web accessibility issues has significantly increased among web professionals, but people with disabilities still encounter significant difficulties when browsing the Internet. In the particular case of blind users, the use of assistive technologies such as screen readers is key to navigating and interacting with web content. Although research efforts made until now have led to a better understanding of visually-impaired users' browsing behavior and, hence, the definition of web design best practices for an improved user experience by this population group, the particularities of websites with multiple language versions have been mostly overlooked. This communication paper seeks to shed light on the major challenges faced by visually impaired users when accessing the multilingual web, as well as on why and how the web localization community should contribute to a more accessible web for all.
Online Searching in English as a Foreign Language BIBAFull-Text 875-880
  Gyöngyi Rózsa; Anita Komlodi; Peng Chu
Online searching is a central element of internet users' information behaviors. Searching is usually executed in a user's native language, but searching in English as a foreign language is often necessitated by the lack of content in languages that are underrepresented in Web content. This paper reports results from a study of searching in English as a foreign language and aims at understanding this particular group of users' behaviors. Searchers whose native language is not English may have to resort to queries in English in support of their information needs due to the lack or low quality of the web content in their own language. However, when searching for information in a foreign language, users face a unique set of challenges that are not present for native language searching. We studied this problem through qualitative research methods and report results from focus groups in this paper. The results reported in this paper describe typical problems foreign language searchers face, the differences in information-seeking behavior in English and in the participants' native language, and advice and ideas shared by the focus group participants about how to search effectively and efficiently in English.

NewsWWW 2015

Supply and Demand: Propagation and Absorption of News BIBAFull-Text 883
  Anastassia Fedyk
The importance of the media for individual and market behavior cannot be overstated. For example, a front-page article in the New York Times that mostly reprints information from six months prior can cause a company's stock price to jump by over 300%. To better understand the channels through which the media affects markets and the resulting implications for news production, we study how individuals process information in news. Do readers display a preference for news with a positive slant? Are consumers of news segregated based on the media outlets they favor? Do individuals recognize which news is novel, and which simply reprints old information? While these questions are grounded in fundamental human psychology, they are also inextricably linked to the rapidly changing technology of news production. With over a million stories a day passing through the Bloomberg terminal alone, the volume of data -- both on the content of news and the behavior of readers -- has skyrocketed. As a result, analysis of media production and consumption requires ever more sophisticated techniques for identifying the informational value of news and the behavioral patterns of its modern readers.
Scalable Preference Learning from Data Streams BIBAFull-Text 885-890
  Fabon Dzogang; Thomas Lansdall-Welfare; Saatviga Sudhahar; Nello Cristianini
We study the task of learning the preferences of online readers of news, based on their past choices. Previous work has shown that it is possible to model this situation as a competition between articles, where the most appealing articles of the day are those selected by the most users. The appeal of an article can be computed from its textual content, and the evaluation function can be learned from training data. In this paper, we show how this task can benefit from an efficient algorithm, based on hashing representations, which enables it to be deployed on high intensity data streams. We demonstrate the effectiveness of this approach on four real world news streams, compare it with standard approaches, and describe a new online demonstration based on this technology.
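The abstract does not say which hashing scheme is used. As one plausible reading, the standard "hashing trick" maps an article's words into a fixed-length vector so that a linear appeal score can be computed without storing a vocabulary. The sketch below rests on that assumption; `hash_features`, `score`, and the bucket count are hypothetical names and values, not taken from the paper.

```python
import zlib

def hash_features(text, n_buckets=16):
    """Map a bag of words to a fixed-length count vector via hashing."""
    vec = [0] * n_buckets
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vec[bucket] += 1
    return vec

def score(weights, vec):
    """Linear appeal score: dot product of weights and hashed features."""
    return sum(w * x for w, x in zip(weights, vec))

# Hypothetical usage: with all weights equal to 1, the score is the token count.
features = hash_features("News about news")
appeal = score([1.0] * 16, features)
```

Because the vector length is fixed in advance, memory use does not grow with the number of distinct words seen in the stream, at the cost of occasional collisions between unrelated words.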
Interpreting News Recommendation Models BIBAFull-Text 891-892
  Blaz Fortuna; Pat Moore; Marko Grobelnik
This paper presents an approach for recommending news articles on a large news portal. Focus is given to interpretability of the developed models, analysis of their performance, and deriving understanding of short and long-term user behavior on a news portal.
Measuring Gender Bias in News Images BIBAFull-Text 893-898
  Sen Jia; Thomas Lansdall-Welfare; Nello Cristianini
Analysing the representation of gender in news media has a long history within the fields of journalism, media and communication. Typically this can be performed by measuring how often people of each gender are mentioned within the textual content of news articles. In this paper, we adopt a different approach, classifying the faces in images of news articles into their respective gender. We present a study on 885,573 news articles gathered from the web, covering a period of four months between 19th October 2014 and 19th January 2015 from 882 news outlets. Findings show that gender bias differs by topic, with Fashion and the Arts showing the least bias. Comparisons of gender bias by outlet suggest that tabloid-style news outlets may be less gender-biased than broadsheet-style ones, supporting previous results from textual content analysis of news articles.
Towards a Complete Event Type Taxonomy BIBAFull-Text 899-902
  Aljaz Košmerlj; Evgenia Belyaeva; Gregor Leban; Marko Grobelnik; Blaz Fortuna
We present initial results of our effort to build an extensive and complete taxonomy of events described in news articles. By crawling Wikipedia's current events portal we identified nine top-level event types. Using articles referenced by the portal we built an event type classification model for news articles using lexical and semantic features and present a small-scale manual evaluation of its results. Results show that our model can accurately distinguish between event types but its coverage could still be significantly improved.
The Computable News project: Research in the Newsroom BIBAFull-Text 903-908
  Will Radford; Daniel Tse; Joel Nothman; Ben Hachey; George Wright; James R. Curran; Will Cannings; Tim O'Keefe; Matthew Honnibal; David Vadas; Candice Loxley
We report on a four-year academic research project to build a natural language processing platform in support of a large media company. The Computable News platform processes news stories, producing a layer of structured data that can be used to build rich applications. We describe the underlying platform and the research tasks that we explored while building it. The platform supports a wide range of prototype applications designed to support different newsroom functions. We hope that this qualitative review provides some insight into the challenges involved in this type of project.

OOEW 2015

Adaptive Sequential Experimentation Techniques for A/B Testing and Model Tuning BIBAFull-Text 911
  Scott Clark
We introduce Bayesian Global Optimization as an efficient way to optimize a system's parameters, when evaluating parameters is time-consuming or expensive. The adaptive sequential experimentation techniques described can be used to help tackle a myriad of problems including optimizing a system's click-through or conversion rate via online A/B testing, tuning parameters of a machine learning prediction method or expensive batch job, designing an engineering system or finding the optimal parameters of a real-world physical experiment. We explore different tools available for performing these tasks, including Yelp's MOE and SigOpt. We will present the motivation, implementation, and background of these tools. Applications and examples from industry and best practices for using the techniques will be provided.
Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments BIBAFull-Text 913
  Alex Deng
As A/B testing gains wider adoption in the industry, more people begin to realize the limitations of the traditional frequentist null hypothesis statistical testing (NHST). The large number of search results for the query "Bayesian A/B testing" shows just how much the interest in the Bayesian perspective is growing. In recent years there have also been voices arguing that Bayesian A/B testing should replace frequentist NHST and is strictly superior in all aspects. Our goal here is to dispel this myth by looking at both advantages and issues of Bayesian methods. In particular, we propose an objective Bayesian A/B testing framework with which we hope to bring the best of Bayesian and frequentist methods together. Unlike traditional methods, this method requires the existence of historical A/B test data to objectively learn a prior. We have successfully applied this method to Bing, using thousands of experiments to establish the priors.
Can I Take a Peek?: Continuous Monitoring of Online A/B Tests BIBAFull-Text 915
  Ramesh Johari
A/B testing is a hallmark of Internet services: from e-commerce sites to social networks to marketplaces, nearly all online services use randomized experiments as a mechanism to make better business decisions. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals. Despite their ubiquity, these reported values are computed under the assumption that the experimenter will not continuously monitor their test -- in other words, there should be no repeated "peeking" at the results that affects the decision of whether to continue the test. On the other hand, one of the greatest benefits of advances in information technology, computational power, and visualization is precisely the fact that experimenters can watch experiments in progress, with greater granularity and insight over time than ever before.
   We ask the question: if users will continuously monitor experiments, then what statistical methodology is appropriate for hypothesis testing, significance, and confidence intervals? We present recent work addressing this question. In particular, building from results in sequential hypothesis testing, we present analogues of classical frequentist statistical measures that are valid even though users are continuously monitoring the results.
Online Search Evaluation with Interleaving BIBAFull-Text 917
  Filip Radlinski
Online evaluation allows information retrieval systems to be assessed based on how real users respond to search results presented. Compared with traditional offline evaluation based on manual relevance assessments, online evaluation is particularly attractive in settings where reliable assessments are difficult or too expensive to obtain. However, the successful use of online evaluation requires the right metrics to be used, as real user behaviour is often difficult to interpret. I will present interleaving, a sensitive online evaluation approach that creates paired comparisons for every user query, and compare it with alternative A/B online evaluation approaches. I will also show how interleaving can be parameterized to create a family of evaluation metrics that can be chosen to best match the goals of an evaluation.
Offline Evaluation of Response Prediction in Online Advertising Auctions BIBAFull-Text 919-922
  Olivier Chapelle
Predicting click-through rates and conversion rates are two core machine learning problems in online advertising. The evaluation of such systems is often based on traditional supervised learning metrics that ignore how the predictions are used. These predictions are in fact part of bidding systems in online advertising auctions. We present here an empirical evaluation of a metric that is specifically tailored for auctions in online advertising and show that it correlates better than standard metrics with A/B test results.
Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments BIBAFull-Text 923-928
  Alex Deng
As A/B testing gains wider adoption in the industry, more people begin to realize the limitations of the traditional frequentist null hypothesis statistical testing (NHST). The large number of search results for the query "Bayesian A/B testing" shows just how much the interest in the Bayesian perspective is growing. In recent years there have also been voices arguing that Bayesian A/B testing should replace frequentist NHST and is strictly superior in all aspects. Our goal here is to dispel this myth by looking at both advantages and issues of Bayesian methods. In particular, we propose an objective Bayesian A/B testing framework with which we hope to bring the best of Bayesian and frequentist methods together. Unlike traditional methods, this method requires the existence of historical A/B test data to objectively learn a prior. We have successfully applied this method to Bing, using thousands of experiments to establish the priors.
Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study BIBAFull-Text 929-934
  Lihong Li; Shunbao Chen; Jim Kleban; Ankur Gupta
Optimizing an interactive system against a predefined online metric is particularly challenging, especially when the metric is computed from user feedback such as clicks and payments. The key challenge is the counterfactual nature: in the case of Web search, any change to a component of the search engine may result in a different search result page for the same query, but we normally cannot infer reliably from search logs how users would react to the new result page. Consequently, it appears impossible to accurately estimate online metrics that depend on user feedback, unless the new engine is actually run to serve live users and compared with a baseline in a controlled experiment. This approach, while valid and successful, is unfortunately expensive and time-consuming. In this paper, we propose to address this problem using causal inference techniques, under the contextual-bandit framework. This approach effectively allows one to run potentially many online experiments offline from search logs, making it possible to estimate and optimize online metrics quickly and inexpensively. Focusing on an important component in a commercial search engine, we show how these ideas can be instantiated and applied, and obtain very promising results that suggest the wide applicability of these techniques.
Unbiased Ranking Evaluation on a Budget BIBAFull-Text 935-937
  Tobias Schnabel; Adith Swaminathan; Thorsten Joachims
We address the problem of assessing the quality of a ranking system (e.g., search engine, recommender system, review ranker) given a fixed budget for collecting expert judgments. In particular, we propose a method that selects which items to judge in order to optimize the accuracy of the quality estimate. Our method is not only efficient, but also provides estimates that are unbiased -- unlike common approaches that tend to underestimate performance or that have a bias against new systems that are evaluated re-using previous relevance scores.
Counterfactual Risk Minimization BIBAFull-Text 939-941
  Adith Swaminathan; Thorsten Joachims
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we derive generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. Using the CRM principle, we derive a new learning algorithm -- Policy Optimizer for Exponential Models (POEM) -- for structured output prediction. We evaluate POEM on several multi-label classification problems and verify that its empirical performance supports the theory.
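The variance-aware idea in this abstract can be illustrated with a small sketch: a propensity-weighted empirical risk plus a penalty proportional to the estimator's standard error. This is an illustration of the general principle, not the POEM algorithm itself; the function name `crm_objective`, the clipping constant `clip`, and the weight `lam` are assumptions introduced here.

```python
import math
from typing import Sequence


def crm_objective(
    losses: Sequence[float],
    new_probs: Sequence[float],
    logged_probs: Sequence[float],
    lam: float = 1.0,
    clip: float = 10.0,
) -> float:
    """Propensity-weighted risk plus a variance penalty (illustrative only)."""
    n = len(losses)
    if n < 2 or len(new_probs) != n or len(logged_probs) != n:
        raise ValueError("need at least two aligned samples")
    # Clipped importance-weighted loss for each logged sample.
    weighted = [
        loss * min(clip, p_new / p_log)
        for loss, p_new, p_log in zip(losses, new_probs, logged_probs)
    ]
    mean = sum(weighted) / n
    # Unbiased sample variance of the weighted losses.
    variance = sum((u - mean) ** 2 for u in weighted) / (n - 1)
    return mean + lam * math.sqrt(variance / n)


# Toy data: the candidate policy matches the logging policy exactly.
objective = crm_objective(
    losses=[1.0, 0.0, 1.0, 0.0],
    new_probs=[0.5, 0.5, 0.5, 0.5],
    logged_probs=[0.5, 0.5, 0.5, 0.5],
)
```

With `lam=0.0` the objective reduces to the plain propensity-weighted risk; larger values favor candidate policies whose estimates are less noisy.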
What Size Should A Mobile Ad Be? BIBAFull-Text 943-944
  Pengyuan Wang; Wei Sun; Dawei Yin
We present a causal inference framework for evaluating the impact of advertising treatments. Our framework is computationally efficient by employing a tree structure that specifies the relationship between user characteristics and the corresponding ad treatment. We illustrate the applicability of our proposal on a novel advertising effectiveness study: finding the best ad size on different mobile devices in order to maximize the success rates. The study shows a surprising phenomenon that a larger mobile device does not need a larger ad. In particular, the 300*250 ad size is universally good for all the mobile devices, regardless of the mobile device size.

RDSM 2015

Discriminative Models for Predicting Deception Strategies BIBAFull-Text 947-952
  Darren Scott Appling; Erica J. Briscoe; Clayton J. Hutto
Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty of choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.
Assessment of Tweet Credibility with LDA Features BIBAFull-Text 953-958
  Jun Ito; Jing Song; Hiroyuki Toda; Yoshimasa Koike; Satoshi Oyama
With the fast development of Social Networking Services (SNS) such as Twitter, which enable users to exchange short messages online, people can get information not only from the traditional news media but also from the masses of SNS users. However, SNS users sometimes propagate spurious or misleading information, so an effective way to automatically assess the credibility of information is required. In this paper, we propose methods to assess information credibility on Twitter, methods that utilize the "tweet topic" and "user topic" features derived from the Latent Dirichlet Allocation (LDA) model. We collected two thousand tweets labeled by seven annotators each, and designed effective features for our classifier on the basis of data analysis results. An experiment we conducted showed a 3% improvement in Area Under Curve (AUC) scores compared with existing methods, leading us to conclude that using topical features is an effective way to assess tweet credibility.
Visualization of Trustworthiness Graphs BIBAFull-Text 959-964
  Stephen Mayhew; Dan Roth
Trustworthiness is a field of research that seeks to estimate the credibility of information by using knowledge of the source of the information. The most interesting form of this problem is when different pieces of information share sources, and when there is conflicting information from different sources. This model can be naturally represented as a bipartite graph. In order to understand this data well, it is important to have several methods of exploring it. A good visualization can help to understand the problem in a way that no simple statistics can. This paper defines several desiderata for a "good" visualization and presents three different visualization methods for trustworthiness graphs.
   The first visualization method is simply a naive bipartite layout, which is infeasible in nearly all cases. The second method is a physics-based graph layout that reveals some interesting and important structure of the graph. The third method is an orthogonal approach based on the adjacency matrix representation of a graph, but with many improvements that give valuable insights into the structure of the trustworthiness graph.
   We present interactive web-based software for the third form of visualization.
Crowdsourced Rumour Identification During Emergencies BIBAFull-Text 965-970
  Richard McCreadie; Craig Macdonald; Iadh Ounis
When a significant event occurs, many social media users leverage platforms such as Twitter to track that event. Moreover, emergency response agencies are increasingly looking to social media as a source of real-time information about such events. However, false information and rumours are often spread during such events, which can influence public opinion and limit the usefulness of social media for emergency management. In this paper, we present an initial study into rumour identification during emergencies using crowdsourcing. In particular, through an analysis of three tweet datasets relating to emergency events from 2014, we propose a taxonomy of tweets relating to rumours. We then perform a crowdsourced labeling experiment to determine whether crowd assessors can identify rumour-related tweets and where such labeling can fail. Our results show that overall, agreement over the tweet labels produced was high (0.7634 Fleiss Kappa), indicating that crowd-based rumour labeling is possible. However, not all tweets are of equal difficulty to assess. Indeed, we show that tweets containing disputed/controversial information tend to be some of the most difficult to identify.
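For readers unfamiliar with the agreement statistic reported above, here is a short sketch of how Fleiss' kappa is computed from a table of label counts. The function name `fleiss_kappa` and the toy tables are illustrative and are not taken from the paper's data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    category_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per item.
    item_agreement = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(item_agreement) / n_items
    expected = sum(p * p for p in category_share)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1.0 - expected)


# Three raters, two categories: full agreement, then split votes.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

A value of 1 means complete agreement, while values at or below 0 mean agreement no better than chance.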
Detecting Singleton Review Spammers Using Semantic Similarity BIBAFull-Text 971-976
  Vlad Sandulescu; Martin Ester
Online reviews have increasingly become a very important resource for consumers when making purchases. However, it is becoming more and more difficult for people to make well-informed buying decisions without being deceived by fake reviews. Prior works on the opinion spam problem mostly considered classifying fake reviews using behavioral user patterns. They focused on prolific users who write more than a couple of reviews, discarding one-time reviewers. The number of singleton reviewers, however, is expected to be high for many review websites. While behavioral patterns are effective when dealing with elite users, for one-time reviewers, the review text needs to be exploited. In this paper we tackle the problem of detecting fake reviews written by the same person using multiple names, posting each review under a different name. We propose two methods to detect similar reviews and show the results generally outperform the vectorial similarity measures used in prior works. The first method extends the semantic similarity between words to the reviews level. The second method is based on topic modeling and exploits the similarity of the reviews' topic distributions using two models: bag-of-words and bag-of-opinion-phrases. The experiments were conducted on reviews from three different datasets: Yelp (57K reviews), Trustpilot (9K reviews) and Ott dataset (800 reviews).
Fact-checking Effect on Viral Hoaxes: A Model of Misinformation Spread in Social Networks BIBAFull-Text 977-982
  Marcella Tambuscio; Giancarlo Ruffo; Alessandro Flammini; Filippo Menczer
spread of misinformation, rumors and hoaxes. The goal of this work is to introduce a simple modeling framework to study the diffusion of hoaxes and in particular how the availability of debunking information may contain their diffusion. As traditionally done in the mathematical modeling of information diffusion processes, we regard hoaxes as viruses: users can become infected if they are exposed to them, and turn into spreaders as a consequence. Upon verification, users can also turn into non-believers and spread the same attitude with a mechanism analogous to that of the hoax-spreaders. Both believers and non-believers, as time passes, can return to a susceptible state. Our model is characterized by four parameters: spreading rate, gullibility, probability to verify a hoax, and that to forget one's current belief. Simulations on homogeneous, heterogeneous, and real networks for a wide range of parameter values reveal a threshold for the fact-checking probability that guarantees the complete removal of the hoax from the network. Via a mean field approximation, we establish that the threshold value does not depend on the spreading rate but only on the gullibility and forgetting probability. Our approach allows one to quantitatively gauge the minimal reaction necessary to eradicate a hoax.
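The dynamics sketched in this abstract can be illustrated with a toy mean-field iteration over three population fractions: susceptible, believer, and fact-checker. The update rules below are an assumption made for illustration (the abstract names the four parameters but not the equations), and the names `mean_field_step` and `simulate` are hypothetical.

```python
from typing import Tuple

State = Tuple[float, float, float]  # (susceptible, believer, fact_checker)


def mean_field_step(
    s: float, b: float, c: float,
    spreading_rate: float, gullibility: float,
    p_verify: float, p_forget: float,
) -> State:
    """One mean-field update of the population fractions (assumed rules)."""
    if p_verify + p_forget > 1.0:
        raise ValueError("p_verify + p_forget must not exceed 1")
    # Assumption: gullibility tilts exposure toward believers.
    believer_pull = b * (1.0 + gullibility)
    checker_pull = c * (1.0 - gullibility)
    total_pull = believer_pull + checker_pull
    if total_pull > 0.0:
        to_believer = spreading_rate * believer_pull / total_pull
        to_checker = spreading_rate * checker_pull / total_pull
    else:
        to_believer = to_checker = 0.0
    # Believers may verify (become fact-checkers) or forget (become susceptible);
    # fact-checkers may forget.
    new_s = s * (1.0 - to_believer - to_checker) + p_forget * (b + c)
    new_b = s * to_believer + b * (1.0 - p_forget - p_verify)
    new_c = s * to_checker + b * p_verify + c * (1.0 - p_forget)
    return new_s, new_b, new_c


def simulate(state: State, steps: int, **params: float) -> State:
    """Apply the mean-field update repeatedly."""
    for _ in range(steps):
        state = mean_field_step(*state, **params)
    return state


final_state = simulate(
    (0.9, 0.1, 0.0), steps=200,
    spreading_rate=0.5, gullibility=0.3, p_verify=0.05, p_forget=0.1,
)
```

Sweeping `p_verify` while holding the other parameters fixed is one way to look for the kind of threshold behavior the abstract reports, though this sketch makes no claim about where that threshold lies.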
Real-Time News Certification System on Sina Weibo BIBAFull-Text 983-988
  Xing Zhou; Juan Cao; Zhiwei Jin; Fei Xie; Yu Su; Dafeng Chu; Xuehui Cao; Junqiang Zhang
In this paper, we propose a novel framework for real-time news certification. Traditional methods detect rumors at the message level and analyze the credibility of a single tweet. However, on most occasions, we only remember the keywords of an event, and it is hard to completely describe an event in a tweet. Based on the keywords of an event, we gather related microblogs through a distributed data acquisition system that meets the real-time processing needs. Then, we build an ensemble model that combines user-based, propagation-based and content-based models. The experiments show that our system can give a response in 35 seconds on average per query, which is critical for a real-time system. Most importantly, our ensemble model boosts the performance. We also offer some important information such as key users, key microblogs and the timeline of events for further investigation of an event. Our system has already been deployed at the Xinhua News Agency for half a year. To the best of our knowledge, this is the first real-time news certification system for verifying social media content.

SAVE-SD 2015

Increasing the Productivity of Scholarship: The Case for Knowledge Graphs BIBAFull-Text 993
  Paul Groth
Over the past several years, we have seen an explosion in the number of tools and services that enable scholars to improve their personal productivity whether it is socially enabled reference managers or cloud-hosted experimental environments. However, we have yet to see a step-change in the productivity of the system of scholarship as a whole. While there are certainly broader social reasons for this, in this talk I argue that we are just now at a technical position to create radical change in how scholarship is performed. Specifically, I will discuss how recent advances in machine reading, developments in open data and explicit social networks, can be used to create scholarly knowledge graphs. These graphs can connect the underlying intellectual corpus with ongoing discourse allowing the development of algorithms that hypothesize, filter and reflect alongside humans.
It is not What but Who you Know: A Time-Sensitive Collaboration Impact Measure of Researchers in Surrounding Communities BIBAFull-Text 995-1000
  Luigi Di Caro; Mario Cataldi; Myriam Lamolle; Claudio Schifanella
In the last decades, many measures and metrics have been proposed with the goal of automatically providing quantitative rather than qualitative indications over researchers' academic productions. However, when evaluating a researcher, most of the commonly-applied measures do not consider one of the key aspects of every research work: the collaborations among researchers and, more specifically, the impact that each co-author has on the scientific production of another. In fact, in an evaluation process, some co-authored works can unconditionally favor researchers working in competitive research environments surrounded by experts able to lead high-quality research projects, where state-of-the-art measures usually fail in trying to distinguish co-authors from their pure publication history. In the light of this, instead of focusing on a pure quantitative/qualitative evaluation of curricula, we propose a novel temporal model for formalizing and estimating the dependence of a researcher on individual collaborations, over time, in surrounding communities. We then implemented and evaluated our model with a set of experiments on real case scenarios and through an extensive user study.
Exploring Bibliographies for Research-related Tasks BIBAFull-Text 1001-1006
  Angelo Di Iorio; Raffaele Giannella; Francesco Poggi; Fabio Vitali
Bibliographies are fundamental tools for research communities. Besides the obvious uses as connection to previous research, citations are also widely used for evaluation purposes: the productivity of researchers, departments and universities is increasingly measured by counting their citations. Unfortunately, citation counters are just rough indicators: a deeper knowledge of individual citations -- where, when, by whom and why -- improves research evaluation tasks and supports researchers in their daily activity. Yet, such information is mostly hidden within repositories of scholarly papers and is still difficult to find, navigate and make use of.
   In this paper, we present a novel tool for exploring scientific articles through their citations. The environment is built on top of a rich citation network, encoded as a LOD, and includes a user-friendly interface to access, filter and highlight information about bibliographic data.
Conference Live: Accessible and Sociable Conference Semantic Data BIBAFull-Text 1007-1012
  Anna Lisa Gentile; Maribel Acosta; Luca Costabello; Andrea Giovanni Nuzzolese; Valentina Presutti; Diego Reforgiato Recupero
In this paper we describe Conference Live, a semantic Web application to browse conference data. Conference Live is a Web and mobile application based on conference data from the Semantic Web Dog Food server, which provides facilities to browse papers and authors at a specific conference. Available data for the specific conference is enriched with social features (e.g. integrated Twitter accounts of paper authors), scheduling features (calendar information is attached for paper presentations and social events), the possibility to check and add feedback to each paper and to vote for papers, if the conference includes sessions where participants can vote, as is popular e.g. for poster sessions. As a use case we report on the usage of the application at the Extended Semantic Web Conference (ESWC) in May 2014.
Period Assertion as Nanopublication: The PeriodO Period Gazetteer BIBAFull-Text 1013-1018
  Patrick Golden; Ryan Shaw
The PeriodO period gazetteer collects definitions of time periods made by archaeologists and other historical scholars. In constructing the gazetteer, we sought to make period definitions parsable and comparable by computers while also retaining the broader scholarly context in which they were conceived. Our approach resulted in a dataset of period definitions and their provenances that resemble what data scientists working in the e-science domain have dubbed "nanopublications." In this paper we describe the origin and goals of nanopublications, provide an overview of the design and implementation of a database of period definitions, and highlight the similarities and differences between the two.
The Paper or the Video: Why Choose? BIBAFull-Text 1019-1022
  Hugo Mougard; Matthieu Riou; Colin de la Higuera; Solen Quiniou; Olivier Aubert
This paper investigates the possibilities offered by the more and more common availability of scientific video material. In particular it investigates how to best study research results by combining recorded talks and their corresponding scientific articles. To do so, it outlines desired properties of an interesting e-research system based on cognitive considerations and considers related issues. This design work is completed by the introduction of two prototypes.
What's in this paper?: Combining Rhetorical Entities with Linked Open Data for Semantic Literature Querying BIBAFull-Text 1023-1028
  Bahar Sateli; René Witte
Finding research literature pertaining to a task at hand is one of the essential tasks that scientists face on a daily basis. Standard information retrieval techniques allow one to quickly obtain a vast number of potentially relevant documents. Unfortunately, the search results then require significant effort for manual inspection, where we would rather select relevant publications based on more fine-grained, semantically rich queries involving a publication's contributions, methods, or application domains. We argue that a novel combination of three distinct methods can significantly advance this vision: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic generation of RDF triples for both NEs and REs using semantic web ontologies to interconnect them. Combined in a single workflow, these techniques allow us to automatically construct a knowledge base that facilitates numerous advanced use cases for managing scientific documents.
Using Linked Data Traversal to Label Academic Communities BIBAFull-Text 1029-1034
  Ilaria Tiddi; Mathieu d'Aquin; Enrico Motta
In this paper we exploit knowledge from Linked Data to ease the process of analysing scholarly data. In the last years, many techniques have been presented with the aim of analysing such data and revealing new, unrevealed knowledge, generally presented in the form of "patterns". However, the discovered patterns often still require human interpretation to be further exploited, which might be a time and energy consuming process. Our idea is that the knowledge shared within Linked Data can actually help and ease the process of interpreting these patterns. In practice, we show how research communities obtained through standard network analytics techniques can be made more understandable through exploiting the knowledge contained in Linked Data. To this end, we apply our system Dedalo that, by performing a simple Linked Data traversal, is able to automatically label clusters of words, corresponding to topics of the different communities.
A Model for Copyright and Licensing: Elsevier's Copyright Model BIBAFull-Text 1035-1038
  Anna Tordai
With the rise of digital publishing and open access it has become increasingly important for publishers to store information about ownership and licensing of published works in a robust way. In a fast moving environment Elsevier, a leading science publisher, recognizes the importance of sound models underlying its data. In this paper, we describe a data model for copyright and licensing used by Elsevier for capturing elements of copyright. We explain some of the rationale behind the model and provide examples of frequently occurring cases in terms of the model.
Mapping The Evolution of Scientific Community Structures in Time BIBAFull-Text 1039-1044
  Theresa Velden; Shiyan Yan; Kan Yu; Carl Lagoze
The increasing online availability of scholarly corpora promises unprecedented opportunities for visualizing and studying scholarly communities. We seek to leverage this with a mixed-method approach that integrates network analysis of features of the online corpora with ethnographic studies of the communities that produce them. In our development of tools and visualizations we seek to support the going back and forth between views of community structures and the perceptions and research trajectories of individual researchers and research groups. We here present results from tracking the temporal evolution of community structures within a research specialty. We explore how the temporal evolution of these maps can be used to provide insights into the historical evolution of a field as well as extract more accurate snapshots of the community structures at a given point in time. We are currently conducting qualitative interviews with experts in this research specialty to assess the validity of the maps.
Research Collaboration and Topic Trends in Computer Science: An Analysis Based on UCP Authors BIBAFull-Text 1045-1050
  Yan Wu; Srinivasan Venkatramanan; Dah Ming Chiu
Academic publication metadata can be used to analyze the collaboration, productivity and hot topic trends of a research community. Recently, it has been shown that authors with uninterrupted and continuous presence (UCP) over a time window, though small in number (about 1%), amass the majority of significant and high-influence academic output. We adopt the UCP metric to retrieve the most active authors in the Computer Science (CS) community over different time windows in the past 50 years, and use them to analyze collaboration, productivity and topic trends. We show that the UCP authors are representative of the overall population; the community is increasingly moving in the direction of Team Research (as opposed to Soloist or Mentor-mentee research), with increased level and degree of collaboration; and the research topics become increasingly inter-related. By focusing on the UCP authors, we can more easily visualize these trends.
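A small sketch of how UCP authors might be selected from publication metadata, under the assumption that "uninterrupted and continuous presence" means at least one publication in every year of the window. The abstract does not spell out the exact definition, and `ucp_authors` and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def ucp_authors(
    publications: Iterable[Tuple[str, int]],
    start_year: int,
    end_year: int,
) -> Set[str]:
    """Authors with at least one publication in every year of the window."""
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    years_by_author: Dict[str, Set[int]] = defaultdict(set)
    for author, year in publications:
        years_by_author[author].add(year)
    window = set(range(start_year, end_year + 1))
    # An author qualifies when the window is a subset of their active years.
    return {a for a, years in years_by_author.items() if window <= years}


# Toy (author, year) records.
records = [
    ("alice", 2010), ("alice", 2011), ("alice", 2012),
    ("bob", 2010), ("bob", 2012),
    ("carol", 2011), ("carol", 2012),
]
continuous = ucp_authors(records, 2010, 2012)
```

Sliding the window across the years of interest and recomputing the set gives a per-window view of the active core of a community.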
Enhanced Publication Management Systems: A Systemic Approach Towards Modern Scientific Communication BIBAFull-Text 1051-1052
  Alessia Bardi; Paolo Manghi
Enhanced Publication Information Systems (EPISs) are information systems devised for the management of enhanced publications (EP), i.e. digital publications enriched with (links to) other research outcomes such as data, processing workflows, software. Today, EPISs are typically realised with a "from scratch" approach that entails non-negligible implementation and maintenance costs.
   This work argues for a more systemic approach to narrow those costs and presents the notion of Enhanced Publication Management Systems, software frameworks that support the realisation of EPISs by providing developers with EP-oriented tools and functionalities.
Visualizing Collaborations and Online Social Interactions at Scientific Conferences for Scholarly Networking BIBAFull-Text 1053-1054
  Laurens De Vocht; Selver Softic; Anastasia Dimou; Ruben Verborgh; Erik Mannens; Martin Ebner; Rik Van de Walle
The various ways of interacting with social media, web collaboration tools, co-authorship and citation networks for scientific and research purposes remain distinct. In this paper, we propose a solution to align such information. We particularly developed an exploratory visualization of research networks. The result is a scholar centered, multi-perspective view of conferences and people based on their collaborations and online interactions. We measured the relevance and user acceptance of this type of interactive visualization. Preliminary results indicate a high precision both for recognized people and conferences. The majority in a group of test-users responded positively to a set of statements about the acceptance.
Collaborative Exchange of Systematic Literature Review Results: The Case of Empirical Software Engineering BIBAFull-Text 1055-1056
  Fajar J. Ekaputra; Marta Sabou; Estefanía Serral; Stefan Biffl
Complementary to managing bibliographic information as done by digital libraries, the management of concrete research objects (e.g., experimental workflows, design patterns) is a pre-requisite to foster collaboration and reuse of research results. In this paper we describe the case of the Empirical Software Engineering domain, where researchers use systematic literature reviews (SLRs) to conduct and report on literature studies. Given their structured nature, the outputs of such SLR processes are a special and complex type of research object. Since performing SLRs is a time consuming process, it is highly desirable to enable sharing and reuse of the complex knowledge structures produced through SLRs. This would enable, for example, conducting new studies that build on the findings of previous studies. To support collaborative features necessary for multiple research groups to share and reuse each other's work, we hereby propose a solution approach that is inspired by software engineering best-practices and is implemented using Semantic Web technologies.
LDP4ROs: Managing Research Objects with the W3C Linked Data Platform BIBAFull-Text 1057-1058
  Daniel Garijo; Nandana Mihindukulasooriya; Oscar Corcho
In this demo we present LDP4ROs, a prototype implementation that allows creating, browsing and updating Research Objects (ROs) and their contents using typical HTTP operations. This is achieved by aligning the RO model with the W3C Linked Data Platform (LDP).
Visual-Based Classification of Figures from Scientific Literature BIBAFull-Text 1059-1060
  Theodoros Giannakopoulos; Ioannis Foufoulas; Eleftherios Stamatogiannakis; Harry Dimitropoulos; Natalia Manola; Yannis Ioannidis
Authors of scientific publications and books use images to present a wide spectrum of information. Despite the richness of the visual content of scientific publications the figures are usually not taken into consideration in the context of text mining methodologies towards the automatic indexing and retrieval of scientific corpora. In this work, we present a system for automatic categorization of figures from scientific literature to a set of predefined classes. We have employed a wide range of visual features that achieve high discrimination ability between the adopted classes. A real-world dataset has been compiled and annotated in order to train and evaluate the proposed method using three different classification schemata.
Science Bots: A Model for the Future of Scientific Computation? BIBAFull-Text 1061-1062
  Tobias Kuhn
As a response to the trends of the increasing importance of computational approaches and the accelerating pace in science, I propose in this position paper to establish the concept of "science bots" that autonomously perform programmed tasks on input data they encounter and immediately publish the results. We can let such bots participate in a reputation system together with human users, meaning that bots and humans get positive or negative feedback by other participants. Positive reputation given to these bots would also shine on their owners, motivating them to contribute to this system, while negative reputation will allow us to filter out low-quality data, which is inevitable in an open and decentralized system.

SIMPLEX 2015

Predicting Pinterest: Organising the World's Images with Human-machine Collaboration BIBAFull-Text 1065
  Nishanth Sastry
The user generated content revolution has created a glut of multimedia content online -- from Flickr to Facebook, new images are being made available for public consumption every day. In this talk, we will first explore how, on sites such as Pinterest, users are bringing order to this burgeoning collection by manually curating collections of images in ways that are highly personalised and relevant to their own use. We will then discuss the phenomenon of social bootstrapping, whereby existing mature social networks such as Facebook are helping bootstrap engaged communities of content curators on external sites such as Pinterest. Finally, we will demonstrate how the manual effort involved in curation can be amplified using a unique human-machine collaboration: By treating the curation efforts of a subset of users on Pinterest as a distributed human computation over a low-dimensional approximation of the content corpus, we derive simple yet powerful signals, which, when combined with image-related features drawn from state-of-the-art deep learning techniques, allow us to automatically and accurately populate the personalised curated collections of all other users.
Challenges of Forecasting and Measuring a Complex Networked World BIBAFull-Text 1067
  Bruno Ribeiro
A new era of data analytics of online social networks promises tremendous high-impact societal, business, and healthcare applications. As more users join online social networks, the data available for analysis and forecast of human social and collective behavior grows at an incredible pace. The first part of this talk introduces an apparent paradox, where larger online social networks entail more user data but also less analytic and forecasting capabilities [7]. More specifically, the paradox applies to forecasting properties of network processes such as network cascades, showing that in some scenarios unbiased long term forecasting becomes increasingly inaccurate as the network grows but, paradoxically, short term forecasting -- such as the predictions in Cheng et al. [2] and Ribeiro et al. [7] -- improves with network size. We discuss the theoretic foundations of this paradox and its connections with known information theoretic measures such as Shannon capacity. We also discuss the implications of this paradox on the scalability of big data applications and show how information theory tools -- such as Fisher information [3,8] -- can be used to design more accurate and scalable methods for network analytics [6,8,10]. The second part of the talk focuses on how these results impact our ability to perform network analytics when network data is only available through crawlers and the complete network topology is unknown [1,4,5,9].
Understanding Complex Networks Using Graph Spectrum BIBAFull-Text 1069-1072
  Yanhua Li; Zhi-Li Zhang
Complex networks are becoming indispensable parts of our lives. The Internet, wireless (cellular) networks, online social networks, and transportation networks are examples of some well-known complex networks around us. These networks generate an immense range of big data: weblogs, social media, Internet traffic, which have increasingly drawn attention from the computer science research community to explore and investigate the fundamental properties of, and improve the user experiences on, these complex networks. This work focuses on understanding complex networks based on the graph spectrum, namely, developing and applying spectral graph theories and models for understanding and employing versatile and oblivious network information -- asymmetrical characteristics of the wireless transmission channels, multiplex social relations, e.g., trust and distrust relations, etc. -- in solving various application problems, such as estimating transmission cost in wireless networks, Internet traffic engineering, and social influence analysis in social networks.
Pivotality of Nodes in Reachability Problems Using Avoidance and Transit Hitting Time Metrics BIBAFull-Text 1073-1078
  Golshan Golnari; Yanhua Li; Zhi-Li Zhang
Reachability is crucial to many network operations in various complex networks. More often than not, however, it is not sufficient simply to know whether a source node s can reach a target node t in the network. Additional information associated with reachability is often needed, such as how long, or in how many possible ways, node s may take to reach node t. In this paper we analyze another piece of important information associated with reachability, which we call pivotality. Pivotality captures how pivotal a role a node k or a subset of nodes S may play in the reachability from node s to node t in a given network. We propose two important metrics, the avoidance and transit hitting times, which extend and generalize the classical notion of hitting times. We show these metrics can be computed from the fundamental matrices associated with the appropriately defined random walk transition probability matrices and prove that the classical hitting time from a source to a target can be decomposed into the avoidance and transit hitting times with respect to any third node. Through simulated and real-world network examples, we demonstrate that these metrics provide a powerful ranking tool for the nodes based on their pivotality in the reachability.
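As a point of reference for the classical notion that this abstract generalizes, the expected hitting time of a random walk to a target can be obtained by solving the linear system (I - Q) h = 1, where Q is the transition matrix restricted to the non-target nodes (the inverse of I - Q is the fundamental matrix). The sketch below is illustrative only: the function name `hitting_times`, the adjacency-list input format, and the simple unweighted random walk are assumptions, and it does not implement the paper's avoidance or transit hitting times.

```python
from fractions import Fraction


def hitting_times(adj, target):
    """Expected number of steps for a simple random walk to first reach `target`.

    adj: dict mapping node -> list of neighbours (hypothetical input format).
    Solves (I - Q) h = 1 exactly, where Q is the transition matrix
    restricted to the non-target nodes. Assumes every node can reach `target`.
    """
    nodes = [v for v in adj if v != target]
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)

    # Build the augmented matrix [I - Q | 1].
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for v in nodes:
        i = index[v]
        rows[i][i] += 1
        rows[i][n] = Fraction(1)
        for w in adj[v]:
            if w != target:
                rows[i][index[w]] -= Fraction(1, len(adj[v]))

    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [x / scale for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]

    result = {v: rows[index[v]][n] for v in nodes}
    result[target] = Fraction(0)
    return result


# Path graph 0 - 1 - 2 with node 2 as the target.
path = {0: [1], 1: [0, 2], 2: [1]}
print(hitting_times(path, 2))
```

Exact fractions are used so that small examples can be checked by hand; for large networks a numerical linear-algebra library would be the more practical choice.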
Rise and Fall of Online Game Groups: Common Findings on Two Different Games BIBAFull-Text 1079-1084
  Ah Reum Kang; Juyong Park; Jina Lee; Huy Kang Kim
Among many types of online games, Massively Multiplayer Online Role Playing Games (MMORPGs) provide players with the most realistic gaming experience inspired by the real, offline world. In particular, much stress is put upon socializing and collaboration with others as a condition for one's success, just as in real life. An advantage of studying MMORPGs is that since all actions are recorded, we can observe phenomena that are hard to observe in real life. For instance, we can observe from the data how the all-important collaboration between people comes into being, evolves, and eventually dies out, and thereby gain valuable insights into group dynamics. In this paper, we analyzed the successes and failures of the online game groups in two different MMORPGs, ArcheAge of XLGames, Inc. and Aion of NCsoft, Inc. We find that there exist factors that influence the dynamics of group growth common to the games regardless of the games' maturity.
Finding Relevant Indian Judgments using Dispersion of Citation Network BIBAFull-Text 1085-1088
  Akshay Minocha; Navjyoti Singh; Arjit Srivastava
We construct a complex citation network of a subset of Indian Constitutional Articles and the legal judgments that invoke them. We describe how this dataset is constructed and introduce the notion of dispersion, taken from network science research on social networks, in the context of legal relevance. Our research shows that dispersion is a decisive structural feature for showing the importance of relevant legal judgments and landmark decisions. Our method provides similarity information about the document in question, which otherwise remains undetected by standard citation metrics.
On Skewed Distributions and Straight Lines: A Case Study on the Wiki Collaboration Network BIBAFull-Text 1089-1094
  Osnat Mokryn; Alexey Reznik
In this paper, we present a hypothesis that power laws are found only in datasets sampled from static data, in which each and every item has gained its maximal importance and is not in the process of changing it during the sampling period. We motivate our hypothesis by examining languages and the word-ranking distribution as it appears in books and in the Bible. To demonstrate the validity of our hypothesis, we experiment with the Wikipedia edit collaboration network. We find that the dataset fits a skewed distribution. Next, we identify its dynamic part. We then show that when the modified part is removed from the obtained dataset, the remaining static part exhibits a good fit to a power law distribution.
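One common way to quantify how well the tail of a dataset follows a power law is to estimate the exponent by maximum likelihood. The sketch below uses the standard continuous estimator alpha = 1 + n / sum(ln(x_i / x_min)); it is not taken from the paper, whose fitting procedure the abstract does not describe, and the function names and the synthetic data generator are assumptions made for illustration.

```python
import math


def powerlaw_alpha_mle(values, x_min):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Uses alpha = 1 + n / sum(ln(x_i / x_min)) over the values >= x_min.
    """
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def synthetic_powerlaw(alpha, x_min, n):
    """Deterministic quantile samples from a power law (illustration only)."""
    return [
        x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0))
        for i in range(n)
    ]


samples = synthetic_powerlaw(alpha=2.5, x_min=1.0, n=10000)
print(powerlaw_alpha_mle(samples, x_min=1.0))  # close to 2.5
```

An exponent estimate alone does not establish that a power law is a good model; a goodness-of-fit test or a comparison against alternative skewed distributions would still be needed.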
Distributed Community Detection with the WCC Metric BIBAFull-Text 1095-1100
  Matthew Saltz; Arnau Prat-Pérez; David Dominguez-Sal
Community detection has become an extremely active area of research in recent years, with researchers proposing various new metrics and algorithms to address the problem. Recently, the Weighted Community Clustering (WCC) metric was proposed as a novel way to judge the quality of a community partitioning based on the distribution of triangles in the graph, and was demonstrated to yield superior results over other commonly used metrics like modularity. The same authors later presented a parallel algorithm for optimizing WCC on large graphs. In this paper, we propose a new distributed, vertex-centric algorithm for community detection using the WCC metric. Results are presented that demonstrate the algorithm's performance and scalability on up to 32 worker machines and real graphs of up to 1.8 billion edges. The algorithm scales best with the largest graphs, finishing in just over an hour for the largest graph, and to our knowledge, it is the first distributed algorithm for optimizing the WCC metric.

SocialNLP 2015

Mining Social and Urban Big Data BIBAFull-Text 1103
  Nicholas Jing Yuan
In recent years, with the rapid development of positioning technologies, online social networks, sensors and smart devices, large-scale human behavioral data have become readily available. The growing availability of such behavioral data provides us with unprecedented opportunities to gain a more in-depth understanding of users in both the physical world and cyber world, especially in online social networks. In this talk, I will introduce our recent research efforts in social and urban mining based on large-scale human behavioral datasets, showcased by two projects: 1) LifeSpec: Modeling the spectrum of urban lifestyles based on heterogeneous online social network data. 2) L2P: Inferring demographic attributes from location check-ins.
Expert-Guided Contrastive Opinion Summarization for Controversial Issues BIBAFull-Text 1105-1110
  Jinlong Guo; Yujie Lu; Tatsunori Mori; Catherine Blake
This paper presents a new model for the task of contrastive opinion summarization (COS), particularly for controversial issues. Traditional COS methods, which mainly rely on sentence similarity measures, are not sufficient for a complex controversial issue. We therefore propose an Expert-Guided Contrastive Opinion Summarization (ECOS) model. Compared to previous methods, our model can (1) integrate expert opinions with ordinary opinions from social media and (2) better align the contrastive arguments under the guidance of expert prior opinion. We create a new data set about a complex social issue with "sufficient" controversy, and experimental results on this data show that the proposed model is effective for (1) producing better argument summaries for understanding a controversial issue and (2) generating contrastive sentence pairs.
ResToRinG CaPitaLiZaTion in #TweeTs BIBAFull-Text 1111-1115
  Kamel Nebhi; Kalina Bontcheva; Genevieve Gorrell
The rapid proliferation of microblogs such as Twitter has resulted in a vast quantity of written text becoming available that contains interesting information for NLP tasks. However, the noise level in tweets is so high that standard NLP tools perform poorly. In this paper, we present a statistical truecaser for tweets using a 3-gram language model built with truecased newswire texts and tweets. Our truecasing method shows an improvement in named entity recognition and part-of-speech tagging tasks.
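To make the truecasing task concrete, the following is a deliberately simplified unigram baseline: each token is restored to the casing it most often had in truecased training text. It is not the 3-gram language model described in the abstract, and the function names and toy training sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Count surface forms per lowercased token in truecased training text."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            forms[token.lower()][token] += 1
    return forms


def truecase(text, forms):
    """Replace each token with its most frequent training-time casing.

    Tokens never seen in training are left unchanged.
    """
    restored = []
    for token in text.split():
        counts = forms.get(token.lower())
        restored.append(counts.most_common(1)[0][0] if counts else token)
    return " ".join(restored)


forms = train_truecaser(["Paris is in France", "France borders Spain"])
print(truecase("pAriS IS in fRaNce", forms))
```

A unigram model cannot use context, for example to capitalize a sentence-initial word or to distinguish a common noun from a brand name, which is the kind of information an n-gram language model adds.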
Supervised Prediction of Social Network Links Using Implicit Sources of Information BIBAFull-Text 1117-1122
  Ervin Tasnádi; Gábor Berend
In this paper, we introduce a supervised machine learning framework for the link prediction problem. The social network we conducted our empirical evaluation on originates from the restaurant review portal, yelp.com. The proposed framework not only uses the structure of the social network to predict non-existing edges in it, but also makes use of further graphs that were constructed based on implicit information provided in the dataset. The implicit information we relied on includes the language use of the members of the social network and their ratings with respect to the businesses they reviewed. Here, we also investigate the possibility of building supervised learning models to predict social links without relying on features derived from the structure of the social network itself, but based on such implicit information alone. Our empirical results not only revealed that the features derived from different sources of implicit information can be useful on their own, but also that incorporating them in a unified framework has the potential to improve classification results, as the different sources of implicit information can provide independent and useful views about the connectedness of users.
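Supervised link prediction needs a feature vector for each candidate pair of users. A typical structural feature is the Jaccard coefficient of the two users' neighbour sets, sketched below; the same computation could be applied to any of the additional graphs built from implicit information. The abstract does not list the features the authors used, so this choice, the function name, and the toy graph are assumptions.

```python
def jaccard_neighbor_score(adj, u, v):
    """Jaccard coefficient of the neighbour sets of u and v.

    adj: dict mapping node -> set of neighbours (hypothetical input format).
    Returns 0.0 when neither node has any neighbours.
    """
    neighbors_u = adj.get(u, set())
    neighbors_v = adj.get(v, set())
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


# Toy friendship graph: "a" and "d" are not yet connected.
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(jaccard_neighbor_score(graph, "a", "d"))
```

In a supervised setting, scores like this one would be computed for both connected and unconnected pairs and passed, together with other features, to a classifier.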

SOCM 2015

An Explorative Approach for Crowdsourcing Tasks Design BIBAFull-Text 1125-1130
  Marco Brambilla; Stefano Ceri; Andrea Mauri; Riccardo Volonterio
Crowdsourcing applications are becoming widespread; they cover very different scenarios, including opinion mining, multimedia data annotation, localised information gathering, marketing campaigns, expert response gathering, and so on. The quality of the outcome of these applications depends on different design parameters and constraints, and it is very hard to judge their combined effects without doing some experiments; on the other hand, there are no experiences or guidelines that tell how to conduct experiments, and thus these are often conducted in an ad-hoc manner, typically through adjustments of an initial strategy that may converge to a parameter setting which is quite different from the best possible one. In this paper we propose a comparative, explorative approach for designing crowdsourcing tasks. The method consists of defining a representative set of execution strategies, executing them on a small dataset, collecting quality measures for each candidate strategy, and finally deciding which strategy to use with the complete dataset.
Towards Government as a Social Machine BIBAFull-Text 1131-1136
  Vanilson Burégio; Kellyton Brito; Nelson Rosa; Misael Neto; Vinícius Garcia; Silvio Meira
Government initiatives to open data to the public are becoming increasingly popular every day. The vast amount of data made available by government organizations yields interesting opportunities and challenges -- both socially and technically. In this paper, we propose a social machine-oriented architecture as a way to extend the power of open data and create the basis to derive government as a social machine (Gov-SM). The proposed Gov-SM combines principles from existing architectural patterns and provides a platform of specialized APIs to enable the creation of several other social-technical systems on top of it. Based on some implementation experiences, we believe that deriving government as a social machine can, in more than one sense, collaborate to fully integrate users, developers and crowd in order to participate in and solve a multitude of governmental issues and policy.
When Resources Collide: Towards a Theory of Coincidence in Information Spaces BIBAFull-Text 1137-1142
  Markus Luczak-Roesch; Ramine Tinati; Nigel Shadbolt
This paper is an attempt to lay out foundations for a general theory of coincidence in information spaces such as the World Wide Web, expanding on existing work on bursty structures in document streams and information cascades. We elaborate on the hypothesis that every resource that is published in an information space enters a temporary interaction with another resource once a unique explicit or implicit reference between the two is found. This thought is motivated by Erwin Schrödinger's notion of entanglement between quantum systems. We present a generic information cascade model that exploits only the temporal order of information sharing activities, combined with inherent properties of the shared information resources. The approach was applied to data from the world's largest online citizen science platform, Zooniverse, and we report on the findings of this case study.
On Wayfaring in Social Machines BIBAFull-Text 1143-1148
  Dave Murray-Rust; Segolene Tarte; Mark Hartswood; Owen Green
In this paper, we concern ourselves with the ways in which humans inhabit social machines: the structures and techniques which allow the enmeshing of multiple life traces within the flow of online interaction. In particular, we explore the distinction between transport and journeying, between networks and meshworks, and the different attitudes and modes of being appropriate to each. By doing this, we hope to capture a part of the sociality of social machines, to build an understanding of the ways in which lived lives relate to digital structures, and the emergence of the communality of shared work. In order to illustrate these ideas, we look at several aspects of existing social machines, and tease apart the qualities which relate to the different modes of being. The distinctions and concepts outlined here provide another element in both the analysis and development of social machines, understanding how people may joyfully and directedly engage with collective activities on the web.
A Streaming Real-Time Web Observatory Architecture for Monitoring the Health of Social Machines BIBAFull-Text 1149-1154
  Ramine Tinati; Xin Wang; Ian Brown; Thanassis Tiropanis; Wendy Hall
Over the past years, streaming Web services have become popular, with many of the top Web platforms now offering near real-time streams of user and machine activity. In light of this, Web Observatories now are faced with the challenge of being able to process and republish real-time, big data, Web streams, whilst maintaining access control and data consistency. In this paper we describe the architecture used in the Southampton Web Observatory to harvest, process, and serve real-time Web streams.
Social Personal Data Stores: the Nuclei of Decentralised Social Machines BIBAFull-Text 1155-1160
  Max Van Kleek; Daniel A. Smith; Dave Murray-Rust; Amy Guy; Kieron O'Hara; Laura Dragan; Nigel R. Shadbolt
Personal Data Stores are among the many efforts that are currently underway to try to re-decentralise the Web, and to bring more data management and storage capability under the control of the user. Few of these architectures, however, have considered the needs of supporting decentralised social software from the user's perspective. In this short paper, we present the results of our design exercise, focusing on two key design needs for building decentralised social machines: that of supporting heterogeneous social apps and multiple, separable user identities. We then present the technical design of a prototype social machine platform, INDX, which realises both of these requirements, and a prototype heterogeneous microblogging application which demonstrates its capabilities.
Revisiting the Three Rs of Social Machines: Reflexivity, Recognition and Responsivity BIBAFull-Text 1161-1166
  Jeff Vass; Jo E. Munson
This paper sets out an approach to Social Machines (SMs), their description and analysis, based on a development of social constructionist theoretical principles adapted for Web Science. We argue that currently the search for the primitives of SMs, or appropriate units of analysis to describe them, tends to favour either the technology or sociality. We suggest an approach that favours distributed agency whether it is machinic or human or both. We argue that current thinking (e.g. Actor Network Theory) is unsuited to SMs. Instead we describe an alternative which prioritizes a view of socio-technical activity as forming 'reflexive project structures'. We show that reflexivity in social systems can be further usefully divided into more fundamental elements (Recognition and Responsivity). This process enables us to capture more of the variation in SMs and to distinguish them from non-Web based socio-technical systems. We illustrate the approach by looking at different kinds of SMs showing how they relate to contemporary social theory.

SWDM 2015

Towards Next-Generation Software Infrastructure for Crisis Informatics Research BIBAFull-Text 1169
  Kenneth M. Anderson
Crisis Informatics is a multidisciplinary research area that examines the socio-technical relationships among people, information, and technology during mass emergency events. One area of crisis informatics examines the on-line behaviors of members of the public making use of social media during a crisis event to make sense of it, to report on it, and, in some cases, to coordinate a response to it either locally or from afar. In order to study those behaviors, this social media data has to be systematically captured and stored in a scalable and reliable way for later analysis. Project EPIC is a large U.S. National Science Foundation funded project that has been performing crisis informatics research since Fall 2009 and has been designing and developing a reliable and robust software infrastructure for the storage and analysis of large crisis informatics data sets. Prof. Ken Anderson has led the research and development in this software engineering effort and will discuss the challenges (both technical and social) that Project EPIC faced in developing its software infrastructure, known as EPIC Collect and EPIC Analyze. EPIC Collect has been in 24/7 operation in various forms since Spring 2010 and has collected terabytes of social media data across hundreds of mass emergency events since that time. EPIC Analyze is a data analysis platform for large social media data sets that provides efficient browsing, filtering, and collaborative annotation services. Prof. Anderson will discuss these systems and also present the challenges of collecting and analyzing social media data (with an emphasis on Twitter data) at scale. Project EPIC has designed and evaluated software architectural styles that can be adopted by other research groups to help develop their own capacity to work in this space. Prof. Anderson will conclude the talk with a vision for future work in this area: What's next for crisis informatics software infrastructure?
Social Media for Cold Management BIBAFull-Text 1171
  Daniele Quercia
Social media has been increasingly used to manage emergencies (hot management), yet it is still unclear to what extent its use is truly beneficial for managing calm situations (cold management). Take the socioeconomic deprivation of cities. Measuring it in an accurate and timely fashion has become a priority for governments around the world. Traditionally, deprivation indexes have been derived from census data, which is, however, very expensive to obtain and thus acquired only every few years. In recent years, we have proposed alternative computational methods to automatically extract proxies of deprivation at a fine spatio-temporal level of granularity. More specifically, we have proposed new ways of: a) mining deprivation at a fine level of spatio-temporal granularity [Venerandi15]; b) profiling the functional and temporal uses of cities [ruiz15taxonomy]; and c) determining which streets are safe from crime and which are walkable [Quercia2015digital]. All this only requires access to freely available user-generated content (on, e.g., Foursquare, Open Street Map, Flickr), and, as such, is complementary to the use of expensive proprietary data and outdated governmental data.
Classification Method for Shared Information on Twitter Without Text Data BIBAFull-Text 1173-1178
  Seigo Baba; Fujio Toriumi; Takeshi Sakaki; Kosuke Shinoda; Satoshi Kurihara; Kazuhiro Kazama; Itsuki Noda
During a disaster, appropriate information must be collected. For example, victims and survivors require information about shelter locations and dangerous points or advice about protecting themselves. Rescuers need information about the details of volunteer activities and supplies, especially potential shortages. However, collecting such localized information is difficult from such mass media as TV and newspapers because they generally focus on information aimed at the general public. On the other hand, social media can attract more attention than mass media under these circumstances since they can provide such localized information. In this paper, we focus on Twitter, one of the most influential social media, as a source of local information. By assuming that users who retweet the same tweet are interested in the same topic, we can classify tweets that are required by users with similar interests based on retweets. Thus, we propose a novel tweet classification method that focuses on retweets without text mining. We linked tweets based on retweets to make a retweet network that connects similar tweets and extracted clusters that contain similar tweets from the constructed network by our clustering method. We also subjectively verified the validity of our proposed classification method. Our experiment verified that the ratio of the clusters whose tweets are mutually similar in the cluster to all clusters is very high and the similarities in each cluster are obvious. Finally, we calculated the linguistic similarities of the results to clarify our proposed method's features. Our method classified topic-similar tweets, even if they are not linguistically similar.
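The core idea above, linking tweets through the users who retweeted them and then extracting clusters from the resulting network, can be sketched without any text processing. In the sketch below two tweets are linked when they share at least `min_shared` retweeting users, and clusters are taken to be connected components; the abstract does not specify the authors' linking threshold or clustering method, so both choices, along with the function name and input format, are assumptions.

```python
from itertools import combinations


def retweet_clusters(retweeters, min_shared=1):
    """Group tweets linked by shared retweeting users.

    retweeters: dict mapping tweet id -> set of user ids who retweeted it.
    Two tweets are linked when at least `min_shared` users retweeted both;
    clusters are the connected components of the resulting network.
    """
    parent = {tweet: tweet for tweet in retweeters}

    def find(tweet):
        # Union-find lookup with path halving.
        while parent[tweet] != tweet:
            parent[tweet] = parent[parent[tweet]]
            tweet = parent[tweet]
        return tweet

    for a, b in combinations(retweeters, 2):
        if len(retweeters[a] & retweeters[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for tweet in retweeters:
        clusters.setdefault(find(tweet), set()).add(tweet)
    return sorted(sorted(members) for members in clusters.values())


data = {
    "t1": {"u1", "u2"},
    "t2": {"u2", "u3"},
    "t3": {"u9"},
    "t4": {"u3"},
}
print(retweet_clusters(data))
```

The pairwise comparison is quadratic in the number of tweets, which is acceptable for a sketch; a real retweet stream would call for an inverted index from users to tweets.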
Trust-building through Social Media Communications in Disaster Management BIBAFull-Text 1179-1184
  Maria Grazia Busa; Maria Teresa Musacchio; Shane Finan; Cilian Fennell
Social media provides a digital space -- a meeting place -- for different people, often representing one or more groups in a society. The use of this space during a disaster, especially where information needs are high and factually accurate and ethically sourced data is scarce, has increased substantially over the last 5-10 years. This paper attempts to address communication in social media and trust between the public and figures of authority during a natural disaster in order to suggest communication strategies that can enhance or reinforce trust between these bodies before, during and after a natural disaster.
Sentiment Analysis on Microblogs for Natural Disasters Management: a Study on the 2014 Genoa Floodings BIBAFull-Text 1185-1188
  Davide Buscaldi; Irazú Hernandez-Farias
People use social networks for different communication purposes, for example to share their opinion on ongoing events. One way to exploit this common knowledge is by using Sentiment Analysis (SA) and Natural Language Processing in order to extract useful information. In this paper we present an SA approach applied to a set of tweets related to a recent natural disaster in Italy; our goal is to identify tweets that may provide useful information from a disaster management perspective.
Identifying Relevant Messages in a Twitter-based Citizen Channel for Natural Disaster Situations BIBAFull-Text 1189-1194
  Alfredo Cobo; Denis Parra; Jaime Navón
In recent years, online social networks (in particular Twitter) have become an important alternative information channel to traditional media during natural disasters, but the amount and diversity of messages poses the challenge of information overload to end users. The goal of our research is to develop an automatic classifier of tweets to feed a mobile application that reduces the difficulties that citizens face in getting relevant information during natural disasters. In this paper, we present in detail the process of building a classifier that separates tweets relevant to an earthquake from non-relevant ones. By using a dataset from the Chilean earthquake of 2010, we first build and validate a ground truth, and then we contribute by presenting in detail the effect of class imbalance and dimensionality reduction over 5 classifiers. We show how the performance of these models is affected by these variables, providing important considerations for building these systems.
A Linguistically-driven Approach to Cross-Event Damage Assessment of Natural Disasters from Social Media Messages BIBAFull-Text 1195-1200
  Stefano Cresci; Maurizio Tesconi; Andrea Cimino; Felice Dell'Orletta
This work focuses on the analysis of Italian social media messages for disaster management and aims at the detection of messages carrying critical information for the damage assessment task. A main novelty of this study consists in the focus on out-of-domain and cross-event damage detection, and on the investigation of the most relevant tweet-derived features for these tasks. We devised different experiments by resorting to a wide set of linguistic features qualifying the lexical and grammatical structure of a text as well as ad-hoc features specifically implemented for this task. We investigated which features are most effective in achieving the best results. A further result of this study is the construction of the first manually annotated Italian corpus of social media messages for damage assessment.
Disentangling the Lexicons of Disaster Response in Twitter BIBAFull-Text 1201-1204
  Nathan O. Hodas; Greg Ver Steeg; Joshua Harrison; Satish Chikkagoudar; Eric Bell; Courtney D. Corley
People around the world use social media platforms such as Twitter to express their opinion and share activities about various aspects of daily life. In the same way social media changes communication in daily life, it also is transforming the way individuals communicate during disasters and emergencies. Because emergency officials have come to rely on social media to communicate alerts and updates, they must learn how users communicate disaster related content on social media. We used a novel information-theoretic unsupervised learning tool, CorEx, to extract and characterize highly relevant content used by the public on Twitter during known emergencies, such as fires, explosions, and hurricanes. Using the resulting analysis, authorities may be able to score social media content and prioritize their attention toward those messages most likely to be related to the disaster.
Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams BIBAFull-Text 1205-1210
  Muhammad Imran; Carlos Castillo
While categorizing any type of user-generated content online is a challenging problem, categorizing social media messages during a crisis situation adds an additional layer of complexity, due to the volume and variability of information, and to the fact that these messages must be classified as soon as they arrive. Current approaches involve the use of automatic classification, human classification, or a mixture of both. In these types of approaches, there are several reasons to keep the number of information categories small and updated, which we examine in this article. This means at the onset of a crisis an expert must select a handful of information categories into which information will be categorized. The next step, as the crisis unfolds, is to dynamically change the initial set as new information is posted online. In this paper, we propose an effective way to dynamically extract emerging, potentially interesting, new categories from social media data.
Visualizing Social Media Sentiment in Disaster Scenarios BIBAFull-Text 1211-1215
  Yafeng Lu; Xia Hu; Feng Wang; Shamanth Kumar; Huan Liu; Ross Maciejewski
Recently, social media, such as Twitter, has been successfully used as a proxy to gauge the impacts of disasters in real time. However, most previous analyses of social media during disaster response focus on the magnitude and location of social media discussion. In this work, we explore the impact that disasters have on the underlying sentiment of social media streams. During disasters, some people may express negative sentiment when discussing lives lost and property damage, while others may offer encouraging responses to inspire and spread hope. Our goal is to explore the underlying trends in positive and negative sentiment with respect to disasters and geographically related sentiment. In this paper, we propose a novel visual analytics framework for sentiment visualization of geo-located Twitter data. The proposed framework consists of two components, sentiment modeling and geographic visualization. In particular, we provide an entropy-based metric to model sentiment contained in social media data. The extracted sentiment is further integrated into a visualization framework to explore the uncertainty of public opinion. We explored an Ebola Twitter dataset to show how visual analytics techniques and sentiment modeling can reveal interesting patterns in disaster scenarios.
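One plausible reading of an entropy-based sentiment metric is the Shannon entropy of the sentiment labels observed in a region or time window: low entropy means opinion is nearly unanimous, high entropy means it is divided. The sketch below computes that quantity; the abstract does not give the authors' exact definition, so the formula, the function name, and the label values are assumptions.

```python
import math
from collections import Counter


def sentiment_entropy(labels):
    """Shannon entropy (in bits) of a collection of sentiment labels.

    Returns 0.0 for an empty collection or when all labels agree.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Tweets from one hypothetical region, already labelled by a sentiment model.
region_labels = ["negative", "positive", "negative", "positive"]
print(sentiment_entropy(region_labels))
```

Mapping this value per region would highlight where public opinion is most uncertain, which is the kind of pattern the visualization component described above is meant to expose.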
SUPER: Towards the use of Social Sensors for Security Assessments and Proactive Management of Emergencies BIBAFull-Text 1217-1220
  Richard McCreadie; Karolin Kappler; Andreas Kaltenbrunner; Magdalini Kardara; Craig Macdonald; John Soldatos; Iadh Ounis
Social media statistics during recent disasters (e.g. the 20 million tweets relating to 'Sandy' storm and the sharing of related photos in Instagram at a rate of 10/sec) suggest that the understanding and management of real-world events by civil protection and law enforcement agencies could benefit from the effective blending of social media information into their resilience processes. In this paper, we argue that despite the widespread use of social media in various domains (e.g. marketing/branding/finance), there is still no easy, standardized and effective way to leverage different social media streams -- also referred to as social sensors -- in security/emergency management applications. We also describe the EU FP7 project SUPER (Social sensors for secUrity assessments and Proactive EmeRgencies management), started in 2014, which aims to tackle this technology gap.
Thematically Analysing Social Network Content During Disasters Through the Lens of the Disaster Management Lifecycle BIBAFull-Text 1221-1226
  Sophie Parsons; Peter M. Atkinson; Elena Simperl; Mark Weal
Social Networks such as Twitter are often used for disseminating and collecting information during natural disasters. The potential for their use in Disaster Management has been acknowledged. However, a more nuanced understanding of the communications that take place on social networks is required to more effectively integrate this information into the processes within disaster management. The type and value of information shared should be assessed, determining the benefits and issues, with credibility and reliability as known concerns. Mapping the tweets in relation to the modelled stages of a disaster can be a useful evaluation for determining the benefits/drawbacks of using data from social networks, such as Twitter, in disaster management. A thematic analysis of tweets' content, language and tone during the UK Storms and Floods 2013/14 was conducted. Manual scripting was used to determine the official sequence of events, and classify the stages of the disaster into the phases of the Disaster Management Lifecycle, to produce a timeline. Twenty-five topics discussed on Twitter emerged, and three key types of tweets, based on the language and tone, were identified. The timeline represents the events of the disaster, according to the Met Office reports, classed into B. Faulkner's Disaster Management Lifecycle framework. Context is provided when observing the analysed tweets against the timeline. This illustrates a potential basis and benefit for mapping tweets into the Disaster Management Lifecycle phases. Comparing the number of tweets submitted in each month with the timeline suggests users tweet more as an event heightens and persists. Furthermore, users generally express greater emotion and urgency in their tweets.
   This paper concludes that the thematic analysis of content on social networks, such as Twitter, can be useful in gaining additional perspectives for disaster management. It demonstrates that mapping tweets into the phases of a Disaster Management Lifecycle model can have benefits in the recovery phase, not just in the response phase, to potentially improve future policies and activities.
D-Sieve: A Novel Data Processing Engine for Efficient Handling of Crises-Related Social Messages BIBAFull-Text 1227-1232
  Soudip Roy Chowdhury; Hemant Purohit; Muhammad Imran
Existing literature demonstrates the usefulness of system-mediated algorithms, such as supervised machine learning, for detecting classes of messages in the social-data stream (e.g., topically relevant vs. irrelevant). The classification accuracies of these algorithms largely depend upon the size of labeled samples that are provided during the learning phase. Other factors, such as class distribution and term distribution among the training set, also play an important role in a classifier's accuracy. However, for several reasons (money/time constraints, limited number of skilled labelers, etc.), a large sample of labeled messages is often not available immediately for learning an efficient classification model. Consequently, a classifier trained on a poor model often misclassifies data and hence the applicability of such learning techniques (especially for the online setting) during ongoing crisis response remains limited. In this paper, we propose a post-classification processing step leveraging two additional content features, stable hashtag association and stable named entity association, to improve the classification accuracy for a classifier in realistic settings. We have tested our algorithms on two crisis datasets from Twitter (Hurricane Sandy 2012 and Queensland Floods 2013), and compared our results against the results produced by a "best-in-class" baseline online classifier. By showing consistently better-quality results than the baseline algorithm, i.e., by correctly classifying the misclassified data points from the prior step (false negative and false positive to true positive and true negative classes, respectively), we demonstrate the applicability of our approach in practice.
Twitter Floods when it Rains: A Case Study of the UK Floods in early 2014 BIBAFull-Text 1233-1238
  Antonia Saravanou; George Valkanas; Dimitrios Gunopulos; Gennady Andrienko
Twitter is one of the most prominent social media platforms nowadays. A primary reason that has brought the medium into the spotlight of academic attention is its real-time nature, with people constantly uploading information regarding their surroundings. This trait, coupled with the service's data access policy for researchers and developers, has allowed the community to explore Twitter's potential as a news reporting tool. Finding out promptly about newsworthy events can prove extremely useful in crisis management situations. In this paper, we explore the use of Twitter as a mechanism used in disaster relief, and consequently in public safety. In particular, we perform a case study on the floods that occurred in the United Kingdom during January 2014, and how these were reflected on Twitter, according to tweets (i.e., posts) submitted by the users. We present a systematic algorithmic analysis of tweets collected with respect to our use case scenario, supplemented by visual analytic tools. Our objective is to identify meaningful and effective ways to take advantage of the wealth of Twitter data in crisis management, and we report on the findings of our analysis.
Combining Automatic and Manual Approaches: Towards a Framework for Discovering Themes in Disaster-related Tweets BIBAFull-Text 1239-1244
  Leif Romeritch Syliongka; Alron Jan Lam; Cheryll Ruth Soriano; Ma. Divina Gracia Roldan; Francisco Magno; Charibeth Cheng; Nathaniel Oco
In this paper, we present a framework that combines automatic and manual approaches to discover themes in disaster-related tweets. As a case study, we decided to focus on tweets related to typhoon Haiyan, which caused billions of dollars in damages. We collected tweets from November 2013 to March 2014 and used the local typhoon name "Yolanda" as the filter. Data association was used to expand the tweet set and k-means clustering was then applied. Clusters with a high number of instances were subjected to open coding for labeling. The Silhouette indices ranged from 0.27 to 0.50. Analyses reveal that the use of an automated Natural Language Processing (NLP) approach has the potential to deal with huge volumes of tweets by clustering frequently occurring words and phrases. This complements the manual approach to surface themes from a more manageable pool of tweets, allowing for a more nuanced analysis of tweets by a human expert. As an application, the themes identified during open coding were used as labels to train a classifier system. Future work could explore using topic models and focusing on specific content or issues, such as natural calamities and citizens' participation in addressing these.
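For readers unfamiliar with the Silhouette index quoted in the abstract above, the following is a minimal pure-Python sketch of how it is conventionally computed. It is not the authors' code; the function names, the Euclidean distance, and the toy points and labels are all assumptions made for illustration.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette coefficients (Euclidean distance).

    Assumes at least two distinct cluster labels are present.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

def mean_silhouette(points, labels):
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1]    # splits each pair across clusters
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.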
The Case for Readability of Crisis Communications in Social Media BIBAFull-Text 1245-1250
  Irina Temnikova; Sarah Vieweg; Carlos Castillo
The readability of text documents has been studied from a linguistic perspective long before people began to regularly communicate via Internet technologies. Typically, such studies look at books or articles containing many paragraphs and pages. However, the readability of short messages comprising a few sentences, common on today's social networking sites and microblogging services, has received less attention from researchers working on "readability". Emergency management specialists, crisis response practitioners, and scholars have long recognized that clear communication is essential during crises. To the best of our knowledge, the work we present here is the first to study the readability of crisis communications posted on Twitter by governments, non-governmental organizations, and mainstream media. The data we analyze comprises hundreds of tweets posted during 15 different crises in English-speaking countries, which happened between 2012 and 2013. We describe factors which negatively affect comprehension, and consider how understanding can be improved.
   Based on our analysis and observations, we conclude with several recommendations for how to write brief crisis messages on social media that are clear and easy to understand.

TargetAd 2015

Large-scale Contextual Query-to-Ad Matching and Retrieval System for Sponsored Search: (Abstract) BIBAFull-Text 1253
  Ricardo Baeza-Yates; Nemanja Djuric; Mihajlo Grbovic; Vladan Radosavljevic; Fabrizio Silvestri
Semantic embeddings of words (or objects in general) into a vector space have proven to be a powerful tool in many applications. In this talk we are going to show one possible application of semantic embeddings to sponsored search. Sponsored search represents the major source of revenue for web search engines and it is based on the following mechanism: each advertiser maintains a list of keywords they deem of interest with regard to their business. According to this targeting model, when a query is issued, all advertisers with a matching keyword are entered into an auction according to the amount they bid for the query, and the winner gets to show their ad, usually paying the next largest bid (this is called second price). The main challenge is that a query may not match many keywords, resulting in lower auction value, lower ad quality, and lost revenue for both advertisers and publishers. We address this by applying semantic embeddings to the problem, learning how to project queries and ads into a common embedding so that they share the same feature space. The major novelty of the techniques we show is that learning is done by jointly modeling their content (words in queries and ad metadata), as well as their context within a search session. This model has several advantages and can be applied to at least three tasks. First, it can be used to generate query rewrites with a specific bias towards rewrites able to match relevant advertising. Second, it can be used to retrieve, for a given query, a set of relevant ads to be sent to the auction phase. Third, given an ad, we are able to retrieve all the queries for which that ad can be considered relevant.
   The major advantage of learning both content and context embeddings is that a context-based model may suffer from coverage issues: if a query or an ad does not appear in the training set, it cannot be treated by the model. Content-based embeddings, by contrast, can also be used to build models capturing similarities between content; e.g., for a query not appearing in the model built, we may capture some of its sub-queries by using content vectors. Another very interesting characteristic of this method is that all the tasks mentioned above are basically solved by means of a simple K-nearest neighbor search over the set of vectors in the embedding. The method has been trained on up to 12 billion sessions, one of the largest corpora reported so far. We report offline and online experimental results, as well as post-deployment metrics. The results show that this approach significantly outperforms the existing state of the art, substantially improving a number of key business metrics.
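The K-nearest neighbor step mentioned in the abstract above can be pictured with a brute-force sketch. The embedding vectors, ad names, and the use of cosine similarity below are assumptions chosen for illustration; the abstract does not specify the similarity measure or the search structure used in the deployed system.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, candidates, k):
    """Return the names of the k candidates most similar to query_vec, best first."""
    return heapq.nlargest(
        k, candidates, key=lambda name: cosine(query_vec, candidates[name])
    )

# Hypothetical 3-dimensional embeddings in a shared query/ad space.
ads = {
    "running shoes ad": (0.9, 0.1, 0.0),
    "hiking boots ad": (0.7, 0.3, 0.1),
    "car insurance ad": (0.0, 0.2, 0.9),
}
query = (0.8, 0.2, 0.0)  # stands in for a query such as "buy sneakers"

top_ads = nearest(query, ads, 2)
```

Because queries and ads live in the same space, the same function covers the other two tasks by swapping roles: pass an ad vector with a dictionary of query vectors to find queries for an ad, or a query vector with a dictionary of query vectors to find rewrites. At billions of vectors an approximate index would replace this linear scan.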
Ads Selection At Twitter BIBAFull-Text 1255
  Yue Lu; Sandeep Pandey
Online advertising is a multi-billion dollar industry and it also serves as the major revenue source for Twitter Inc. In this talk, we present the ads selection pipeline at Twitter, using Promoted Tweets in Home Timelines as an example. The pipeline starts from targeting, where we model Twitter users' attributes offline, e.g., gender, age, and interests, so that we can match them with advertisers' specified audience criteria. The second critical component is user engagement rate prediction, where we employ a large-scale online learning system to do real-time training and prediction with rich features. Lastly, we run a second price auction based on the predictions, advertisers' bids and some other optimization parameters. We will present a series of case studies drawn from recent experiments in the setting of the deployed system used at Twitter.
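A second price auction that takes predicted engagement into account is often formulated as below. This is a textbook-style sketch under stated assumptions (ranking by bid times predicted rate, no reserve price, no additional optimization parameters), not a description of Twitter's production system; all names and numbers are hypothetical.

```python
def run_auction(candidates):
    """Pick a winner among (ad_id, bid, predicted_rate) tuples.

    Ads are ranked by bid * predicted_rate. The winner pays the smallest
    bid that would still have kept it in first place, i.e. the runner-up's
    score divided by the winner's predicted rate.
    Returns (winner_id, price), or None when no ad is eligible.
    """
    eligible = [c for c in candidates if c[1] > 0 and c[2] > 0]
    if not eligible:
        return None
    ranked = sorted(eligible, key=lambda c: c[1] * c[2], reverse=True)
    winner_id, _winner_bid, winner_rate = ranked[0]
    if len(ranked) == 1:
        # No competitor: a real system would charge a reserve price here.
        return winner_id, 0.0
    runner_up_score = ranked[1][1] * ranked[1][2]
    return winner_id, runner_up_score / winner_rate

# Hypothetical candidates: (ad id, bid per engagement, predicted engagement rate)
candidates = [("a", 2.0, 0.05), ("b", 3.0, 0.02), ("c", 1.0, 0.04)]
result = run_auction(candidates)
```

In this example ad "a" wins despite "b" bidding more, because its expected value per impression is higher, and it pays less than its own bid.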
Serving Ads to "Yahoo Answers" Occasional Visitors BIBAFull-Text 1257-1262
  Michal Aharon; Amit Kagian; Yohay Kaplan; Raz Nissim; Oren Somekh
Modern ad serving systems can benefit when allowed to accumulate user information and use it as part of the serving algorithm. However, this often does not coincide with how the web is used. Many domains will see users for only brief interactions, as users enter a domain through a search result or social media link and then leave. Having access to little or no user information and no ability to assemble a user profile over a prolonged period of use, we would still like to leverage the information we have to the best of our ability. In this paper we attempt several methods of improving ad serving for occasional users, including leveraging user information that is still available, content analysis of the page, information about the page's content generators and historical breakdown of visits to the page. We compare and combine these methods in a framework of a collaborative filtering algorithm, test them on real data collected from Yahoo Answers, and achieve significant improvements over baseline algorithms.
Fast and Accurate Maximum Inner Product Recommendations on Map-Reduce BIBAFull-Text 1263-1268
  Rob Hall; Josh Attenberg
Personalization has become a predominant theme in online advertising; the internet allows advertisers to target only those users with the greatest chances of engagement, maximizing the probability of success and user happiness. However, the cost of a naïve approach to matching users with their most suitable content scales proportionally to the product of the cardinalities of the user and content sets. For advertisers with large portfolios, this quickly becomes intractable. In this work, we address this more general top-k personalization problem, giving a scalable method to produce recommendations based on personalization models where the affinity between a user and an item is captured by an inner product (i.e., most matrix factorization models). We first transform the problem into finding the k-nearest neighbors among the items for each user, then approximate the solution via a method which is particularly suited for use on a map-reduce cluster. We empirically show that our method is between 1 and 2 orders of magnitude faster than previous work, while maintaining excellent approximation quality. Additionally, we provide an open-source implementation of our proposed method; this implementation is used in production at Etsy for a number of large scale personalization systems, and is the same code as used in the experiments below.
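The abstract above does not say which transformation turns the maximum inner product problem into a nearest neighbor problem. One well-known reduction, shown here purely as an illustrative assumption, appends one extra coordinate to every item so that all items have the same norm; Euclidean nearest neighbors of the (zero-padded) user vector are then exactly the items with the largest inner product. Item names and vectors are hypothetical.

```python
import math

def augment_items(items):
    """Append sqrt(M^2 - |x|^2) to each item vector x, where M is the largest item norm."""
    max_norm_sq = max(sum(x * x for x in v) for v in items.values())
    return {
        name: tuple(v) + (math.sqrt(max(0.0, max_norm_sq - sum(x * x for x in v))),)
        for name, v in items.items()
    }

def top_k_by_distance(user, augmented, k):
    """k nearest augmented items to the user vector padded with a zero."""
    u = tuple(user) + (0.0,)
    return sorted(augmented, key=lambda name: math.dist(u, augmented[name]))[:k]

def top_k_by_inner_product(user, items, k):
    """Brute-force reference: k items with the largest inner product."""
    return sorted(
        items, key=lambda name: -sum(a * b for a, b in zip(user, items[name]))
    )[:k]

# Hypothetical item factors and one user factor.
items = {"mug": (1.0, 0.0), "scarf": (0.5, 2.0), "print": (3.0, 1.0)}
user = (1.0, 1.0)

via_distance = top_k_by_distance(user, augment_items(items), 2)
via_inner_product = top_k_by_inner_product(user, items, 2)
```

The reduction works because the squared distance between the padded user u and an augmented item x equals |u|^2 + M^2 - 2(u . x), which decreases as the inner product grows. Without the extra coordinate, the nearest item to this user would be "mug", which has the smallest inner product of the three.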
Targeted Content for a Real-Time Activity Feed: For First Time Visitors to Power Users BIBAFull-Text 1269-1274
  Diane Hu; Tristan Schneiter
The Activity Feed is Etsy's take on the ubiquitous "web feed" -- a continuous stream of aggregated content, personalized for each user. These streams have become the de facto means of serving advertisements in the context of social media. Any visitor to Facebook or Twitter has seen advertisements placed on their web feed. For Etsy, an online marketplace for handmade and vintage goods with over 29 million unique items, the AF makes the marketplace feel a bit smaller for users. It enables discovery of relevant content, including activities from their social graph, recommended shops and items, and new listings from favorite shops. At the same time, Etsy's AF provides a platform for presenting users with targeted content, as well as advertisements, served alongside relevant and timely content.
   One of the biggest challenges for building such a feed is providing an engaging experience for all users across Etsy. Some users are first-time visitors who may find Etsy to be overwhelming. Others are long-time power users who already know what they're looking for and how to find it. In this work, we describe solutions to the challenges encountered while delivering targeted content to our tens of millions of users. We also cover our means of adapting to each user's actions, evolving our targeted content offerings as the user's familiarity with Etsy grows. Finally, we discuss the impact of our system through extensive experimentation on live traffic, and show how these improvements have led to increased user engagement.
Subjective Similarity: Personalizing Alternative Item Recommendations BIBAFull-Text 1275-1279
  Tolga Könik; Rajyashree Mukherjee; Jayasimha Katukuri
We present a new algorithm for recommending alternatives to a given item in an e-commerce setting. Our algorithm is an incremental improvement over an earlier system, which recommends similar items by first assigning the input item to clusters and then selecting best quality items within those clusters. The original algorithm does not consider the recent context and our new algorithm improves the earlier system by personalizing the recommendations to user intentions. The system measures user intention using the recent queries, which are used to determine the level of abstraction in similarity and relative importance of similarity dimensions. We show that user engagement increases when recommended item titles share more terms with most recent queries. Moreover, the new algorithm increases query coverage without sacrificing input item similarity and item quality.
Search Query Categorization at Scale BIBAFull-Text 1281-1286
  Michal Laclavik; Marek Ciglan; Sam Steingold; Martin Seleng; Alex Dorman; Stefan Dlugolinsky
State of the art query categorization methods usually exploit web search services to retrieve the best matching web documents and map them to a given taxonomy of categories. This is effective but impractical when one does not own a web corpus and has to use a 3rd party web search engine API. The problem lies in performance and in financial costs. In this paper, we present a novel, fast and scalable approach to categorization of search queries based on a limited intermediate corpus: we use Wikipedia as the knowledge base. The presented solution relies on two steps: first a query is mapped to the relevant Wikipedia pages; second, the retrieved documents are categorized into a given taxonomy. We approach the first challenge as an entity search problem and present a new document categorization approach for the second step. On a standard data set, our approach achieves results comparable to the state-of-the-art approaches while maintaining high performance and scalability.
Leveraging Semantic Web technologies for more relevant E-tourism Behavioral Retargeting BIBAFull-Text 1287-1292
  Chun Lu; Milan Stankovic; Philippe Laublet
E-tourism is today an important field of e-commerce. One specificity of this field is that consumers spend much time comparing many options on multiple websites before purchasing. It's easy for consumers to forget the viewed offers or websites. Behavioral Retargeting (BR) is a widely used technique for online advertising. It leverages consumers' actions on advertisers' websites and displays relevant ads on publishers' websites. In this paper, we're interested in the relevance of the displayed ads in the e-tourism field. We present MERLOT, a semantic-based travel destination recommender system that can be deployed to improve the relevance of BR in the e-tourism field. We conducted a preliminary experiment with the real data of a French travel agency. The results from 33 participants were very promising relative to the baseline on all metrics used. With this paper, we wish to provide a novel viewpoint on the BR relevance problem, different from the dominant machine learning approaches.
People's Perceptions of Personalized Ads BIBAFull-Text 1293-1298
  Katie O'Donnell; Henriette Cramer
Advertising is key to the business model of many online services. Personalization aims to make ads more relevant for users and more effective for advertisers. However, relatively few studies into user attitudes towards personalized ads are available. We present a San Francisco Bay Area survey (N=296) and in-depth interviews (N=24) with teens and adults. People are divided and often either (strongly) agreed or disagreed about utility or invasiveness of personalized ads and associated data collection. Mobile ads were reported to be less relevant than those on desktop. Participants explained ad personalization based on their personal previous behaviors and guesses about demographic targeting. We describe both metrics improvements as well as opportunities for improving online advertising by focusing on positive ad interactions reported by our participants, such as personalization focused not just on product categories but specific brands and styles, awareness of life events, and situations in which ads were useful or even inspirational.
AdAlyze Redux: Post-Click and Post-Conversion Text Feature Attribution for Sponsored Search Ads BIBAFull-Text 1299-1304
  Thomas Steiner
In this paper, we present our ongoing research on an ads quality testing tool that we call AdAlyze Redux. This tool allows advertisers to get individual best practice recommendations based on an expandable set of textual ads features, tailored to exactly the ads in an advertiser's set of accounts. This lets them optimize their ad copies against the common online advertising key performance indicators clickthrough rate and, if available, conversion rate. We choose the Web as the tool's platform and automatically generate the analyses as platform-independent HTML5 slides and full reports.
Ad Recommendation Systems for Life-Time Value Optimization BIBAFull-Text 1305-1310
  Georgios Theocharous; Philip S. Thomas; Mohammad Ghavamzadeh
The main objective in the ad recommendation problem is to find a strategy that, for each visitor of the website, selects the ad that has the highest probability of being clicked. This strategy could be computed using supervised learning or contextual bandit algorithms, which treat two visits of the same user as two separate independent visitors, and thus optimize greedily for a single step into the future. Another approach would be to use reinforcement learning (RL) methods, which differentiate between two visits of the same user and two different visitors, and thus optimize for multiple steps into the future or the life-time value (LTV) of a customer. While greedy methods have been well-studied, the LTV approach is still in its infancy, mainly due to two fundamental challenges: how to compute a good LTV strategy and how to evaluate a solution using historical data to ensure its "safety" before deployment. In this paper, we tackle both of these challenges by proposing to use a family of off-policy evaluation techniques with statistical guarantees about the performance of a new strategy. We apply these methods to a real ad recommendation problem, both for evaluating the final performance and for optimizing the parameters of the RL algorithm. Our results show that our LTV optimization algorithm equipped with these off-policy evaluation techniques outperforms the greedy approaches. They also give fundamental insights on the difference between the click through rate (CTR) and LTV metrics for performance evaluation in the ad recommendation problem.

TempWeb 2015

Large-scale Network Analytics: Diffusion-based Computation of Distances and Geometric Centralities BIBAFull-Text 1313
  Paolo Boldi
Given a large complex network, which of its nodes are more central? This question emerged in many contexts (e.g., sociology, psychology and computer science), and gave rise to a large range of proposed centrality measures. Providing a sufficiently general and mathematically sound classification of these measures is challenging: on one hand, it requires that one can suggest some simple, basic properties that a centrality measure should exhibit; on the other hand, it calls for innovative algorithms that allow an efficient computation of these measures on large real networks. HyperBall is a recently proposed tool that accesses the graph in a semi-streaming fashion and is at the same time able to compute the distance distribution and to approximate all geometric (i.e., distance-based) centralities. It uses a very small amount of core memory, thanks to the application of HyperLogLog counters, and exhibits high, guaranteed accuracy.
Important Events in the Past, Present, and Future BIBAFull-Text 1315-1320
  Abdalghani Abujabal; Klaus Berberich
We address the problem of identifying important events in the past, present, and future from semantically-annotated large-scale document collections. Semantic annotations that we consider are named entities (e.g., persons, locations, organizations) and temporal expressions (e.g., during the 1990s). More specifically, for a given time period of interest, our objective is to identify, rank, and describe important events that happened. Our approach P2F Miner makes use of frequent itemset mining to identify events and group sentences related to them. It uses an information-theoretic measure to rank identified events. For each of them, it selects a representative sentence as a description. Experiments on ClueWeb09 using events listed in Wikipedia year articles as ground truth show that our approach is effective and outperforms a baseline based on statistical language models.
Dicer: A Framework for Controlled, Large-Scale Web Experiments BIBAFull-Text 1321-1326
  Sarah Chasins; Phitchaya Mangpo Phothilimthana
As dynamic, complex, and non-deterministic webpages proliferate, running controlled web experiments on live webpages is becoming increasingly difficult. To compare algorithms that take webpages as inputs, an experimenter must worry about ever-changing webpages, and also about scalability. Because webpage contents are constantly changing, experimenters must intervene to hold webpages constant, in order to guarantee a fair comparison between algorithms. Because webpages are increasingly customized and diverse, experimenters must test web algorithms over thousands of webpages, and thus need to implement their experiments efficiently. Unfortunately, no existing testing frameworks have been designed for this type of experiment. We introduce Dicer, a framework for running large-scale controlled experiments on live webpages. Dicer's programming model allows experimenters to easily 1) control when to enforce a same-page guarantee and 2) parallelize test execution. The same-page guarantee ensures that all loads of a given URL produce the same response. The framework utilizes a specialized caching proxy server to enforce this guarantee. We evaluate the tool on a dataset of 1,000 real webpages, and find it upholds the same-page guarantee with little overhead.
You Will Get Mail! Predicting the Arrival of Future Email BIBAFull-Text 1327-1332
  Iftah Gamzu; Zohar Karnin; Yoelle Maarek; David Wajc
The majority of Web email is known to be generated by machines even when one excludes spam. Many machine-generated email messages such as invoices or travel itineraries are critical to users. Recent research studies establish that causality relations between certain types of machine-generated email messages exist and can be mined. These relations link a given message to a past message that gave rise to its creation. For example, a shipment notification message can often be linked to a past online purchase message. Instead of studying how an incoming message can be linked to the past, we propose here to focus on predicting future email arrival as implied by causality relations. Such a prediction method has several potential applications, ranging from improved ad targeting in upsell scenarios to reducing false positives in spam detection.
   We introduce a novel approach for predicting which types of machine-generated email messages, represented by so-called "email templates", a user should receive in future time windows. Our prediction approach relies on (1) statistically inferring causality relations between email templates, (2) building a generative model that explains the inbox of each user using those causality relations, and (3) combining those results to predict which email templates are likely to appear in future time frames. We present preliminary experimental results and some data insights obtained by analyzing several million inboxes of Yahoo Mail users, who voluntarily opted-in for such research.
Learning Temporal Tagging Behaviour BIBAFull-Text 1333-1338
  Toni Gruetze; Gary Yao; Ralf Krestel
Social networking services, such as Facebook, Google+, and Twitter, are commonly used to share relevant Web documents with a peer group. By sharing a document with her peers, a user recommends the content for others and annotates it with a short description text. This short description yields many opportunities for text summarization and categorization. Because today's social networking platforms are real-time media, the sharing behaviour is subject to many temporal effects, i.e., current events, breaking news, and trending topics. In this paper, we focus on time-dependent hashtag usage of the Twitter community to annotate shared Web-text documents. We introduce a framework for time-dependent hashtag recommendation models and introduce two content-based models. Finally, we evaluate the introduced models with respect to recommendation quality based on a Twitter-dataset consisting of links to Web documents that were aligned with hashtags.
Learning to Detect Event-Related Queries for Web Search BIBAFull-Text 1339-1344
  Nattiya Kanhabua; Tu Ngoc Nguyen; Wolfgang Nejdl
In many cases, a user turns to search engines to find information about real-world situations, namely, political elections, sport competitions, or natural disasters. Such temporal querying behavior can be observed through a significant number of event-related queries generated in web search. In this paper, we study the task of detecting event-related queries, which is the first step for understanding temporal query intent and enabling different temporal search applications, e.g., time-aware query auto-completion, temporal ranking, and result diversification. We propose a two-step approach to detecting events from query logs. We first identify a set of event candidates by considering both implicit and explicit temporal information needs. The next step further classifies the candidates into two main categories, namely, event or non-event. In more detail, we leverage different machine learning techniques for query classification, which are trained using the feature set composed of time series features from signal processing, along with features derived from click-through information, and standard statistical features. In order to evaluate our proposed approach, we conduct an experiment using two real-world query logs with manually annotated relevance assessments for 837 events. To this end, we provide a large set of event-related queries made available for fostering research on this challenging task.
Temporal Patterns in Online Food Innovation BIBAFull-Text 1345-1350
  Tomasz Kusmierczyk; Christoph Trattner; Kjetil Nørvåg
Since innovation plays an important role in the context of food, as evident in how successful chefs, restaurants or cuisines in general evolve over time, we were interested in exploring this dimension from a more virtual perspective. In particular, the paper presents results of a study that was conducted in the context of a large-scale German online food community forum to explore another important dimension of online food recipe production, namely online food innovation. The study shows interesting findings and temporal patterns in terms of how online food recipe innovation takes place.
Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving BIBAFull-Text 1351-1355
  Jimmy Lin
Warcbase is an open-source platform for storing, managing, and analyzing web archives using modern "big data" infrastructure on commodity clusters -- specifically, HBase for storage and Hadoop for data analytics. This paper describes an effort to scale "down" Warcbase onto a Raspberry Pi, an inexpensive single-board computer about the size of a deck of playing cards. Apart from an interesting technology demonstration, such a design presents new opportunities for personal web archiving, in enabling a low-cost, low-power, portable device that is able to continuously capture a user's web browsing history -- not only the URLs of the pages that a user has visited, but the contents of those pages -- and allowing the user to revisit any previously-encountered page, as it appeared at that time. Experiments show that data ingestion throughput and temporal browsing latency are adequate with existing hardware, which means that such capabilities are already feasible today.
Mining Relevant Time for Query Subtopics in Web Archives BIBAFull-Text 1357-1362
  Tu Ngoc Nguyen; Nattiya Kanhabua; Wolfgang Nejdl; Claudia Niederée
With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence for such processes is of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is found empirically to be effective in solving the problem.
Exploring Supervised Methods for Temporal Link Prediction in Heterogeneous Social Networks BIBAFull-Text 1363-1368
  Nataliia Rümmele; Ryutaro Ichise; Hannes Werthner
In the link prediction problem, formulated as a binary classification problem, we want to classify, for each pair of disconnected nodes in the network, whether they will be connected by a link in the future. We study link formation in social networks with two types of links over several time periods. To solve the link prediction problem, we follow the approach of counting 3-node graphlets and suggest three extensions to the original method. By performing experiments on two real-world social networks, we show that the new methods have predictive power; however, network evolution cannot be explained by one specific feature at all time points. We also observe that some network properties can point to features which are more effective for temporal link prediction.
Disassortative Degree Mixing and Information Diffusion for Overlapping Community Detection in Social Networks (DMID) BIBAFull-Text 1369-1374
  Mohsen Shahriari; Sebastian Krott; Ralf Klamma
In this paper we propose a new two-phase algorithm for overlapping community detection (OCD) in social networks. In the first phase, called disassortative degree mixing, we identify nodes with high degrees through a random walk process on the row-normalized disassortative matrix representation of the network. In the second phase, we calculate how closely each node of the network is bound to the leaders via a cascading process called network coordination game. We implemented the algorithm and four additional ones as a Web service on a federated peer-to-peer infrastructure. Comparative test results for small and big real world networks demonstrated the correct identification of leaders, high precision and good time complexity. The Web service is available as open source software.
Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model BIBAFull-Text 1375-1380
  Andreas Spitz; Jannik Strötgen; Thomas Bögel; Michael Gertz
Approaches in support of the extraction and exploration of temporal information in documents provide an important ingredient in many of today's frameworks for text analysis. Methods range from basic techniques, primarily the extraction of temporal expressions and events from documents, to more sophisticated approaches such as ranking of documents with respect to their temporal relevance to some query term or the construction of timelines. Almost all of these approaches operate on the document level, that is, for a collection of documents a timeline is extracted or a ranked list of documents is returned for a temporal query term. In this paper, we present an approach to characterize individual dates, which can be of different granularities, and terms. Given a query date, a ranked list of terms is determined that are highly relevant for that date and best summarize the date. Analogously, for a query term, a ranked list of dates is determined that best characterize the term. Focusing on just dates and single terms as they occur in documents provides a fine-grained query and exploration method for document collections. Our approach is based on a weighted bipartite graph representing the co-occurrences of time expressions and terms in a collection of documents. We present different measures to obtain a ranked list of dates and terms for a query term and date, respectively. Our experiments and evaluation using Wikipedia as a document collection show that our approach provides an effective means in support of date and temporal term summarization.
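The weighted bipartite graph of date and term co-occurrences described above can be illustrated with a minimal sketch. The abstract does not specify the ranking measures, so the raw co-occurrence count used here is only a stand-in; the function names and the toy documents are hypothetical.

```python
from collections import defaultdict

def build_cooccurrence_graph(documents):
    """Build a weighted bipartite graph from (dates, terms) pairs.

    The edge weight between a date and a term is the number of
    documents in which both occur (an assumed, simplistic weighting).
    """
    date_to_terms = defaultdict(lambda: defaultdict(int))
    term_to_dates = defaultdict(lambda: defaultdict(int))
    for dates, terms in documents:
        for date in set(dates):
            for term in set(terms):
                date_to_terms[date][term] += 1
                term_to_dates[term][date] += 1
    return date_to_terms, term_to_dates

def rank_neighbors(adjacency, query, k=3):
    """Return up to k neighbors of `query`, heaviest edge first."""
    neighbors = adjacency.get(query, {})
    # Sort by descending weight, then alphabetically for determinism.
    return sorted(neighbors, key=lambda n: (-neighbors[n], n))[:k]

# Hypothetical toy collection: (time expressions, terms) per document.
docs = [
    (["1969-07-20"], ["apollo", "moon", "landing"]),
    (["1969-07-20", "1969-07"], ["apollo", "moon"]),
    (["1989-11-09"], ["berlin", "wall"]),
    (["1969-07-20"], ["moon"]),
]
date_to_terms, term_to_dates = build_cooccurrence_graph(docs)
top_terms = rank_neighbors(date_to_terms, "1969-07-20", k=2)
top_dates = rank_neighbors(term_to_dates, "moon", k=2)
```

Keeping both adjacency maps makes the two query directions (terms for a date, dates for a term) symmetric, and dates of different granularities simply appear as separate nodes.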

WDS4SC 2015

Architecture and Implementation Issues, Towards a Dynamic Waste Collection Management System BIBAFull-Text 1383-1388
  George Asimakopoulos; Sotiris Christodoulou; Andreas Gizas; Vassilios Triantafillou; Giannis Tzimas; John Gialelis; Artemios Voyiatzis; Dimitris Karadimas; Andreas Papalambrou
Dynacargo is an ongoing research project that introduces a breakthrough approach for cargo management systems, as it places the hauled cargo, rather than the vehicle, at the center of a haulage information management system. Dynacargo attempts to manage both distribution and collection processes, providing an integrated approach. This paper presents the Dynacargo architectural modules and the interrelations between them, as well as the research issues and development progress of some selected modules. In the context of the Dynacargo project, a set of durable, low-cost RFID tags is placed on waste bins in order to produce crucial data that is fed via diverse communication channels into the cargo management system. Besides feeding the management system with raw data from waste bins, data mining techniques are applied to archival data in order to predict the current fill status of waste bins. Moreover, easy-to-use mobile and web applications will be developed to encourage citizens to participate and become active information producers and consumers. The overall aim of the Dynacargo project is to develop a near real-time monitoring system that monitors and transmits the fill level of waste bins, in order to manage waste collection dynamically and more efficiently by minimizing the distances covered by refuse vehicles, relying on efficient routing algorithms.
An Urban Data Profiler BIBAFull-Text 1389-1394
  Daniel Castellani Ribeiro; Huy T. Vo; Juliana Freire; Cláudio T. Silva
Large volumes of urban data are being made available through a variety of open portals. Besides promoting transparency, these data can bring benefits to government, science, citizens and industry. It is no longer a fantasy to ask "if you could know anything about a city, what do you want to know" and to ponder what could be done with that information. However, the great number and variety of datasets creates a new challenge: how to find relevant datasets. While existing portals provide search interfaces, these are often limited to keyword searches over the limited metadata associated with each dataset, for example, attribute names and textual descriptions. In this paper, we present a new tool, UrbanProfiler, that automatically extracts detailed information from datasets. This information includes attribute types, value distributions, and geographical information, which can be used to support complex search queries as well as visualizations that help users explore and obtain insight into the contents of a data collection. Besides describing the tool and its implementation, we present case studies that illustrate how the tool was used to explore a large open urban data repository.
A Smart City Data Model based on Semantics Best Practice and Principles BIBAFull-Text 1395-1400
  Sergio Consoli; Misael Mongiovic; Andrea G. Nuzzolese; Silvio Peroni; Valentina Presutti; Diego Reforgiato Recupero; Daria Spampinato
Data management is crucial in modern smart cities. A good data model for smart cities has to be able to describe and integrate data from multiple domains, such as geographic information, public transportation, road maintenance, waste collection, and urban faults management. We describe our approach for creating a semantic platform for the Municipality of Catania, one of the main cities in Southern Italy. The ultimate goal is to set the metropolis on the route towards a modern smart city and to improve urban life. Our platform exhibits a consistent, minimal and comprehensive semantic data model for the city based on the Linked Open Data paradigm. Both the model and the data are publicly accessible through dedicated user-friendly services, which allow citizens to observe and interact with the work of the public administration. Our platform also enables interested businesses and programmers to develop front-end services on top of it. We describe the methodology used to extract data from sources, enrich them, build an ontology that describes them, and publish them under the Linked Open Data paradigm. Our description includes the tools and technologies employed. Our methodology is based on the standards of the W3C, on good practices of ontology design, on the guidelines issued by the Agency for Digital Italy and the Italian Index of Public Administration, as well as on the in-depth experience of the researchers in this field.
Developing Smart Cities Services through Semantic Analysis of Social Streams BIBAFull-Text 1401-1406
  Cataldo Musto; Giovanni Semeraro; Marco de Gemmis; Pasquale Lops
This paper presents a domain-agnostic framework for intelligent processing of textual streams coming from social networks. The framework implements a pipeline of techniques for semantic representation, sentiment analysis and automatic content classification, and provides an analytics console to derive findings from the extracted data. The effectiveness of the platform has already been demonstrated by deploying it in two smart-city scenarios: in the first it was used to monitor the recovery of the social capital of the city of L'Aquila after the dreadful earthquake of April 2009, while in the second a semantic analysis of the content posted on social networks was performed to build a map of the most at-risk areas of the Italian territory.
   In both scenarios, the outcomes resulting from the analysis confirmed the insight that the adoption of methodologies for intelligent and semantic analysis of textual content can provide interesting findings useful to improve the understanding of very complex phenomena.
Smart Urban Planning Support through Web Data Science on Open and Enterprise Data BIBAFull-Text 1407-1412
  Gloria Re Calegari; Irene Celino
Urban information abounds today in open Web sources as well as in enterprise datasets. However, the maintenance and update of this wealth of information about cities comes at different costs: some datasets are automatically produced, while other sources require expensive workflows including human intervention. Regression techniques can be employed to predict a costly dataset from a set of cheaper information sources. In this paper we present our early experiments in predicting land use and demographics from heterogeneous open and enterprise datasets referring to the city of Milano. The results are encouraging, demonstrating that a data science approach leveraging diverse data can be worthwhile for smarter urban planning support.

WebET 2015

Web-Based Context-Aware Science Learning BIBFull-Text 1415-1418
  Jacqueline Bourdeau; Thomas Forissier; Yves Mazabraud; Roger Nkambou
Visualisation and Analysis of Students' Interaction Data in Exploratory Learning Environments BIBAFull-Text 1419-1424
  Manolis Mavrikis; Zheng Zhu; Sergio Gutierrez-Santos; Alexandra Poulovassilis
Log files from adaptive Exploratory Learning Environments can contain prohibitively large quantities of data for visualisation and analysis. Moreover, it is hard to know in advance what data is required for analytical purposes. Using a microworld for secondary algebra as a case study, we discuss how students' interaction data can be transformed into a data warehouse in order to allow its visualisation and exploration using online analytical processing (OLAP) tools. We also present some additional, more targeted, visualisations of the students' interaction data. We demonstrate the possibilities that these visualisations provide for exploratory data analysis, enabling confirmation or contradiction of expectations that pedagogical experts may have about the system and ultimately providing both empirical evidence and insights for its further development.
A Case Study on the Use of Semantic Web Technologies for Learner Guidance BIBAFull-Text 1425-1430
  Andrea Zielinski; Jürgen Bock
Personalized learning pathways have been advocated by didactic experts to overcome the problem of disorientation and information overload in technology enhanced learning (TEL). They are not only relevant for providing user-adaptive navigational support, but can also be used for composing learning objects into new personalized courses (sequencing and assembly). In this paper we investigate how Semantic Web technologies can effectively support these tasks, based on a proper representation of learning objects and courses according to didactic requirements. We claim that both eLearning tasks, adaptive navigation and course assembly, call for a representational model that can capture the syntax and semantics of learning pathways adequately. In particular, this requires: (1) a new type of navigation that takes into account ordering information and the hierarchical structure of an eLearning course, complemented with adaptive constraints; (2) closely tied to it, a semantic layer to guarantee interoperability and validation of the correctness of the learning pathway descriptions. We investigate to what extent Semantic Web languages like RDF/S and OWL are expressive enough to handle different aspects of learning pathways. While both share a structural similarity with DAGs, only OWL ontologies -- formally underpinned by description logics (DLs) -- are expressive enough to validate the correctness of the data and infer semantically related learning resources on the pathway. For tasks that are more related to the syntax of learning pathways, in particular navigation similar to a guided tour, we test the time efficiency on various synthetic OWL ontologies using the HermiT reasoner. Experimental results show that the course structure and the density of the knowledge graph affect performance. We claim that in a dynamically changing environment, where the reachability of a vertex is computed on demand at run-time, OWL-based reasoning does not scale up well. 
Using a real-world case study from the eLearning domain, we compare an OWL 2 DL implementation with an equivalent graph algorithm implementation with respect to time efficiency.

WebQuality 2015

On the Role of Data Quality in Improving Web Information Value BIBAFull-Text 1433
  Cinzia Cappiello
In today's information era, more and more information is generated every day and people, on the one hand, benefit from the increasing support in decision processes and, on the other hand, experience difficulties in selecting the right data to use. That is, users may leverage more data but at the same time they may not be able to fully value such data since they lack the necessary knowledge about their provenance and quality. The data quality research area provides quality assessment and improvement methods that can be a valuable support for users that have to deal with the complexity of Web content. In fact, such methods help users to identify the suitability of information for their purposes. Most of the methods and techniques proposed, however, address issues for structured data and/or for defined contexts. Clearly, they cannot be easily used on the Web, where data come from heterogeneous sources and the context of use is most of the time unknown.
   In this keynote, the need for new assessment techniques is highlighted together with the importance of tracking data provenance as well as the reputation and trustworthiness of the sources. In fact, it is well known that the increase of data volume often corresponds to an increase of value, but to maximize such value the data sources to be used have to be carefully analyzed, selected and integrated depending on the specific context of use. The talk discusses the data quality dimensions necessary to analyze different Web data sources and provides a set of illustrative examples that show how to maximize the quality of gathered information.
Characterizing Credit Card Black Markets on the Web BIBAFull-Text 1435-1440
  Vlad Bulakh; Minaxi Gupta
We study carding shops that sell stolen credit and debit card information online. By bypassing the anti-scraping mechanisms they use, we find that the prices of cards depend heavily on factors such as the issuing bank, country of origin, and whether the card can be used in brick-and-mortar stores or not. Almost 70% of cards sold by these outfits are priced at or below the cost banks incur in re-issuing them. Ironically, this makes buying their own cards more economical for the banks than re-issuing. We also find that the monthly revenues for the carding shops we study are high enough to justify the risk fraudsters take. Further, inventory at carding outfits seems to follow data breaches, and the impact of delayed deployment of the smart chip technology is evident in the disproportionate share the U.S. commands in the underground card fraud economy.
Text Classification Kernels for Quality Prediction over the C3 Data Set BIBAFull-Text 1441-1446
  Balint Daroczy; David Siklois; Robert Palovics; Andras A. Benczur
We compare machine learning methods to predict quality aspects of the C3 dataset collected as a part of the Reconcile project. We give methods for automatically assessing the credibility, presentation, knowledge, intention and completeness by extending the attributes in the C3 dataset with the textual content of the pages. We use Gradient Boosted Trees and recommender methods over the (evaluator, site, evaluation) triplets and their metadata, and combine them with text classifiers. In our experiments, the best results are achieved by the theoretically justified normalized SVM kernel. The normalization can be derived by using the Fisher information matrix of the text content. As the main contribution, we describe the theory of the Fisher matrix and show that SVM may be particularly suitable for difficult text classification tasks.
Identification of Web Spam through Clustering of Website Structures BIBAFull-Text 1447-1452
  Filippo Geraci
Spam websites are domains whose owners are not interested in using them as gateways for their activities but instead park them to be sold in the secondary market of web domains. To turn the cost of the annual registration fees into a revenue opportunity, spam websites most often host a large number of ads in the hope that someone who lands on the site by chance clicks on some of them. Since parking has become a widespread activity, a large number of specialized companies have emerged and made parking a straightforward task that simply requires setting the domain's name servers appropriately.
   Although parking is a legal activity, spam websites have a deep negative impact on the information quality of the web and can significantly degrade the performance of most web mining tools. For example, these websites can influence search engine results or introduce an extra burden for crawling systems. In addition, spam websites represent a cost for ad bidders, who are obliged to pay for impressions or clicks that have a negligible probability of producing revenue.
   In this paper, we experimentally show that spam websites hosted by the same service provider tend to have a similar look-and-feel. Exploiting this structural similarity, we address the problem of the automatic identification of spam websites. In addition, we use the outcome of the classification to compile the list of the name servers used by spam websites so that they can be discarded before the first connection, just after the first DNS query. A dump of our dataset (including web pages and meta information) and the corresponding manual classification is freely available upon request.
Answer Quality Characteristics and Prediction on an Academic Q&A Site: A Case Study on ResearchGate BIBAFull-Text 1453-1458
  Lei Li; Daqing He; Wei Jeng; Spencer Goodwin; Chengzhi Zhang
Despite various studies examining and predicting answer quality on generic social Q&A sites such as Yahoo! Answers, little is known about why answers on academic Q&A sites are voted to be high-quality answers by the scholars who follow the discussion threads. Using 1021 answers obtained from the Q&A part of the academic social network site ResearchGate (RG), we first explored whether various web-captured features and human-coded features can be the critical factors that influence the peer-judged answer quality. Then, using the identified critical features, we constructed three classification models to predict the peer-judged rating. Our results yield four main findings. Firstly, responders' authority, shorter response time and greater answer length are the critical features that are positively associated with the peer-judged answer quality. Secondly, answers containing social elements are very likely to harm the peer-judged answer quality. Thirdly, an optimized SVM algorithm has an overwhelming advantage over other models in terms of accuracy. Finally, the prediction based on web-captured features performed better than the prediction based on human-coded features. We hope that these insights on ResearchGate's answer quality can help the further design of academic Q&A sites.

WIC 2015

Social Networks and the Semantic Web: A Retrospective of the Past 10 Years BIBAFull-Text 1461
  Peter Mika
Ten years have passed since the appearance of the first major online social networks as well as the standardization of Semantic Web technology. In this presentation, we will look at whether Semantic Web technologies have fulfilled their promise in improving the online networking experience, and conversely, the extent to which online networking helped improve the Semantic Web itself.
Exploriometer: Leveraging Personality Traits for Coverage and Diversity Aware Recommendations BIBAFull-Text 1463-1468
  Evangelos Chatzicharalampous; Zigkolis Christos; Athena Vakali
Since Collaborative Filtering Recommenders (CFR) were first introduced, there have been many attempts to improve their performance by enhancing the prediction accuracy. Even though rating prediction is the prevailing paradigm in CFR, there are other issues which have gained significant attention with respect to the content and its variety. Coverage, which constitutes the degree to which recommendations cover the set of available items, is an important factor along with diversity of the items proposed to an individual, often measured by an average dissimilarity between all pairs of recommended items. In this paper, we argue that coverage and diversity cannot be effectively addressed by conventional CFR with pure similarity-based neighborhood creation processes, especially in sparse datasets. Motivated by the need for including wider content characteristics, we propose a novel neighbor selection technique which emphasizes variety in preferences (to cover polyphony in selection). Our approach consists of a new metric, named "Exploriometer", which acts as a personality trait for users based on their rating behavior. We favor users who are explorers in order to increase polyphony, and subsequently coverage and diversity; but we still select similar users when we create neighborhoods as a solid basis in order to keep accuracy levels high. The proposed approach has been evaluated on two real-world datasets (MovieLens and Yahoo! Music) with coverage, diversity and accuracy aware recommendations extracted by both traditional CFR and CFR enhanced with our neighborhood creation process. We also introduce a new metric, inspired by the Pearson Correlation Coefficient, to estimate the diversity of recommended items. The derived results demonstrate that our neighbor selection technique can enhance coverage and diversity of the recommendations, especially on sparse datasets.
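The two quantities defined in this abstract, coverage as the share of available items that are recommended and diversity as the average dissimilarity over all pairs of recommended items, can be sketched as follows. This is a minimal illustration only: the Jaccard dissimilarity over feature sets is an assumption made here, not the Pearson-inspired metric the paper introduces, and all names and data are hypothetical.

```python
from itertools import combinations

def catalog_coverage(recommendation_lists, catalog):
    """Fraction of catalog items appearing in at least one recommendation list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)

def jaccard_dissimilarity(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two feature sets (assumed dissimilarity)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def intra_list_diversity(items, features):
    """Average dissimilarity over all unordered pairs of recommended items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_dissimilarity(features[x], features[y]) for x, y in pairs)
    return total / len(pairs)

# Hypothetical toy data: four catalog items, two users' recommendation lists.
catalog = ["a", "b", "c", "d"]
features = {"a": {"rock"}, "b": {"rock", "pop"}, "c": {"jazz"}, "d": {"pop"}}
coverage = catalog_coverage([["a", "b"], ["b", "c"]], catalog)
diversity = intra_list_diversity(["a", "b", "c"], features)
```

Coverage is computed over all users' lists together, while diversity is computed per list, which matches the distinction the abstract draws between the item set as a whole and the items proposed to an individual.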
Web Intelligence and Communities BIBAFull-Text 1469-1470
  Pierre Maret; Rajendra Akerkar; Laurent Vercouter
The World Wide Web (WWW) provides precious means for communication, which goes far beyond the traditional communication media. Web-based communities have become imperative spaces for individuals to seek and share expertise. Networks in these communities usually differ in their topology from other networks such as the World Wide Web. In this paper, we explore some research issues of web intelligence and communities. We will also introduce the WI&C'15 workshop's goal and structure.
A Bi-Dimensional User Profile to Discover Unpopular Web Sources BIBAFull-Text 1471-1476
  Romain Noël; Nicolas Malandain; Alexandre Pauchet; Laurent Vercouter; Bruno Grilheres; Stephan Brunessaux
The discovery of new sources of information on a given topic is a prominent problem for Experts in Intelligence Analysis (EIA) who cope with the search of pages on specific and sensitive topics. Their information needs are difficult to express with queries and pages with sensitive content are difficult to find with traditional search engines as they are usually poorly indexed. We propose a double vector to model EIA's information needs, composed of DBpedia resources and keywords, both extracted from Web pages provided by the user. We also introduce a new similarity measure that is used in a Web source discovery system called DOWSER. DOWSER aims at providing users with new sources of information related to their needs without considering the popularity of a page. A series of experiments provides an empirical evaluation of the whole system.
An evaluation of SimRank and Personalized PageRank to build a recommender system for the Web of Data BIBAFull-Text 1477-1482
  Phuong Nguyen; Paolo Tomeo; Tommaso Di Noia; Eugenio Di Sciascio
The Web of Data is the natural evolution of the World Wide Web from a set of interlinked documents to a set of interlinked entities. It is a graph of information resources interconnected by semantic relations, thereby yielding the name Linked Data. The proliferation of Linked Data is certainly an opportunity to create a new family of data-intensive applications such as recommender systems. In particular, since content-based recommender systems are based on the notion of similarity between items, the selection of the right graph-based similarity metric is of paramount importance to build an effective recommendation engine. In this paper, we review two existing metrics, SimRank and PageRank, and investigate their suitability and performance for computing similarity between resources in RDF graphs, as well as their use in feeding a content-based recommender system. Finally, we conduct experimental evaluations on a dataset for recommending musical artists and bands, comparing our results with two other content-based baselines and measuring performance with precision and recall, catalog coverage, item distribution and novelty metrics.
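Personalized PageRank, named in the title of this entry, can be illustrated with a small power-iteration sketch in which the random walk always restarts at one source resource, so the resulting scores act as similarities to that resource. The abstract does not say how the graph is derived from RDF data; the toy graph, its node names and the parameter values below are assumptions for illustration.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Power iteration for PageRank with all restarts directed to `source`.

    `graph` maps every node to a list of its out-neighbors; every
    neighbor must itself be a key of `graph`.
    """
    nodes = list(graph)
    scores = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        nxt[source] += 1.0 - damping  # restart mass (total mass is 1)
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * scores[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the source.
                nxt[source] += damping * scores[n]
        scores = nxt
    return scores

# Hypothetical graph linking artists to genres in both directions.
graph = {
    "artist_a": ["genre_rock"],
    "artist_b": ["genre_rock"],
    "artist_c": ["genre_jazz"],
    "genre_rock": ["artist_a", "artist_b"],
    "genre_jazz": ["artist_c"],
}
scores = personalized_pagerank(graph, "artist_a")
```

In this toy graph, artist_b shares a genre with the source and receives a positive score, while artist_c is unreachable from the source and stays at zero.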

WSREST 2015

An Approach to Support Data Integrity for Web Services Using Semantic RESTful Interfaces BIBAFull-Text 1485-1490
  Hermano Albuquerque Lira; Jose Renato Villela Dantas; Bruno de Azevedo Muniz; Tadeu Matos Nunes; Pedro Porfirio Muniz Farias
On the Web, linked data is growing rapidly due to its potential to facilitate data retrieval and data integration. At the same time, relational database systems store a vast amount of data published on the Web. Current linked data on the Web is mainly read-only. It allows for integration, navigation, and consultations in large structured datasets, but it still lacks a general concept for reading and writing. This paper proposes a specification, entitled Semantic Data Services (SDS), for RESTful Web services that provide data access. To provide linked data read-write capacity, SDS proposes a mechanism for integrity checking that is analogous to that used in relational databases. SDS implements the Semantic Restful Interface (SERIN) specification. SERIN uses annotations on classes in an ontology, describing the semantic web services that are available to manipulate data. This work extends the SERIN specification by adding annotations that allow the adoption of data access integrity constraints.
BrowserCloud: A Personal Cloud for Browser Session Migration and Management BIBAFull-Text 1491-1496
  Junjie Feng; Aaron Harwood
Web browsers are de facto clients for an ever-increasing range of web applications. At the same time, web users are accessing these applications from a wide range of devices. This paper presents a solution for runtime browser session migration and management, called BrowserCloud, which allows a user to securely manage multiple browsers from a personal or third-party Cloud service and to migrate snapshots of active browser sessions between browsers on different devices, using a robust security module. The design of BrowserCloud is based on browser extensions/plugins that can preserve and restore browser session state, and a PHP server that stores browser sessions securely. We have tested our implementation over a range of increasingly complex web applications, including WebRTC and HTML5 video. To the best of our knowledge, our implementation is the most robust and secure approach to runtime browser session management to date.
Model-driven Testing of RESTful APIs BIBAFull-Text 1497-1502
  Tobias Fertig; Peter Braun
In contrast to the increasing popularity of REpresentational State Transfer (REST), systematic testing of RESTful Application Programming Interfaces (APIs) has not attracted much attention so far. This paper describes different aspects of automated testing of RESTful APIs. We then focus on functional and security tests, for which we apply a technique called model-based software development. Based on an abstract model of the RESTful API that comprises resources, states and transitions, a software generator not only creates the source code of the RESTful API but also creates a large number of test cases that can be immediately used to test the implementation. This paper describes the process of developing a software generator for test cases using state-of-the-art tools and provides an example to show the feasibility of our approach.
Towards Optimising the Data Flow in Distributed Applications BIBAFull-Text 1503-1508
  Felix Leif Keppmann; Maria Maleshkova; Andreas Harth
Networked applications continuously move towards service-based and modular solutions. At the same time, web technologies, proven to be modular and distributed, are applied to these application areas. However, web technologies have to be adapted to the new characteristics of the involved systems -- no explicit client and server roles, use of heterogeneous devices, or high frequency and low latency data communication. To this end, we present an approach for describing distributed applications in terms of graphs of communicating nodes. In particular, we develop a formal model for capturing the communication between nodes, by including dynamic and static data producing devices, data consuming client applications, as well as devices that can serve as data producers and consumers at the same time. In our model, we characterise nodes by their frequencies of data exchange. We complement our model with a decision algorithm for determining the pull/push communication direction to optimise the amount of redundantly transferred data (i.e., data that is pushed but cannot be processed or data that is pulled but is not yet updated). The presented work lays the foundation for creating distributed applications which can automatically optimise data exchange.
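The pull/push decision described above can be illustrated with a deliberately simplified sketch. It assumes, beyond what the abstract states, that each node is characterised by a single frequency in Hz, that every push beyond the consumer's processing rate is redundant, and that every pull beyond the producer's update rate is redundant. The function names are hypothetical and this is not the paper's decision algorithm.

```python
def redundant_transfers(producer_hz, consumer_hz):
    """Estimate redundant transfers per second for each direction.

    Push: the producer sends every update, so updates beyond the
    consumer's processing rate cannot be processed.
    Pull: the consumer requests at its own rate, so requests beyond
    the producer's update rate return data that is not yet updated.
    """
    push = max(0.0, producer_hz - consumer_hz)
    pull = max(0.0, consumer_hz - producer_hz)
    return {"push": push, "pull": pull}

def choose_direction(producer_hz, consumer_hz):
    """Pick the direction with fewer redundant transfers (ties go to push)."""
    redundancy = redundant_transfers(producer_hz, consumer_hz)
    return "push" if redundancy["push"] <= redundancy["pull"] else "pull"

# A slow sensor feeding a fast client, and a fast sensor feeding a slow one.
slow_producer = choose_direction(producer_hz=1.0, consumer_hz=10.0)
fast_producer = choose_direction(producer_hz=50.0, consumer_hz=5.0)
```

Under these assumptions the rule reduces to comparing the two frequencies: the slower side of the link determines how much data is useful, and the direction is chosen so that the slower side initiates the transfers.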
Rapido: A Sketching Tool for Web API Designers BIBAFull-Text 1509-1514
  Ronnie Mitra
Well-designed Web APIs must provide high levels of usability and must "get it right" on the first release. One strategy for accomplishing this feat is to identify usability issues early in the design process before a public release. Sketching is a useful way of improving the user experience early in the design phase. Designers can create many sketches and learn from them. The Rapido tool is designed to automate the Web API sketching process and help designers improve usability in an iterative fashion.
Adding Rules on Existing Hypermedia APIs BIBAFull-Text 1515-1517
  Michael Petychakis; Fenareti Lampathaki; Dimitrios Askounis
During the past years, the data deluge that prevails in the World Wide Web has been accompanied by a number of APIs that expose business logic. In this paper, we discuss a novel approach to enrich existing API standards definitions with business rules. Taking advantage of the REST principles, we aim at enabling the creation of generic clients that can dynamically navigate through semantically enriched web affordances with the help of Hydra-based Hypermedia API descriptions, which encapsulate the finite state machine of possible actions into SWRL rules.

Tutorials

An Introduction to Entity Recommendation and Understanding BIBAFull-Text 1521-1522
  Hao Ma; Yan Ke
Entities and knowledge about them have become indispensable building blocks in modern search engines. This tutorial aims to present the current state of research in the emerging semantic search and recommendation field by studying how to help users effectively explore the knowledge base as well as answer their information needs. Many live applications and systems will be demonstrated throughout this tutorial. After the completion of the tutorial, the audience will have an introduction to and overview of the emerging topics of entity recommendation and understanding. The audience will learn and be able to understand some current research work as well as industry practices using computational intelligence techniques in entity recommendation and understanding. One of the major goals of this tutorial is to help the audience identify a few research directions that could have a big impact in the near future.
Constructing and Mining Web-Scale Knowledge Graphs: WWW 2015 Tutorial BIBAFull-Text 1523
  Antoine Bordes; Evgeniy Gabrilovich
Recent years have witnessed a proliferation of large-scale knowledge graphs, such as Freebase, Google's Knowledge Graph, YAGO, Facebook's Entity Graph, and Microsoft's Satori. Whereas there is a large body of research on mining homogeneous graphs, this new generation of information networks are highly heterogeneous, with thousands of entity and relation types and billions of instances of vertices and edges. In this tutorial, we will present the state of the art in constructing, mining, and growing knowledge graphs. The purpose of the tutorial is to equip newcomers to this exciting field with an understanding of the basic concepts, tools and methodologies, available datasets, and open research challenges. A publicly available knowledge base (Freebase) will be used throughout the tutorial to exemplify the different techniques.
Deep Learning for the Web BIBAFull-Text 1525-1526
  Kyomin Jung; Byoung-Tak Zhang; Prasenjit Mitra
Deep learning is a machine learning technology that automatically extracts higher-level representations from raw data by stacking multiple layers of neuron-like units. The stacking allows for extracting representations of increasingly complex features without time-consuming, offline feature engineering. Recent success of deep learning has shown that it outperforms state-of-the-art systems in image processing, voice recognition, web search, recommendation systems, etc. [1]. A lot of industrial-scale big data processing systems including IBM Watson's Jeopardy Contest 2011, Google Now, Facebook's face recognition system, and the voice recognition systems by Google and Microsoft use deep learning [2][3][6]. Deep learning has a huge potential to improve the intelligence of the web and the web service systems by efficiently and effectively mining big data on the Web [4][5]. This tutorial provides the basics of deep learning as well as its key applications. We give the motivation and underlying ideas of deep learning and describe the architectures and learning algorithms for various deep learning models. We also cover applications of deep learning for image and video processing, natural language and text data analysis, social data analytics, and wearable IoT sensor data with an emphasis in the domain of Web systems. We will deliver the key insight and understanding of these techniques, using graphical illustrations and examples that could be important in analyzing a large amount of Web data. The tutorial is prepared to attract a general audience at the WWW Conference, who are interested in machine learning and big data analysis for Web data. The tutorial consists of five parts. The first part presents the basics of neural networks, and their structures. Then we explain the training algorithm via backpropagation, which is a common method of training artificial neural networks including deep neural networks.
We will emphasize how each of these concepts can be used in various Web data analysis tasks. In the second part of the tutorial, we describe the learning algorithms for deep neural networks and related ideas, such as contrastive divergence, wake-sleep algorithms, and Monte Carlo simulation. We then describe various kinds of deep architectures, including stacked autoencoders, deep belief networks [7], convolutional neural networks [8], and deep hypernetworks [9]. In the third part, we present more details of the recursive neural networks, which can learn structured tree outputs as well as vector representations for phrases and sentences. We first show how training the recursive neural network can be achieved by a modified version of the back-propagation algorithm introduced before. These modifications allow the algorithm to work on tree structures. Then we will present its applications to sentence analysis including POS tagging and sentiment analysis. The fourth part discusses the neural networks used to generate word embeddings, such as Word2Vec [10], DSSM for deep semantic similarity [11], and object detection in images [12], such as GoogLeNet and AlexNet. We will explain in detail the applications of these deep learning techniques in the analysis of various social network data. By this point, the audience should have a clear understanding of how to build a deep learning system for word, sentence and document level tasks. The fifth part of the tutorial will cover other application examples of deep learning. These include object segmentation and action recognition from videos [9], web data analytics, and wearable/IoT sensor data modeling for smart services.
Diffusion in Social and Information Networks: Research Problems; Probabilistic Models & Machine Learning Methods BIBAFull-Text 1527-1528
  Manuel Gomez-Rodriguez; Le Song
In recent years, there has been an increasing effort in developing realistic models, and learning and inference algorithms to understand, predict, and influence diffusion over networks. This has been in part due to the increasing availability and granularity of large-scale diffusion data, which, in principle, allows for understanding and modeling not only macroscopic diffusion but also microscopic (node-level) diffusion. To this aim, a bottom-up approach has been typically considered, which starts by considering how particular ideas, pieces of information, products, or, more generally, contagions spread locally from node to node apparently at random to later produce global, macroscopic patterns at a network level. However, this bottom-up approach also raises significant modeling, algorithmic and computational challenges which require leveraging methods from machine learning, probabilistic modeling, event history analysis and graph theory, as well as the nascent field of network science. In this tutorial, we will present several diffusion models designed for fine-grained large-scale diffusion data, present some canonical research problems in the context of diffusion, and introduce state-of-the-art algorithms to solve some of these problems, in particular, network estimation, influence estimation and influence control.
Diversity and Novelty on the Web: Search, Recommendation, and Data Streaming Aspects BIBAFull-Text 1529-1530
  Rodrygo L. T. Santos; Pablo Castells; Ismail Sengor Altingovde; Fazli Can
This tutorial aims to provide a unifying account of current research on diversity and novelty in different web information systems. In particular, the tutorial will cover the motivations, as well as the most established approaches for producing and evaluating diverse results in search engines, recommender systems, and data streams, all within the context of the World Wide Web. By contrasting the state-of-the-art in these multiple domains, this tutorial aims to derive a common understanding of the diversification problem and the existing solutions, their commonalities and differences, as a means to foster new research directions.
From Complex Object Exploration to Complex Crowdsourcing BIBAFull-Text 1531-1532
  Sihem Amer-Yahia; Senjuti Basu Roy
Forming and exploring complex objects is at the heart of a variety of emerging web applications. Historically, existing work on complex objects has been developed in two separate areas: composite item retrieval and team formation. At the same time, emerging applications that harness the wisdom of crowd workers, such as, document editing by workers, sentence translation by fans (or fan-subbing), innovative design, citizen science or journalism, represent complex crowdsourcing, in which an object may represent a complex task formed by a set of sub-tasks or a team of workers who work together to solve the task. The goal of this tutorial is to bridge the gap between composite item retrieval and team formation and define new research directions for complex crowdsourcing applications.
Geo-Social Media Analytics BIBAFull-Text 1533-1534
  Cheng-Te Li; Hsun-Ping Hsieh
With the maturity of wireless communication techniques, GPS-equipped mobile devices have become ubiquitous, and location-acquisition technologies and services are flourishing. These location applications as well as mobile devices, developed and combined with the social networking services, foster the emergence of geo-social media, a novel type of user-generated geo-social data, such as data from Facebook, Twitter, and Foursquare. In geo-social media, social connections and geo-location information of users are the essential elements, which keep track of their user interactions and their spatial-temporal activities. Social interactions are depicted by online network structures, while geographical activities are usually represented as check-in records. Due to the pervasive mobility of users, a huge amount of user-generated geo-social data is rapidly generated. Such big geo-social data not only collectively represents diverse kinds of real-world human activities, but also serves as a handy resource for various geo-social applications. In this tutorial, we aim to present the recent advances on geo-social media analytics in a systematic manner, consisting of five parts: (a) properties of geo-social networks, which unveil the relationships between human mobility and social structures; (b) geo-social link prediction, using geographical, mobility, activity features with various inference models; (c) location recommendation, leveraging personal, social, contextual, geographical, and content information; (d) geo-social influence propagation and maximization; and (e) connecting online and offline social networks for revisiting conventional SNA wisdom and developing applications that bridge virtual and physical social worlds. We also highlight the unsolved problems for each of the aforementioned topics and future directions of geo-social media analytics.
We believe this tutorial can benefit the research communities of mobile web, data mining, information retrieval, social network analysis, recommender system, and marketing and advertisement.
Knowledge Bases for Web Content Analytics BIBFull-Text 1535
  Johannes Hoffart; Nicoleta Preda; Fabian M. Suchanek; Gerhard Weikum
Large Scale Network Analytics with SNAP: Tutorial at the World Wide Web 2015 Conference BIBAFull-Text 1537-1538
  Rok Sosic; Jure Leskovec
Many techniques for the modeling, analysis and optimization of Web related datasets are based on studies of large scale networks, where a network can contain hundreds of millions of nodes and billions of edges. Network analysis tools must provide not only extensive functionality, but also high performance in processing these large networks. The tutorial will present the Stanford Network Analysis Platform (SNAP), a general purpose, high performance system for analysis and manipulation of large networks. SNAP is being used widely in studies of Web datasets. SNAP consists of open source software, which provides a rich set of functions for performing network analytics, and a popular repository of publicly available real world network datasets. SNAP software APIs are available in Python and C++.
   The tutorial will cover all aspects of SNAP, including APIs and datasets. The tutorial will include a hands-on component, where the participants will have the opportunity to use SNAP on their computers.
LIKE and Recommendation in Social Media BIBAFull-Text 1539-1540
  Dongwon Lee; Huan Liu
This tutorial covers the state-of-the-art developments in LIKE and recommendation in social media. It is designed for graduate students, practitioners, or IT managers with a general understanding of the WWW and social media. No prerequisites are expected.
Mining Mobility Data BIBAFull-Text 1541-1542
  Spiros Papadimitriou; Tina Eliassi-Rad
The fairly recent explosion in the availability of reasonably fast wireless and mobile data networks has spurred demand for more capable mobile computing devices. Conversely, the emergence of new devices increases demand for better networks, creating a virtuous cycle. The current concept of a smartphone as an always-connected computing device with multiple sensing modalities was brought into the mainstream by the Apple iPhone just a few years ago. Such devices are now seeing an explosive growth. Additionally, for many people in the world, such devices will be the first computers they use. Furthermore, small, cheap, always-connected devices (standalone or peripheral) with additional sensing capabilities are very recently emerging, further blurring the lines between the Web, mobile applications (a.k.a. apps), and the real world. All of this opens up countless possibilities for data collection and analysis, for a broad range of applications. In this tutorial, we survey the state-of-the-art in terms of mining mobility data across different application areas such as ads, geo-social, privacy and security. Our tutorial consists of three parts. (1) We summarize the possibilities and challenges in the collection of data from various sensing modalities. (2) We cover cross-cutting challenges such as real-time analysis and security; and we outline cross-cutting algorithms for mobile data mining such as network inference and streaming algorithms. (3) We focus on how all of this can be usefully applied to broad classes of applications, notably mobile and location-based social, mobile advertising and search, mobile Web, and privacy and security. We conclude by showcasing the opportunities for new data collection techniques and new data mining methods to meet the challenges and applications that are unique to the mobile arena (e.g., leveraging emerging embedded computing and sensing technologies to collect a large variety and volume of new kinds of "big data").
Online Experiments for Computational Social Science BIBAFull-Text 1543
  Eytan Bakshy; Sean J. Taylor
Experiments are the gold standard for establishing causal relationships. While Web-based experiments ("A/B tests") have routinely been used to assess alternative ranking models or user interface designs, they have become increasingly popular for answering important questions in the social sciences. This tutorial teaches attendees how to design, plan, implement, and analyze online experiments. First, we review basic concepts in causal inference and motivate the need for experiments. Then we will discuss basic statistical tools to help plan experiments: exploratory analysis, power calculations, and the use of simulation in R. We then discuss statistical methods to estimate causal quantities of interest and construct appropriate confidence intervals. We then discuss how to design and implement online experiments using PlanOut, an open-source toolkit for advanced online experimentation used at Facebook. We will show how to implement a variety of experiments, including basic A/B tests, within-subjects designs, as well as more sophisticated experiments. We demonstrate how experimental designs from social computing literature can be implemented, and then collaboratively plan and implement an experiment together. We then discuss issues with logging and common errors in the deployment and analysis of experiments. Finally, we will conclude the tutorial with a discussion of strategies and scalable methods for analyzing online experiments, including working with weighted data, and data with single and multi-way dependence. Throughout the tutorial, attendees will be given code examples and participate in the planning, implementation, and analysis of a Web application using Python, PlanOut, and R.
Processing Large Graphs: Representations, Storage, Systems and Algorithms BIBAFull-Text 1545
  Deepak Ajwani; Marcel Karnstedt; Alessandra Sala
Analyzing and processing large graphs is of fundamental importance for an ever-growing number of applications. Significant advancements in the last few years on both the systems and the algorithmic side have made graph processing increasingly scalable and efficient. Often, these advances are still not well-known and well-understood outside the systems and algorithms communities. In particular, there is very little understanding of the various trade-offs involved in the usage of particular combinations of algorithms, data structures, and systems. This tutorial will have a particular focus on this aspect, imparting theoretical knowledge intertwined with hands-on experience.
   Since there is no clearly winning system/algorithm combination that performs best on all the different metrics, it is of utmost importance to understand the pros and cons of the various alternatives. The tutorial will enable application developers in industry and academics, students as well as researchers to make corresponding decisions in an informed way. Participants require neither any particular prior knowledge apart from a basic understanding of core computer science concepts, nor any special equipment apart from their laptop.
   After a general introduction, we will describe the critical dimensions that need to be tackled together to effectively and efficiently overcome problems in large graph processing: data representation, data storage, acceleration via multi-core programming, and horizontally scalable graph-processing infrastructures. Thereafter, we will provide an overview of existing graph-processing systems and graph databases. This will be followed by hands-on experiences with popular representatives of such systems. Finally, we will provide a detailed description of algorithms used in these systems for fundamental problems like shortest paths and PageRank, how they are implemented, and how this affects the overall performance. We will also cover basic data structures such as distance oracles that can be built on these systems to efficiently answer distance queries for real-world graphs.
Urban Informatics and the Web BIBAFull-Text 1547
  Konstantinos Pelechrinis; Daniele Quercia
Based on a recent report from the United Nations, more than 50% of the world's population currently lives in cities. This percentage is projected to increase to 70% by the year 2050 [1]. As massive numbers of people move to urban areas there is a need for cities to be run more efficiently, while at the same time improving the quality of life of their dwellers. Nevertheless, the exact same force that sets the above requirement, i.e., the proliferation of urbanization levels, makes this task much harder and more challenging, especially in megacities. Despite the aforementioned conflicting dynamics, many city management operations can be facilitated by appropriate exploitation of the unprecedented amount of data that can be made available to authorities from a variety of sources. In the era of big data and ubiquitous and pervasive mobile computing, different types of sensors such as parking meters, weather sensors, traffic sensors, pipe sensors, public transportation ticket readers and even human sensors (e.g., through web technologies, social media or cell phone usage data) can assist in these efforts. Furthermore, civic applications can exploit web and mobile technologies to deliver a livable, sustainable and resilient environment to the citizens. Harnessing these information streams and technologies presents many challenges that are at the center of this tutorial. In this tutorial we will present the current practices and methods in the emerging field of urban informatics as well as the open challenges. The topics to be covered in this tutorial are structured in three sessions: (i) introduction to urban studies and urban informatics, (ii) civic data and technologies for urban sensing and (iii) analytical techniques used for urban data analysis. Finally, we will also provide concrete examples of urban informatics applications.

Workshop Summaries

LDOW 2013: The 8th Workshop on Linked Data on the Web BIBAFull-Text 1549-1550
  Sören Auer; Tim Berners-Lee; Christian Bizer; Tom Heath
This paper presents a brief summary of the eighth workshop on Linked Data on the Web. The LDOW 2013 workshop is held in conjunction with the World Wide Web conference 2013. The focus is on data publishing, integration and consumption using RDF and other semantic representation formalisms and technologies.
#Microposts2015 -- 5th Workshop on 'Making Sense of Microposts': Big things come in small packages BIBAFull-Text 1551-1552
  Matthew Rowe; Milan Stankovic; Aba-Sah Dadzie
#Microposts2015, the 5th workshop on 'Making Sense of Microposts', is summarised by the sub-theme: 'big things come in small packages'. The workshop was born out of research we were each carrying out as microblogging platforms became increasingly popular, and their value as a publishing platform and the data generated as a result began to be recognised. This phenomenon continues to grow, as microblogs provide a low-effort means of publishing information within private, but more so public, fora, giving a voice to all in all arenas.