HCI Bibliography Home | HCI Conferences | WWW Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
WWW Tables of Contents: 060708091011-111-212-112-213-113-214-114-215-115-2

Companion Proceedings of the 2013 International Conference on the World Wide Web

Fullname:Companion Proceedings of the 22nd International Conference on World Wide Web
Editors:Daniel Schwabe; Virgílio Almeida; Hartmut Glaser; Ricardo Baeza-Yates; Sue Moon
Location:Rio de Janeiro, Brazil
Dates:2013-May-13 to 2013-May-17
Standard No:ISBN: 978-1-4503-2038-2; ACM DL: Table of Contents; hcibib: WWW13-2
Links:Conference Website
  1. WWW 2013-05-13 Volume 2
    1. Developer's track
    2. Posters: behavioral analysis and personalization
    3. Posters: bridging structured and unstructured data
    4. Posters: content analysis
    5. Posters: internet monetization and incentives
    6. Posters: search systems and applications
    7. Posters: security, privacy, trust, and abuse
    8. Posters: semantic web
    9. Posters: social networks and graph analysis
    10. Posters: user interfaces, human factors, and smart devices
    11. Posters: web engineering
    12. Posters: web mining
    13. Big data & web applications demonstrations
    14. Social media, crowdsourcing & services demonstrations
    15. Rich media, information extraction, & search demonstrations
    16. Doctoral consortium
    17. LILE'13 keynote talk
    18. LILE'13 session 1
    19. LILE'13 session 2
    20. LIME'13 keynote talk
    21. LIME'13 technical presentations
    22. LIME'13 demonstrations
    23. LSNA'13 keynote talks
    24. LSNA'13 technical presentations
    25. MABSDA'13 technical presentations
    26. MSM'13 keynote talk
    27. MSM'13 machine learning & statistical analysis
    28. MSM'13 trend & topic detection in microposts
    29. MSM'13 filtering & classification of microposts
    30. MSM'13 posters & demonstrations
    31. MSND'13 technical presentations
    32. PHDA'13 technical presentations
    33. PSOM'13 technical presentations
    34. RAMSS'13 keynote talks
    35. RAMSS'13 session 1
    36. RAMSS'13 session 2
    37. SIMPLEX'13 technical session 1
    38. SIMPLEX'13 technical session 2
    39. SNOW'13 opening
    40. SNOW'13 breaking the news
    41. SNOW'13 social news
    42. SOCM'13 technical presentations
    43. SRS'13 keynote talks
    44. SRS'13 technical presentations
    45. SWDM'13 keynote
    46. SWDM'13 twitter in action
    47. SWDM'13 keynote 2
    48. SWDM'13 insights from social web
    49. TEMPWEB'13 keynote talk
    50. TEMPWEB'13 web archiving
    51. TEMPWEB'13 identifying and leveraging time information
    52. TEMPWEB'13 studies and experience sharing
    53. WEBQUALITY'13 keynote talk
    54. WEBQUALITY'13 web content quality session
    55. WEBQUALITY'13 industry experience session
    56. WEBQUALITY'13 web spam detection session
    57. WI&C'13 keynote talk
    58. WI&C'13 session 1
    59. WI&C'13 session 2
    60. WOLE'13 keynote talk
    61. WOLE'13 technical presentations
    62. WOW'13 technical presentations
    63. WS-REST'13 technical presentations

WWW 2013-05-13 Volume 2

Developer's track

The linked data platform (LDP) BIBAFull-Text 1-2
  Arnaud J. Le Hors; Steve Speicher
As a result of the Linked Data Basic Profile submission, made by several organizations including IBM, EMC, and Oracle, the W3C launched in June 2012 the Linked Data Platform (LDP) Working Group (WG).
   The LDP WG is chartered to produce a W3C Recommendation for HTTP-based (RESTful) application integration patterns using read/write Linked Data. This work will benefit both small-scale in-browser applications (WebApps) and large-scale Enterprise Application Integration (EAI) efforts. It will complement SPARQL and will be compatible with standards for publishing Linked Data, bringing the data integration features of RDF to RESTful, data-oriented software development.
   This presentation introduces developers to the Linked Data Platform, explains its origins in the Open Services Lifecycle Collaboration (OSLC) initiative, describes how it fits with other existing Semantic Web technologies and the problems developers will be able to address using LDP, based on use cases such as the integration challenge the industry faces in the Application Lifecycle Management (ALM) space.
   By attending this presentation developers will get an understanding of this upcoming W3C Recommendation which is posed to become a major stepping stone in enabling broader adoption of Linked Data in the industry, not only for publishing data but also for integrating applications.
Quill: a collaborative design assistant for cross platform web application user interfaces BIBAFull-Text 3-6
  Vivian Genaro Motti; Dave Raggett
Web application development teams face an increasing burden when they need to come up with a consistent user interface across different platforms with different characteristics, for example, desktop, smart phone and tablet devices. This is going to get even worse with the adoption of HTML5 on TVs and cars. This short paper describes a browser-based collaborative design assistant that does the drudge work of ensuring that the user interfaces are kept in sync across all of the target platforms and with changes to the domain data and task models. This is based upon an expert system that dynamically updates the user interface design to reflect the developer's decisions. This is implemented in terms of constraint propagation and search through the design space. An additional benefit is the ease of providing accessible user interfaces in conjunction with assistive technologies.
Linked services infrastructure: a single entry point for online media related to any linked data concept BIBAFull-Text 7-10
  Lyndon Nixon
In this submission, we describe the Linked Services Infrastructure (LSI). It uses Semantic Web Service technology to map individual concepts (identified by Linked Data URIs) to sets of online media content aggegrated from heterogeneous Web APIs. It exposes this mapping service in a RESTful API and returns RDF based responses for further processing if desired. The LSI can be used as a general purpose tool for user agents to retrieve different online media resources to illustrate a concept to a user.
ResourceSync: leveraging sitemaps for resource synchronization BIBAFull-Text 11-14
  Bernhard Haslhofer; Simeon Warner; Carl Lagoze; Martin Klein; Robert Sanderson; Michael L. Nelson; Herbert Van de Sompel
Many applications need up-to-date copies of collections of changing Web resources. Such synchronization is currently achieved using ad-hoc or proprietary solutions. We propose ResourceSync, a general Web resource synchronization protocol that leverages XML Sitemaps. It provides a set of capabilities that can be combined in a modular manner to meet local or community requirements. We report on work to implement this protocol for arXiv.org and also provide an experimental prototype for the English Wikipedia as well as a client API.
Static typing & JavaScript libraries: towards a more considerate relationship BIBAFull-Text 15-18
  Benjamin Canou; Emmanuel Chailloux; Vincent Botbol
In this paper, after relating a short history of the mostly unhappy relationship between static typing and JavaScript (JS), we explain a new attempt at conciliating them which is more respectful of both worlds than other approaches. As an example, we present Onyo, an advanced binding of the Enyo JS library for the OCaml language. Onyo exploits the expressiveness of OCaml's type system to properly encode the structure of the library, preserving its design while statically checking that it is used correctly, and without introducing runtime overhead.
Client-server web applications widgets BIBAFull-Text 19-22
  Vincent Balat
The evolution of the Web from a content platform into an application platform has raised many new issues for developers. One of the most significant is that we are now developing distributed applications, in the specific context of the underlying Web technologies. In particular, one should be able to compute some parts of the page either on server or client sides, depending on the needs of developers, and preferably in the same language, with the same functions. This paper deals with the particular problem of user interface generation in this client-server setting. Many widget libraries for browsers are fully written in JavaScript and do not allow to generate the interface on server side, making more difficult the indexing of pages by search engines. We propose a solution that makes possible to generate widgets either on client side or on server side in a very flexible way. It is implemented in the Ocsigen framework.
Effective web scraping with OXPath BIBAFull-Text 23-26
  Giovanni Grasso; Tim Furche; Christian Schallhart
Even in the third decade of the Web, scraping web sites remains a challenging task: Most scraping programs are still developed as ad-hoc solutions using a complex stack of languages and tools. Where comprehensive extraction solutions exist, they are expensive, heavyweight, and proprietary.
   OXPath is a minimalistic wrapping language that is nevertheless expressive and versatile enough for a wide range of scraping tasks. In this presentation, we want to introduce you to a new paradigm of scraping: declarative navigation -- instead of complex scripting or heavyweight, limited visual tools, OXPath turns scraping into a simple two step process: pick the relevant nodes through an XPath expression and then specify which action to apply to those nodes. OXPath takes care of browser synchronisation, page and state management, making scraping as easy as node selection with XPath. To achieve this, OXPath does not require a complex or heavyweight infrastructure. OXPath is an open source project and has seen first adoption in a wide variety of scraping tasks.
CSS browser selector plus: a JavaScript library to support cross-browser responsive design BIBAFull-Text 27-30
  Richard Duchatsch Johansen; Talita Cristina Pagani Britto; Cesar Augusto Cusin
Developing websites for multiples devices have been a rough task for the past ten years. Devices features -- such as screen size, resolution, internet access, operating system, etc. -- change frequently and new devices emerge every day. Since W3C introduced media queries in CSS3, it's possible to developed tailored interfaces for multiple devices using a single HTML document. The approach of Responsive Web Design has been used media queries as support for developing adaptive and flexible layouts, however, it's not supported in legacy browsers. In this paper, we present CSS Browser Selector Plus, a cross-browser alternative method using JavaScript to support CSS3 media queries for developing responsive web considering older browsers.
A meteoroid on steroids: ranking media items stemming from multiple social networks BIBAFull-Text 31-34
  Thomas Steiner
We have developed an application called Social Media Illustrator that allows for finding media items on multiple social networks, clustering them by visual similarity, ranking them by different criteria, and finally arranging them in media galleries that were evaluated to be perceived as aesthetically pleasing. In this paper, we focus on the ranking aspect and show how, for a given set of media items, the most adequate ranking criterion combination can be found by interactively applying different criteria and seeing their effect on-the-fly. This leads us to an empirically optimized media item ranking formula, which takes social network interactions into account. While the ranking formula is not universally applicable, it can serve as a good starting point for an individually adapted formula, all within the context of Social Media Illustrator. A demo of the application is available publicly online at the URL http://social-media-illustrator.herokuapp.com/.
Creating 3rd generation web APIs with hydra BIBAFull-Text 35-38
  Markus Lanthaler
In this paper we describe a novel approach to build hypermedia-driven Web APIs based on Linked Data technologies such as JSON-LD. We also present the result of implementing a first prototype featuring both a RESTful Web API and a generic API client. To the best of our knowledge, no comparable integrated system to develop Linked Data-based APIs exists.

Posters: behavioral analysis and personalization

Scaling matrix factorization for recommendation with randomness BIBAFull-Text 39-40
  Lei Tang; Patrick Harrington
Recommendation is one of the core problems in eCommerce. In our application, different from conventional collaborative filtering, one user can engage in various types of activities in a sequence. Meanwhile, the number of users and items involved are quite huge, entailing scalable approaches. In this paper, we propose one simple approach to integrate multiple types of user actions for recommendation. A two-stage randomized matrix factorization is presented to handle large-scale collaborative filtering where alternating least squares or stochastic gradient descent is not viable. Empirical results show that the method is quite scalable, and is able to effectively capture correlations between different actions, thus making more relevant recommendations.
Link prediction in social networks based on hypergraph BIBAFull-Text 41-42
  Dong Li; Zhiming Xu; Sheng Li; Xin Sun
In recent years, online social networks have undergone a significant growth and attracted much attention. In these online social networks, link prediction is a critical task that not only offers insights into the factors behind creation of individual social relationship but also plays an essential role in the whole network growth. In this paper, we propose a novel link prediction method based on hypergraph. In contrast with conventional methods that using ordinary graph, we model the social network as a hypergraph, which can fully capture all types of objects and either the pair wise or high-order relations among these objects in the network. Then the link prediction task is formulated as a ranking problem on this hypergraph. Experimental results on Sina-Weibo dataset have demonstrated the effectiveness of our methods.
Inferring audience partisanship for YouTube videos BIBAFull-Text 43-44
  Ingmar Weber; Venkata Rama Kiran Garimella; Erik Borra
Political campaigning and the corresponding advertisement money are increasingly moving online. Some analysts claim that the U.S. elections were partly won through a smart use of (i) targeted advertising and (ii) social media. But what type of information do politicized users consume online? And, the other way around, for a given content, e.g. a YouTube video, is it possible to predict its political audience? To address this latter question, we present a large scale study of anonymous YouTube video consumption of politicized users, where political orientation is derived from visits to "beacon pages", namely, political partisan blogs. Though our techniques are relevant for targeted political advertising, we believe that our findings are also of a wider interest.
Cross-region collaborative filtering for new point-of-interest recommendation BIBAFull-Text 45-46
  Ning Zheng; Xiaoming Jin; Lianghao Li
With the rapid growth of location-based social networks (LBSNs), Point-of-Interest (POI) recommendation is in increasingly higher demand these years. In this paper, our aim is to recommend new POIs to a user in regions where he has rarely been before. Different from the classical memory-based recommendation algorithms using user rating data to compute similarity between users or items to make recommendation, we propose a cross-region collaborative filtering method based on hidden topics mined from user check-in records to recommend new POIs. Experimental results on a real-world LBSNs dataset show that our method consistently outperforms naive CF method.
Incorporating author preference in sentiment rating prediction of reviews BIBAFull-Text 47-48
  Subhabrata Mukherjee; Gaurab Basu; Sachindra Joshi
Traditional works in sentiment analysis do not incorporate author preferences during sentiment classification of reviews. In this work, we show that the inclusion of author preferences in sentiment rating prediction of reviews improves the correlation with ground ratings, over a generic author independent rating prediction model. The overall sentiment rating prediction for a review has been shown to improve by capturing facet level rating. We show that this can be further developed by considering author preferences in predicting the facet level ratings, and hence the overall review rating. To the best of our knowledge, this is the first work to incorporate author preferences in rating prediction.
Board coherence in Pinterest: non-visual aspects of a visual site BIBAFull-Text 49-50
  Krishna Y. Kamath; Ana-Maria Popescu; James Caverlee
Pinterest is a fast-growing interest network with significant user engagement and monetization potential. This paper explores quality signals for Pinterest boards, in particular the notion of board coherence. We find that coherence can be assessed with promising results and we explore its relation to quality signals based on social interaction.
Fragmented social media: a look into selective exposure to political news BIBAFull-Text 51-52
  Jisun An; Daniele Quercia; Jon Crowcroft
The hypothesis of selective exposure assumes that people crave like-minded information and eschew information that conflicts with their beliefs, and that has negative consequences on political life. Yet, despite decades of research, this hypothesis remains theoretically promising but empirically difficult to test. We look into news articles shared on Facebook and examine whether selective exposure exists or not in social media. We find a concrete evidence for a tendency that users predominantly share like-minded news articles and avoid conflicting ones, and partisans are more likely to do that. Building tools to counter partisanship on social media would require the ability to identify partisan users first. We will show that those users cannot be distinguished from the average user as the two subgroups do not show any demographic difference.
Utility discounting explains informational website traffic patterns before a hurricane BIBAFull-Text 53-54
  Ben Priest; Kevin Gold
We demonstrate that psychological models of utility discounting can explain the pattern of increased hits to weather websites in the days preceding a predicted weather disaster. We parsed the HTTP request lines issued by the web proxy for a mid-sized enterprise leading up to a hurricane, filtering for visits to weather-oriented websites. We fit four discounting models to the observed activity and found that our data matched hyperboloid models extending hyperbolic discounting.
Political hashtag hijacking in the U.S BIBAFull-Text 55-56
  Asmelash Teka Hadgu; Kiran Garimella; Ingmar Weber
We study the change in polarization of hashtags on Twitter over time and show that certain jumps in polarity are caused by "hijackers" engaged in a particular type of hashtag war.
Learning to annotate tweets with crowd wisdom BIBAFull-Text 57-58
  Wei Feng; Jianyong Wang
In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. However, two problems remain unsolved during an annotation: (1) Users have no way to know whether some related hashtags have already been created. (2) Users have their own way to categorize tweets. Thus personalization is needed. To address the above problems, we develop a statistical model for Personalized Hashtag Recommendation. With millions of "tweet, hashtag" pairs being generated everyday, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Our model considers rich auxiliary information like URLs, locations, social relation, temporal characteristics of hashtag adoption, etc. We show our model successfully outperforms existing methods on real datasets crawled from Twitter.
To follow or not to follow: a feature evaluation BIBAFull-Text 59-60
  Yanan Zhu; Nazli Goharian
The features available in Twitter provide meaningful information that can be harvested to provide a ranked list of followees to each user. We hypothesize that retweet and mention features can be further enriched by incorporating both temporal and additional/indirect links from within user's community. Our empirical results provide insights into the effectiveness of each feature, and evaluate our proposed similarity measures in ranking the followees. Utilizing temporal information and indirect links improves the effectiveness of retweet and mention features in terms of nDCG.
Topical organization of user comments and application to content recommendation BIBAFull-Text 61-62
  Vidit Jain; Esther Galbrun
On a news website, an article may receive thousands of comments from its readers on a variety of topics. The usual display of these comments in a ranked list, e.g. by popularity, does not allow the user to follow discussions on a particular topic. Organizing them by semantic topics enables the user not only to selectively browse comments on a topic, but also to discover other significant topics of discussion in comments. This topical organization further allows to explicitly capture the immediate interests of the user even when she is not logged in. Here we use this information to recommend content that is relevant in the context of the comments being read by the user. We present an algorithm for building such a topical organization in a practical setting and study different recommendation schemes. In a pilot study, we observe these comments-to-article recommendations to be preferred over the standard article-to-article recommendations.
History-aware critiquing-based conversational recommendation BIBAFull-Text 63-64
  Yasser Salem; Jun Hong
In this paper we present a new approach to critiquing-based conversational recommendation, which we call History-Aware Critiquing (HAC). It takes a case-based reasoning approach by reusing relevant recommendation sessions of past users to short-cut the recommendation session of the current user. It selects relevant recommendation sessions from a case base that contains the successful recommendation sessions of past users. A past recommendation session can be selected if it contains similar recommended items to the ones in the current session and its critiques sufficiently overlap with the critiques so far in the current session. HAC extends experience-based critiquing (EBC).
   Our experimental results show that, in terms of recommendation efficiency, while EBC performs better than standard critiquing (STD), it does not perform as well as more recent techniques such as incremental critiquing (IC), whereas HAC achieves better recommendation efficiency over both STD and IC.
An effective general framework for localized content optimization BIBAFull-Text 65-66
  Yoshiyuki Inagaki; Jiang Bian; Yi Chang
Local search services have been gaining interests from Web users who seek the information near certain geographical locations. Particularly, those users usually want to find interesting information about what is happening nearby. In this poster, we introduce the localized content optimization problem to provide Web users with authoritative, attractive and fresh information that are really interesting to people around the certain location. To address this problem, we propose a general learning framework and develop a variety of features. Our evaluations based on the data set from a commercial localized Web service demonstrate that our framework is highly effective at providing contents that are more relevant to users' localized information need.
Unfolding dynamics in a social network: co-evolution of link formation and user interaction BIBAFull-Text 67-68
  Zhi Yang; Ji long Xue; Han Xiao Zhao; Xiao Wang; Ben Y. Zhao; Yafei Dai
Measurement studies of online social networks show that all social links are not equal, and the strength of each link is best characterized by the frequency of interactions between the linked users.To date, few studies have been able to examine detailed interaction data over time, and none have studied the problem of modeling user interactions. This paper proposes a generative model of social interactions that captures the inherently heterogeneous strengths of social links, thus having broad implications on the design of social network algorithms such as friend recommendation, information diffusion and viral marketing.
Mining emotions in short films: user comments or crowdsourcing? BIBAFull-Text 69-70
  Claudia Orellana-Rodriguez; Ernesto Diaz-Aviles; Wolfgang Nejdl
Short films are regarded as an alternative form of artistic creation, and they express, in a few minutes, a whole gamma of different emotions oriented to impact the audience and communicate a story. In this paper, we exploit a multi-modal sentiment analysis approach to extract emotions in short films, based on the film criticism expressed through social comments from the video-sharing platform YouTube. We go beyond the traditional polarity detection (i.e., positive/negative), and extract, for each analyzed film, four opposing pairs of primary emotions: joy-sadness, anger-fear, trust-disgust, and anticipation-surprise. We found that YouTube comments are a valuable source of information for automatic emotion detection when compared to human analysis elicited via crowdsourcing.
Offering language based services on social media by identifying user's preferred language(s) from romanized text BIBAFull-Text 71-72
  Mitesh M. Khapra; Salil Joshi; Ananthakrishnan Ramanathan; Karthik Visweswariah
With the increase of multilingual content and multilingual users on the web, it is prudent to offer personalized services and ads to users based on their language profile (i.e., the list of languages that a user is conversant with). Identifying the language profile of a user is often non-trivial because (i) users often do not specify all the languages known to them while signing up for an online service (ii) users of many languages (especially Indian languages) largely use Latin/Roman script to write content in their native language. This makes it non-trivial for a machine to distinguish the language of one comment from another. This situation presents an opportunity for offering following language based services for romanized content (i) hide romanized comments which belong to a language which is not known to the user (ii) translate romanized comments which belong to a language which is not known to the user (iii) transliterate romanized comments which belong to a language which is known to the user (iv) show language based ads by identifying languages known to a user based on the romanized comments that he wrote/read/liked. We first use a simple bootstrapping based semi-supervised algorithm for identify the language of a romanized comment. We then apply this algorithm to all the comments written/read/liked by a user to build a language profile of the user and propose that this profile can be used to offer the services mentioned above.

Posters: bridging structured and unstructured data

Zero-cost labelling with web feeds for weblog data extraction BIBAFull-Text 73-74
  George Gkotsis; Karen Stepanyan; Alexandra I. Cristea; Mike S. Joy
Data extraction from web pages often involves either human intervention for training a wrapper or a reduced level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, we propose a fully automated approach in generating a wrapper for weblogs, which exploits web feeds for cheap labelling of weblog properties. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. Our evaluation shows that our approach is robust, accurate and efficient in handling different types of weblogs.
On using inter-document relations in microblog retrieval BIBAFull-Text 75-76
  Jesus A. Rodriguez Perez; Yashar Moshfeghi; Joemon M. Jose
Microblog Ad-hoc retrieval has received much attention in recent years. As a result of the high vocabulary diversity of the publishing users, a mismatch is formed between the queries being formulated and the tweets representing the actual topics. In this work, we present a re-ranking approach relying on inter-document relations, which attempts to bridge this gap. Experiments with TREC's Microblog 2012 collection show that including such information in the retrieval process, statistically significantly improves retrieval effectiveness in terms of Precision and MAP, when the baseline performs well as a starting point.
Towards focused knowledge extraction: query-based extraction of structured summaries BIBFull-Text 77-78
  Besnik Fetahu; Bernardo Pereira Nunes; Stefan Dietze
Complexity and algorithms for composite retrieval BIBFull-Text 79-80
  Sihem Amer-Yahia; Francesco Bonchi; Carlos Castillo; Esteban Feuerstein; Isabel Méndez-Díaz; Paula Zabala
RESLVE: leveraging user interest to improve entity disambiguation on short text BIBAFull-Text 81-82
  Elizabeth L. Murnane; Bernhard Haslhofer; Carl Lagoze
We address the Named Entity Disambiguation (NED) problem for short, user-generated texts on the social Web. In such settings, the lack of linguistic features and sparse lexical context result in a high degree of ambiguity and sharp performance drops of nearly 50% in the accuracy of conventional NED systems. We handle these challenges by developing a general model of user-interest with respect to a personal knowledge context and instantiate it using Wikipedia. We conduct systematic evaluations using individuals' posts from Twitter, YouTube, and Flickr and demonstrate that our novel technique is able to achieve performance gains beyond state-of-the-art NED methods.
A hybrid approach for spotting, disambiguating and annotating places in user-generated text BIBAFull-Text 83-84
  Karen Stepanyan; George Gkotsis; Vangelis Banos; Alexandra I. Cristea; Mike Joy
We introduce a geolocation-aware semantic annotation model that extends the existing solutions for spotting and disambiguation of places within user-generated texts. The implemented prototype processes the text of weblog posts and annotates the places and toponyms. It outperforms existing solutions by taking into consideration the embedded geolocation data. The evaluation of the model is based on a set of randomly selected 3,165 geolocation embedded weblog posts, obtained from 1,775 web feeds. The results demonstrate a high degree of accuracy in annotation (87.7%) and a considerable gain (27.8%) in identifying additional entities, and therefore support the adoption of the model for supplementing the existing solutions.
HIGGINS: knowledge acquisition meets the crowds BIBAFull-Text 85-86
  Sarath Kumar Kondreddi; Peter Triantafillou; Gerhard Weikum
We present HIGGINS, a system for Knowledge Acquisition (KA), placing emphasis on its architecture. The distinguishing characteristic and novelty of HIGGINS lies in its blending of two engines: an automated Information Extraction (IE) engine, aided by semantic resources and statistics, and a game-based Human Computing (HC) engine. We focus on KA from web pages and text sources and, in particular, on deriving relationships between entities. As a running application we utilize movie narratives, from which we wish to derive relationships among movie characters.
AELA: an adaptive entity linking approach BIBAFull-Text 87-88
  Bianca Pereira; Nitish Aggarwal; Paul Buitelaar
The number of available Linked Data datasets has been increasing over time. Despite this, their use to recognise entities in unstructured plain text (Entity Linking task) is still limited to a small number of datasets. In this paper we propose a framework adaptable to the structure of generic Linked Data datasets. This adaptability allows a broader use of Linked Data datasets for the Entity Linking task.

Posters: content analysis

Content extraction using diverse feature sets BIBAFull-Text 89-90
  Matthew E. Peters; Dan Lecocq
The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction that combines diverse feature sets and methods. Our main contributions are: a) preliminary results that show combining feature sets generally improves performance; and b) a method for including semantic information via id and class attributes applicable to HTML5. We also show that performance decreases on a new benchmark data set that better represents modern chrome.
Predicting relevant news events for timeline summaries BIBAFull-Text 91-92
  Giang Binh Tran; Mohammad Alrifai; Dat Quoc Nguyen
This paper presents a framework for automatically constructing timeline summaries from collections of web news articles. We also evaluate our solution against manually created timelines and in comparison with related work.
Collective matrix factorization for co-clustering BIBAFull-Text 93-94
  Mrinmaya Sachan; Shashank Srivastava
We outline some matrix factorization approaches for co-clustering polyadic data (like publication data) using non-negative factorization (NMF). NMF approximates the data as a product of non-negative low-rank matrices, and can induce desirable clustering properties in the matrix factors through a flexible range of constraints. We show that simultaneous factorization of one or more matrices provides potent approaches for co-clustering.
Walk and learn: a two-stage approach for opinion words and opinion targets co-extraction BIBAFull-Text 95-96
  Liheng Xu; Kang Liu; Siwei Lai; Yubo Chen; Jun Zhao
This paper proposes a novel two-stage method for opinion words and opinion targets co-extraction. In the first stage, a Sentiment Graph Walking algorithm is proposed, which naturally incorporates syntactic patterns in a graph to extract opinion word/target candidates. In the second stage, we adopt a self-Learning strategy to refine the results from the first stage, especially for filtering out noises with high frequency and capturing long-tail terms. Preliminary experimental evaluation shows that considering pattern confidence in the graph is beneficial and our approach achieves promising improvement over three competitive baselines.
Discovery of technical expertise from open source code repositories BIBAFull-Text 97-98
  Rahul Venkataramani; Atul Gupta; Allahbaksh Asadullah; Basavaraju Muddu; Vasudev Bhat
Online Question and Answer websites for developers have emerged as the main forums for interaction during the software development process. The veracity of an answer in such websites is typically verified by the number of 'upvotes' that the answer garners from peer programmers using the same forum. Although this mechanism has proved to be extremely successful in rating the usefulness of the answers, it does not lend itself very elegantly to model the expertise of a user in a particular domain. In this paper, we propose a model to rank the expertise of the developers in a target domain by mining their activity in different opensource projects. To demonstrate the validity of the model, we built a recommendation system for StackOverflow which uses the data mined from GitHub.
Power dynamics in spoken interactions: a case study on 2012 republican primary debates BIBAFull-Text 99-100
  Vinodkumar Prabhakaran; Ajita John; Dorée D. Seligmann
In this paper, we explore how the power differential between participants of an interaction affects the way they interact in the context of political debates. We analyze the 2012 Republican presidential primary debates where we model the power index of each candidate in terms of their poll standings. We find that the candidates' power indices affected the way they interacted with others in the debates as well as how others interacted with them.
A non-learning approach to spelling correction in web queries BIBAFull-Text 101-102
  Jason Soo
We describe an adverse environment spelling correction algorithm, known as Segments. Segments is language and domain independent and does not require any training data. We evaluate Segments' correction rate of transcription errors in web query logs with the state-of-the-art learning approach. We show that in environments where learning approaches are not applicable, such as multilingual documents, Segments has an F1-score within 0.005 of the learning approach.
Extracting implicit features in online customer reviews for opinion mining BIBAFull-Text 103-104
  Yu Zhang; Weixiang Zhu
As the number of customer reviews grows very rapidly, it is essential to summarize useful opinions for buyers, sellers and producers. One key step of opinion mining is feature extraction. Most existing research focus on finding explicit features, only a few attempts have been made to extract implicit features. Nearly all existing research only concentrate on product features, few has paid attention to other features that relates to sellers, services and logistics. Therefore in this paper, we propose a novel co-occurrence association-based method, which aims to extract implicit features in customer reviews and provide more comprehensive and fine-grained mining results.
Co-training and visualizing sentiment evolvement for tweet events BIBAFull-Text 105-106
  Shenghua Liu; Wenjun Zhu; Ning Xu; Fangtao Li; Xue-qi Cheng; Yue Liu; Yuanzhuo Wang
Sentiment classification on tweet events attracts more interest in recent years. The large tweet stream stops people reading the whole classified list to understand the insights. We employ the co-training framework in the proposed algorithm. Features are split into text view features and non-text view features. Two Random Forest (RF) classifiers are trained with the common labeled data on the two views of features separately. Then for each specific event, they collaboratively and periodically train together to boost the classification performance. At last, we propose a "river" graph to visualize the intensity and evolvement of sentiment on an event, which demonstrates the intensity by both color gradient and opinion labels, and the ups and downs of confronting opinions by the river flow. Comparing with the well-known sentiment classifiers, our algorithm achieves consistent increases in accuracy on the tweet events from TREC 2011 Microblogging and our database. The visualization helps people recognize turning and bursting patterns, and predict sentiment trend in an intuitive way.
Cost-effective node monitoring for online hot event detection in sina weibo microblogging BIBAFull-Text 107-108
  Kai Chen; Yi Zhou; Hongyuan Zha; Jianhua He; Pei Shen; Xiaokang Yang
We propose a cost-effective hot event detection system over Sina Weibo platform, currently the dominant microblogging service provider in China. The problem of finding a proper subset of microbloggers under resource constraints is formulated as a mixed-integer problem for which heuristic algorithms are developed to compute approximate solution. Preliminary results show that by tracking about 500 out of 1.6 million candidate microbloggers and processing 15,000 microposts daily, 62% of the hot events can be detected five hours on average earlier than they are published by Weibo.
Solving electrical networks to incorporate supervision in random walks BIBAFull-Text 109-110
  Mrinmkaya Saqchan; Dirk Hovy; Eduard Hovy
Random walks is one of the most popular ideas in computer science. A critical assumption in random walks is that the probability of the walk being at a given vertex at a time instance converges to a limit independent of the start state. While this makes it computationally efficient to solve, it limits their use to incorporate label information. In this paper, we exploit the connection between Random Walks and Electrical Networks to incorporate label information in classification, ranking, and seed expansion.
Information current in Twitter: which brings hot events to the world BIBAFull-Text 111-112
  Peilei Liu; Jintao Tang; Ting Wang
In this paper we investigate information propagation in Twitter from the geographical view on the global scale. An information propagation phenomenon what we call "information current" has been discovered. According to this phenomenon, we propose a hypothesis that changes of information flows may be related to real-time events. Through analysis of retweets, we show that our hypothesis is supported by experiment results. Moreover, it is discovered that the retweet texts are more effective than common tweet texts for real-time event detection. This means that Twitter could be a good filter of texts for event detection.

Posters: internet monetization and incentives

Traffic quality based pricing in paid search using two-stage regression BIBAFull-Text 113-114
  Rouben Amirbekian; Ye Chen; Alan Lu; Tak W. Yan; Liangzhong Yin
While the cost-per-click (CPC) pricing model is main stream in sponsored search, the quality of clicks with respect to conversion rates and hence their values to advertisers may vary considerably from publisher to publisher in a large syndication network. Traffic quality shall be used to establish price discounts for clicks from different publishers. These discounts are intended to maintain incentives for high-quality online traffic and to make it easier for advertisers to maintain long-term bid stability. Conversion signal is noisy as each advertiser defines conversion in their own way. It is also very sparse. Traditional way of overcoming signal sparseness is to allow for longer time in accumulating modeling data. However, due to fast-changing conversion trends, such longer time leads to deterioration of the precision in measuring quality. To allow models to adjust to fast-changing trends with sufficient speed, we had to limit time-window for conversion data collection and make it much shorter than the several weeks window commonly used. Such shorter time makes conversions in the training set extremely sparse. To overcome resulting obstacles, we used two-stage regression similar to hurdle regression. First we employed logistic regression to predict zero conversion outcomes. Next, conditioned on non-zero outcomes, we used random forest regression to predict the value of the quotient of two conversion rates. Two-stage model accounts for the zero inflation due to the sparseness of the conversion signal. The combined model maintains good precision and allows faster reaction to the temporal changes in traffic quality including changes due to certain actions by publishers that may lead to click-price inflation.
Dynamic evaluation of online display advertising with randomized experiments: an aggregated approach BIBAFull-Text 115-116
  Joel Barajas; Ram Akella; Marius Holtan; Jaimie Kwon; Aaron Flores; Victor Andrei
We perform a randomized experiment to estimate the effects of a display advertising campaign on online user conversions. We present a time series approach using Dynamic Linear Models to decompose the daily aggregated conversions into seasonal and trend components. We attribute the difference between control and study trends to the campaign. We test the method using two real campaigns run for 28 and 21 days respectively from the Advertising.com ad network.
New features for query dependent sponsored search click prediction BIBAFull-Text 117-118
  Ilya Trofimov
Click prediction for sponsored search is an important problem for commercial search engines. Good click prediction algorithm greatly affects on the revenue of the search engine, user experience and brings more clicks to landing pages of advertisers. This paper presents new query-dependent features for the click prediction algorithm based on treating query and advertisement as bags of words. New features can improve prediction accuracy both for ads having many and few views.
Modeling click and relevance relationship for sponsored search BIBAFull-Text 119-120
  Wei Vivian Zhang; Ye Chen; Mitali Gupta; Swaraj Sett; Tak W. Yan
Click-through rate (CTR) prediction and relevance ranking are two fundamental problems in web advertising. In this study, we address the problem of modeling the relationship between CTR and relevance for sponsored search. We used normalized relevance scores comparable across all queries to represent relevance when modeling with CTR, instead of directly using human judgment labels or relevance scores valid only within same query. We classified clicks by identifying their relevance quality using dwell time and session information, and compared all clicks versus selective clicks effects when modeling relevance.
   Our results showed that the cleaned click signal outperforms raw click signal and others we explored, in terms of relevance score fitting. The cleaned clicks include clicks with dwell time greater than 5 seconds and last clicks in session. Besides traditional thoughts that there is no linear relation between click and relevance, we showed that the cleaned click based CTR can be fitted well with the normalized relevance scores using a quadratic regression model. This relevance-click model could help to train ranking models using processed click feedback to complement expensive human editorial relevance labels, or better leverage relevance signals in CTR prediction.
Optimization of ads allocation in sponsored search BIBAFull-Text 121-122
  Alexey Chervonenkis; Anna Sorokina; Valery A. Topinsky
We introduce the optimization problem of target-specific ads allocation. Technique for solving this problem for different target-constraints structures is presented. This technique allows us to find optimal ads allocation which maximize the target such as CTR, Revenue or other system performances subject to some linear constraints. We show that the optimal ads allocation depends on both the target and constraints variables.
A joint optimization of incrementality and revenue to satisfy both advertiser and publisher BIBAFull-Text 123-124
  Dmitry Pechyony; Rosie Jones; Xiaojing Li
A long-standing goal in advertising is to reduce wasted costs due to advertising to people who are unlikely to buy, as well as to those who would make a purchase whether they saw an ad or not. The ideal audience for the advertiser are those incremental users who would buy if shown an ad, and would not buy, if not shown the ad. On the other hand, for publishers who are paid when the user clicks or buys, revenue may be maximized by showing ads to those users who are most likely to click or purchase. We show analytically and empirically that an optimization towards one metric might result in an inferior performance in the other one. We present a novel algorithm, called SLC, that performs a joint optimization towards both advertisers' and publishers' goals and provides superior results in both.
A case-based analysis of the effect of offline media on online conversion actions BIBAFull-Text 125-126
  Damir Vandic; Didier Nibbering; Flavius Frasincar
In this paper, we investigate how offline advertising, by means of TV and radio, influences online search engine advertisement. Our research is based on the search engine-driven conversion actions of a 2012 marketing campaign of the potato chips manufacturer Lays. In our analysis we use several models, including linear regression (linear model) and Support Vector Regression (non-linear model). Our results confirm that offline commercials have a positive effect on the number of conversion actions from online marketing campaigns. This effect is especially visible in the first 50 minutes after the advertisement broadcasting.

Posters: search systems and applications

An error driven approach to query segmentation BIBAFull-Text 127-128
  Wei Zhang; Yunbo Cao; Chin-Yew Lin; Jian Su; Chew-Lim Tan
Query segmentation is the task of splitting a query into a sequence of non-overlapping segments that completely cover all tokens in the query. The majority of query segmentation methods are unsupervised. In this paper, we propose an error-driven approach to query segmentation (EDQS) with the help of search logs, which enables unsupervised training with guidance from the system-specific errors. In EDQS, we first detect the system's errors by examining the consistency among the segmentations of similar queries. Then, a model is trained by the detected errors to select the correct segmentation of a new query from the top-n outputs of the system. Our evaluation results show that EDQS can significantly boost the performance of state-of-the-art query segmentation methods on a publicly available data set.
Introducing search behavior into browsing based models of page's importance BIBAFull-Text 129-130
  Maxim Zhukovskiy; Andrei Khropov; Gleb Gusev; Pavel Serdyukov
BrowseRank algorithm and its modifications are based on analyzing users' browsing trails. Our paper proposes a new method for computing page importance using a more realistic and effective search-aware model of user browsing behavior than the one used in BrowseRank.
Learning to shorten query sessions BIBAFull-Text 131-132
  Cristina Ioana Muntean; Franco Maria Nardini; Fabrizio Silvestri; Marcin Sydow
We propose the use of learning to rank techniques to shorten query sessions by maximizing the probability that the query we predict is the "final" query of the current search session. We present a preliminary evaluation showing that this approach is a promising research direction.
The ACE theorem for querying the web of data BIBAFull-Text 133-134
  Jürgen Umbrich; Claudio Gutierrez; Aidan Hogan; Marcel Karnstedt; Josiane Xavier Parreira
Inspired by the CAP theorem, we identify three desirable properties when querying the Web of Data: Alignment (results up-to-date with sources), Coverage (results covering available remote sources), and Efficiency (bounded resources). In this short paper, we show that no system querying the Web can meet all three ACE properties, but instead must make practical trade-offs that we outline.
Towards leveraging closed captions for news retrieval BIBAFull-Text 135-136
  Roi Blanco; Gianmarco De Francisci Morales; Fabrizio Silvestri
IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected one. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real-time.
Searching the deep web using proactive phrase queries BIBAFull-Text 137-138
  Wensheng Wu; Tingting Zhong
This paper proposes ipq, a novel search engine that proactively transforms query forms of Deep Web sources into phrase queries, constructs query evaluation plans, and caches results for popular queries offline. Then at query time, keyword queries are simply matched with phrase queries to retrieve results. ipq embodies a novel dual-ranking framework for query answering and novel solutions for discovering frequent attributes and queries. Preliminary experiments show the great potentials of ipq.
Graded relevance ranking for synonym discovery BIBAFull-Text 139-140
  Andrew Yates; Nazli Goharian; Ophir Frieder
Interest in domain-specific search is steadfastly increasing, yielding a growing need for domain-specific synonym discovery. Existing synonym discovery methods perform poorly when faced with the realistic task of identifying a target term's synonyms from among many candidates. We approach domain-specific synonym discovery as a graded relevance ranking problem in which a target term's synonym candidates are ranked by their quality. In this scenario a human editor uses each ranked list of synonym candidates to build a domain-specific thesaurus. We evaluate our method for graded relevance ranking of synonym candidates and find that it outperforms existing methods.
Ranking method specialized for content descriptions of classical music BIBAFull-Text 141-142
  Taku Kuribayashi; Yasuhito Asano; Masatoshi Yoshikawa
In this paper, we propose novel ranking methods of effectively finding content descriptions of classical music compositions. In addition to rather naive methods using technical term frequency and latent Dirichlet allocation (LDA), we proposed a novel classification of web pages about classical music and used the characteristics of the classification for our method of search by labeled LDA (L-LDA). The experimental results showed our method performed well at finding content descriptions of classical music compositions.
Towards a development process for geospatial information retrieval and search BIBAFull-Text 143-144
  Dirk Ahlers
Geospatial search as a special type of vertical search has specific requirements and challenges. While the general principle of resource discovery, extraction, indexing, and search holds, geospatial search systems are tailored to the specific use case at hand with many individual adaptations. In this short overview, we aim to collect and organize the main organizing principles for the multitude of challenges and adaptations to be considered within the development process to work towards a more formal description.
Searching for interestingness in Wikipedia and Yahoo!: answers BIBAFull-Text 145-146
  Yelena Mejova; Ilaria Bordino; Mounia Lalmas; Aristides Gionis
In many cases, when browsing the Web, users are searching for specific information. Sometimes, though, users are also looking for something interesting, surprising, or entertaining. Serendipitous search puts interestingness on par with relevance. We investigate how interesting are the results one can obtain via serendipitous search, and what makes them so, by comparing entity networks extracted from two prominent social media sites, Wikipedia and Yahoo! Answers.
A click model for time-sensitive queries BIBAFull-Text 147-148
  Seung Eun Lee; Dongug Kim
User behavior on search results pages provides a clue about the query intent and the relevance of documents. To incorporate this information into search rankings, a variety of click modeling techniques have been proposed so far and now they are widely used in commercial search engines. For time-sensitive queries, however, applying click models can degrade the search relevance because the best document in the past may not be the current best answer. To address this problem, it is required to detect a time point, a turning point, where the search intent for a given query changes and to reflect it in click models. In this work, we devised a method to detect the turning point of a query from its search volume history. The proposed click model is designed to take only user behavior observed after the turning points. We applied our model in a commercial search engine and evaluated its relevance.
Intent classification of voice queries on mobile devices BIBAFull-Text 149-150
  Subhabrata Mukherjee; Ashish Verma; Kenneth W. Church
Mobile query classification faces the usual challenges of encountering short and noisy queries as in web search. However, the task of mobile query classification is made difficult by the presence of more inter-active and personalized queries like map, command and control, dialogue, joke etc. Voice queries are made more difficult than typed queries due to the errors introduced by the automatic speech recognizer. This is the first paper, to the best of our knowledge, to bring the complexities of voice search and intent classification together. In this paper, we propose some novel features for intent classification, like the url's of the search engine results for the given query. We also show the effectiveness of other features derived from the part-of-speech information of the query and search engine results, in proposing a multi-stage classifier for intent classification. We evaluate the classifier using tagged data, collected from a voice search android application, where we achieve an average of 22% f-score improvement per category, over the commonly used bag-of-words baseline.
Leveraging geographical metadata to improve search over social media BIBAFull-Text 151-152
  Alexander Kotov; Yu Wang; Eugene Agichtein
We propose the methods for document, query and relevance model expansion that leverage geographical metadata provided by social media. In particular, we propose a geographically-aware extension of the LDA topic model and utilize the resulting topics and language models in our expansion methods. The proposed approach has been experimentally evaluated over a large sample of Twitter, demonstrating significant improvements in search accuracy over traditional (geographically-unaware) retrieval models.
Place value: word position shifts vital to search dynamics BIBAFull-Text 153-154
  Rishiraj Saha Roy; Anusha Suresh; Niloy Ganguly; Monojit Choudhury
With fast changing information needs in today's world, it is imperative that search engines precisely understand and exploit temporal changes in Web queries. In this work, we look at shifts in preferred positions of segments in queries over an interval of four years. We find that such shifts can predict key changes in usage patterns, and explain the observed increase in query lengths. Our findings indicate that recording positional statistics can be vital for understanding user intent in Web search queries.

Posters: security, privacy, trust, and abuse

Synthetic review spamming and defense BIBAFull-Text 155-156
  Alex Morales; Huan Sun; Xifeng Yan
Online reviews are widely adopted in many websites such as Amazon, Yelp, and TripAdvisor. Positive reviews can bring significant financial gains, while negative ones often cause sales loss. This fact, unfortunately, results in strong incentives for opinion spam to mislead readers. Instead of hiring humans to write deceptive reviews, in this work, we bring into attention an automated, low-cost process for generating fake reviews, variations of which could be easily employed by evil attackers in reality. To the best of our knowledge, we are the first to expose the potential risk of machine-generated deceptive reviews. Our simple review synthesis model uses one truthful review as a template, and replaces its sentences with those from other reviews in a repository. The fake reviews generated by this mechanism are extremely hard to detect: Both the state-of-the-art machine detectors and human readers have an error rate of 35%-48%. A novel defense method that leverages the difference of semantic flows between fake and truthful reviews is developed, reducing the detection error rate to approximately 22%. Nevertheless, it is still a challenging research task to further decrease the error rate.
REDACT: a framework for sanitizing RDF data BIBAFull-Text 157-158
  Jyothsna Rachapalli; Vaibhav Khadilkar; Murat Kantarcioglu; Bhavani Thuraisingham
Resource Description Framework (RDF) is the foundational data model of the Semantic Web, and is essentially designed for integration of heterogeneous data from varying sources. However, lack of security features for managing sensitive RDF data while sharing may result in privacy breaches, which in turn, result in loss of user trust. Therefore, it is imperative to provide an infrastructure to secure RDF data. We present a set of graph sanitization operations that are built as an extension to SPARQL. These operations allow one to sanitize sensitive parts of an RDF graph and further enable one to build more sophisticated security and privacy features, thus allowing RDF data to be shared securely.
Framework for evaluation of text captchas BIBAFull-Text 159-160
  Achint Thomas; Kunal Punera; Lyndon Kennedy; Belle Tseng; Yi Chang
Interactive websites use text-based Captchas to prevent unauthorized automated interactions. These Captchas must be easy for humans to decipher while being difficult to crack by automated means. In this work we present a framework for the systematic study of Captchas along these two competing objectives. We begin by abstracting a set of distortions that characterize current and past commercial text-based Captchas. By means of user studies, we quantify the way human Captcha solving performance varies with changes in these distortion parameters. To quantify the effect of these distortions on the accuracy of automated solvers (bots), we propose a learning-based algorithm that performs automated Captcha segmentation driven by character recognition. Results show that our proposed algorithm is generic enough to solve text-based Captchas with widely varying distortions without requiring the use of hand-coded image processing or heuristic rules.
A probability-based trust prediction model using trust-message passing BIBAFull-Text 161-162
  Hyun-Kyo Oh; Jin-Woo Kim; Sang-Wook Kim; Kichun Lee
We propose a probability-based trust prediction model based on trust-message passing which takes advantage of the two kinds of information: an explicit information and an implicit information.
RepRank: reputation in a peer-to-peer online system BIBAFull-Text 163-164
  Zeqian Shen; Neel Sundaresan
Peer-to-peer e-commerce networks exemplify online lemon markets. Trust is key to sustaining these networks. We present a reputation system named RepRank that approaches trust with an intuition that in the peer-to-peer e-commerce world consisting of buyers and sellers, good buyers are those who buy from good sellers, and good sellers are those from whom good buyers buy. We propagate trust and distrust in a network using this mutually recursive definition. We discuss the algorithms and present the evaluation results.
The STAC (security toolbox: attacks & countermeasures) ontology BIBAFull-Text 165-166
  Amelie Gyrard; Christian Bonnet; Karima Boudaoud
We present a security ontology to help non-security expert software designers or developers to: (1) design secure software and, (2) to understand and be aware of main security concepts and issues. Our security ontology defines the main security concepts such as attacks, countermeasures, security properties and their relationships. Countermeasures can be cryptographic concepts (encryption algorithm, key management, digital signature, hash function), security tools or security protocols. The purpose of this ontology is to be reused in numerous domains such as security of web applications, network management or communication networks (sensor, cellular and wireless). The ontology and a user interface (to use the ontology) are available online.

Posters: semantic web

Modeling uncertain provenance and provenance of uncertainty in W3C PROV BIBAFull-Text 167-168
  Tom De Nies; Sam Coppens; Erik Mannens; Rik Van de Walle
This paper describes how to model uncertain provenance and provenance of uncertain things in a flexible and unintrusive manner using PROV, W3C's new standard for provenance. Three new attributes with clearly defined values and semantics are proposed. Modeling this information is an important step towards the modeling and derivation of trust from resources whose provenance is described using PROV.
Scalable processing of flexible graph pattern queries on the cloud BIBAFull-Text 169-170
  Padmashree Ravindra; Kemafor Anyanwu
Flexible exploration of large RDF datasets with unknown relationships can be enabled using 'unbound-property' graph pattern queries. Relational-style processing of such queries using normalized relations results in redundant information in intermediate results due to the repetition of adjoining bound (fixed) properties. Such redundancy negatively impacts the disk I/O, network transfer costs, and the required disk space while processing RDF query workloads on MapReduce-based systems. This work proposes packing and lazy unpacking strategies to minimize the redundancy in intermediate results while processing unbound-property queries. In addition to keeping the results compact, this work evaluates RDF queries using the Nested TripleGroup Data Model and Algebra (NTGA) that enables shorter MapReduce execution workflows. Experimental results demonstrate the benefit of this work over RDF query processing using relational-style systems such as Apache Pig and Hive.
Computing semantic relatedness from human navigational paths on Wikipedia BIBAFull-Text 171-172
  Philipp Singer; Thomas Niebler; Markus Strohmaier; Andreas Hotho
This paper presents a novel approach for computing semantic relatedness between concepts on Wikipedia by using human navigational paths for this task. Our results suggest that human navigational paths provide a viable source for calculating semantic relatedness between concepts on Wikipedia. We also show that we can improve accuracy by intelligent selection of path corpora based on path characteristics indicating that not all paths are equally useful. Our work makes an argument for expanding the existing arsenal of data sources for calculating semantic relatedness and to consider the utility of human navigational paths for this task.
Discovering multilingual concepts from unaligned web documents by exploring associated images BIBAFull-Text 173-174
  Xiaochen Zhang; Xiaoming Jin; Lianghao Li; Dou Shen
The Internet is experiencing an explosion of information presented in different languages. Though written in different languages, some articles implicitly share common concepts. In this paper, we propose a novel framework to mine cross-language common concepts from unaligned web documents. Specifically, visual words of images are used to bridge articles in different languages and then common concepts of multiple languages are learned by using an existing topic modeling algorithm. We conduct cross-lingual text classification in a real-world data set using the mined multilingual concepts from our method. The experiment results show that our approach is effective to mine cross-lingual common concepts.
Fria: fast and robust instance alignment BIBAFull-Text 175-176
  Sanghoon Lee; Jongwuk Lee; Seung-won Hwang
This paper proposes Fria, a fast and robust instance alignment framework across two independently built knowledge bases (KBs). Our objective is two-fold: (1) to design an effective instance similarity measure and (2) to build a fast and robust alignment framework. Specifically, Fria consists of two-phases. Fria first achieves high-precision alignment for seed matches which have strong evidence for aligning. To obtain high-recall alignment, Fria then divides non-matched instances according to the types identified from seeds, and gives additional chances to the same-typed instances to be matched. Experimental results show that Fria is fast and robust, by achieving comparable accuracy to state-of-the-arts and a 10-times speed up.

Posters: social networks and graph analysis

Popularity prediction in microblogging network: a case study on sina weibo BIBAFull-Text 177-178
  Peng Bao; Hua-Wei Shen; Junming Huang; Xue-Qi Cheng
Predicting the popularity of content is important for both the host and users of social media sites. The challenge of this problem comes from the inequality of the popularity of content. Existing methods for popularity prediction are mainly based on the quality of content, the interface of social media site to highlight contents, and the collective behavior of users. However, little attention is paid to the structural characteristics of the networks spanned by early adopters, i.e., the users who view or forward the content in the early stage of content dissemination. In this paper, taking the Sina Weibo as a case, we empirically study whether structural characteristics can provide clues for the popularity of short messages. We find that the popularity of content is well reflected by the structural diversity of the early adopters. Experimental results demonstrate that the prediction accuracy is significantly improved by incorporating the factor of structural diversity into existing methods.
The power of local information in PageRank BIBAFull-Text 179-180
  Marco Bressan; Enoch Peserico; Luca Pretto
Can one assess, by visiting only a small portion of a graph, if a given node has a significantly higher PageRank score than another? We show that the answer strongly depends on the interplay between the required correctness guarantees (is one willing to accept a small probability of error?) and the graph exploration model (can one only visit parents and children of already visited nodes?).
Semantically sampling in heterogeneous social networks BIBAFull-Text 181-182
  Cheng-Lun Yang; Perng-Hwa Kung; Chun-An Chen; Shou-De Lin
Online social networks sampling identifies a representative subnetwork that preserves certain graph property given heterogeneous semantics, with the full network not observed during sampling. This study presents a property, Relational Profile, to account for conditional dependency of node and relation type semantics in a network, and a sampling method to preserve the property. We show the proposed sampling method better preserves Relational Profile. Next, Relational Profile can design features to boost network prediction. Finally, our sampled network trains more accurate prediction models than other sampling baselines.
Sampling bias in user attribute estimation of OSNs BIBAFull-Text 183-184
  Hosung Park; Sue Moon
Recent work on unbiased sampling of OSNs has focused on estimation of the network characteristics such as degree distributions and clustering coefficients. In this work we shift the focus to node attributes. We show that existing sampling methods produce biased outputs and need modifications to alleviate the bias.
Link recommendation for promoting information diffusion in social networks BIBAFull-Text 185-186
  Dong Li; Zhiming Xu; Sheng Li; Xin Sun; Anika Gupta; Katia Sycara
Online social networks mainly have two functions: social interaction and information diffusion. Most of current link recommendation researches only focus on strengthening the social interaction function, but ignore the problem of how to enhance the information diffusion function. For solving this problem, this paper introduces the concept of user diffusion degree and proposes the algorithm for calculating it, then combines it with traditional recommendation methods for reranking recommended links. Experimental results on Email dataset and Amazon dataset under Independent Cascade Model and Linear Threshold Model show that our method noticeably outperforms the traditional methods in terms of promoting information diffusion.
Domain-sensitive opinion leader mining from online review communities BIBAFull-Text 187-188
  Qingliang Miao; Shu Zhang; Yao Meng; Hao Yu
In this paper, we investigate how to identify domain-sensitive opinion leaders in online review communities, and present a model to rank domain-sensitive opinion leaders. To evaluate the effectiveness of the proposed model, we conduct preliminary experiments on a real-world dataset from Amazon.com. Experimental results indicate that the proposed model is effective in identifying domain-sensitive opinion leaders.
Understanding election candidate approval ratings using social media data BIBAFull-Text 189-190
  Danish Contractor; Tanveer Afzal Faruquie
The last few years has seen an exponential increase in the amount of social media data generated daily. Thus, researchers have started exploring the use of social media data in building recommendation systems, prediction models, improving disaster management, discovery trending topics etc. An interesting application of social media is for the prediction of election results. The recently conducted 2012 US Presidential election was the "most tweeted" election in history and provides a rich source of social media posts. Previous work on predicting election outcomes from social media has been largely been based on sentiment about candidates, total volumes of tweets expressing electoral polarity and the like. In this paper we use a collection of tweets to predict the daily approval ratings of the two US presidential candidates and also identify topics that were causal to the approval ratings.
Extracting the multilevel communities based on network structural and nonstructural information BIBAFull-Text 191-192
  Xin Liu; Tsuyoshi Murata; Ken Wakita
Many real-world networks contain nonstructural information on nodes, such as the spatial coordinate of a location, profile of a person, or contents of a web page. In this paper, we propose Dist-Modularity, a unified modularity measure, which is useful in extracting the multilevel communities based on network structural and nonstructural information.
Structural-interaction link prediction in microblogs BIBAFull-Text 193-194
  Jia Yantao; Wang Yuanzhuo; Li Jingyuan; Feng Kai; Cheng Xueqi; Li Jianchen
Link prediction in Microblogs by using unsupervised methods aims to find an appropriate similarity measure between users in the network. However, the measures used by existing work lack a simple way to incorporate the structure of the network and the interactions between users. In this work, we define the retweet similarity to measure the interactions between users in Twitter, and propose a structural-interaction based matrix factorization model for following-link prediction. Experiments on the real world Twitter data show our model outperforms state-of-the-art methods.
Fast anomaly detection despite the duplicates BIBAFull-Text 195-196
  Jay Yoon Lee; U. Kang; Danai Koutra; Christos Faloutsos
Given a large cloud of multi-dimensional points, and an off-the shelf outlier detection method, why does it take a week to finish? After careful analysis, we discovered that duplicate points create subtle issues, that the literature has ignored: if dmax is the multiplicity of the most over-plotted point, typical algorithms are quadratic on dmax. We propose several ways to eliminate the problem; we report wall-clock times and our time savings; and we show that our methods give either exact results, or highly accurate approximate ones.
Recommendation for online social feeds by exploiting user response behavior BIBAFull-Text 197-198
  Ping-Han Soh; Yu-Chieh Lin; Ming-Syan Chen
In recent years, online social networks have been dramatically expanded. Active users spend hours communicating with each other via these networks such that an enormous amount of data is created every second. The tremendous amount of newly created information costs users much time to discover interesting messages from their online social feeds. The problem is even exacerbated if users access these networks via mobile devices. To assist users in discovering interesting messages efficiently, in this paper, we propose a new approach to recommend interesting messages for each user by exploiting the user's response behavior. We extract data from the most popular social network, and the experimental results show that the proposed approach is effective and efficient.

Posters: user interfaces, human factors, and smart devices

Lists as coping strategy for information overload on Twitter BIBAFull-Text 199-200
  Simon de la Rouviere; Kobus Ehlers
When following too many users on microblogging services, information overload occurs due to increased and varied communication activity. Users then either leave, or employ coping strategies to continue benefiting from the service. Through a crawl of 31 684 random users from Twitter and a qualitative survey with 115 respondents, it has been determined that by using lists as an information management coping strategy (filtering and compartmentalising varied communication activity), users are capable of following more users and experience fewer symptoms of information overload.
To crop, or not to crop: compiling online media galleries BIBAFull-Text 201-202
  Thomas Steiner; Christopher Chedeau
We have developed an application for the automatic generation of media galleries that visually and audibly summarize events based on media items like videos and photos from multiple social networks. Further, we have evaluated different media gallery styles with online surveys and examined their pros and cons. Besides the survey results, our contribution is also the application itself, where media galleries of different styles can be created on-the-fly. A demo is available at http://social-media-illustrator.herokuapp.com/.
Unsupervised approach to generate informative structured snippets for job search engines BIBAFull-Text 203-204
  Nikita Spirin; Karrie Karahalios
Aiming to improve user experience for a job search engine, in this paper we propose an idea to switch from query-biased snippets used by most web search engines to rich structured snippets associated with the main sections of a job posting page, which are more appropriate for job search due to specific user needs and the structure of job pages. We present a very simple yet actionable approach to generate such snippets in an unsupervised way. The advantages of the proposed approach are two-fold: it doesn't require manual annotation and therefore can be easily deployed to many languages, which is a desirable property for a job search engine operating internationally; it fuses naturally with the trend towards Mobile Web where the content needs to be optimized for small screen devices and informativeness.
Learning to recommend with multi-faceted trust in social networks BIBAFull-Text 205-206
  Lei Guo; Jun Ma; Zhumin Chen
Traditionally, trust-aware recommendation methods that utilize trust relations for recommender systems assume a single type of trust between users. However, this assumption ignores the fact that trust as a social concept inherently has many aspects. A user may place trust differently to different people. Motivated by this observation, we propose a novel probabilistic factor analysis method, which learns the multi-faceted trust relations and user profiles through a shared user latent feature space. Experimental results on the real product rating data set show that our approach outperforms state-of-the-art methods on the RMSE measure.
Hidden view game: designing human computation games to update maps and street views BIBAFull-Text 207-208
  Jongin Lee; John Kim; KwanHong Lee
Although the Web has abundant information, it does not necessarily contain the latest, most recently updated information. In particular, interactive map websites and the accompanying street view applications often contain information that is a few years old and are somewhat outdated because street views can change quickly. In this work, we propose Hidden View -- a human computation mobile game that enables the updating of maps and street views with the latest information. The preliminary implementation of the game is described and some results collected from a sample user study are presented. This work is the first step towards leveraging human computation and an individual's familiarity with different points-of-interest to keep maps and street views up to date.
ASQ: interactive web presentations for hybrid MOOCs BIBAFull-Text 209-210
  Vasileios Triglianos; Cesare Pautasso
ASQ is a Web application for creating and delivering interactive HTML5 presentations. It is designed to support teachers that need to gather real-time feedback from the students while delivering their lectures. Presentation slides are delivered to viewers that can answer the questions embedded in the slides. The objective is to maximize the efficiency of bi-directional communication between the lecturer and a large audience. More specifically, in the context of a hybrid MOOC classroom, a teacher can use ASQ to get feedback in real time about the level of comprehension of the presented material while reducing the time for gathering survey data, monitoring attendance and assessing solutions.

Posters: web engineering

QMapper: a tool for SQL optimization on hive using query rewriting BIBAFull-Text 211-212
  Yingzhong Xu; Songlin Hu
Although HiveQL offers similar features with SQL, it is still difficult to map complex SQL queries into HiveQL and manual translation often leads to poor performance. A tool named QMapper is developed to address this problem by utilizing query rewriting rules and cost-based MapReduce flow evaluation on the basis of column statistics. Evaluation demonstrates that while assuring the correctness, QMapper improves the performance up to 42% in terms of execution time.
Partitioning RDF exploiting workload information BIBAFull-Text 213-214
  Rebeca Schroeder; Raqueline Penteado; Carmem Satie Hara
One approach to leverage scalable systems for RDF management is partitioning large datasets across distributed servers. In this paper we consider workload data, given in the form of query patterns and their frequencies, for determining how to partition RDF datasets. Our experimental study shows that our workload-aware method is an effective way to cluster related data and provides better query response times compared to an elementary fragmentation method.
Correlation discovery in web of things BIBAFull-Text 215-216
  Lina Yao; Quan Z. Sheng
With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, Web of Things (WoT) is gaining a considerable momentum as an emerging paradigm where billions of physical objects will be interconnected and present over the World Wide Web. One inevitable challenge in the new era of WoT lies in how to efficiently and effectively manage things, which is critical for a number of important applications such as object search, recommendation, and composition. In this paper, we propose a novel approach to discover the correlations of things by constructing a relational network of things (RNT) where similar things are linked via virtual edges according to their latent correlations by mining three dimensional information in the things usage events in terms of user, temporality and spatiality. With RNT, many problems centered around things management such as objects classification, discovery and recommendation can be solved by exploiting graph-based algorithms. We conducted experiments using real-world data collected over a period of four months to verify and evaluate our model and the results demonstrate the feasibility of our approach.
The atomic web browser BIBAFull-Text 217-218
  Cesare Pautasso; Masiar Babazadeh
The Atomic Web Browser achieves atomicity for distributed transactions across multiple RESTful APIs. Assuming that the participant APIs feature support for the Try-Confirm/Cancel pattern, the user may navigate with the Atomic Web Browser among multiple Web sites to perform local resource state transitions (e.g., reservations or bookings). Once the user indicates that the navigation has successfully completed, the Atomic Web browser takes care of confirming the local transitions to achieve the atomicity of the global transaction.
XML validation: looking backward -- strongly typed and flexible XML processing are not incompatible BIBAFull-Text 219-220
  Pierre Geneves; Nabil Layaida
One major concept in web development using XML is validation: checking whether some document instance fulfills structural constraints described by some schema. Over the last few years, there has been a growing debate about XML validation, and two main schools of thought emerged about the way it should be done. On the one hand, some advocate the use of validation with respect to complete grammar-based descriptions such as DTDs and XML Schemas. On the other hand, motivated by a need for greater flexibility, others argue for no validation at all, or prefer the use of lightweight constraint languages such as Schematron with the aim of validating only required constraints, while making schema descriptions more compositional and more reusable.
   Owing to a logical compilation, we show that validators used in each of these approaches share the same theoretical foundations, meaning that the two approaches are far from being incompatible. Our findings include that the logic in [2] can be seen as a unifying formal ground for the construction of robust and efficient validators and static analyzers using any of these schema description techniques. This reconciles the two approaches from both a theoretical and a practical perspective, therefore facilitating any combination of them.
Co-operative content adaptation framework: satisfying consumer and content creator in resource constrained browsing BIBAFull-Text 221-222
  Ayush Dubey; Pradipta De; Kuntal Dey; Sumit Mittal; Vikas Agarwal; Malolan Chetlur; Sougata Mukherjea
Mobile Web is characterized by two salient features, ubiquitous access to content and limited resources, like bandwidth and battery. Since most web pages are designed for the wired Internet, it is challenging to adapt the pages seamlessly to ensure a satisfactory mobile web experience. Content heavy web pages lead to longer load time on mobile browsers. Pre-defined load order of items in a page does not adapt to mobile browsing habits, where user looks for different snippets of a page to load under different contexts. Web content adaptation for mobile web has mainly focused on the user to define her preferences for content. We propose a framework where content creator is additionally included in guiding the adaptation. Allowing content creator to specify importance of items in a page also helps in factoring her incentives by pushing revenue generating content. We present mechanisms to enable cooperative content adaptation. Preliminary results show the efficacy of cooperative content adaptation in resource constrained mobile browsing scenario.

Posters: web mining

An effective class-centroid-based dimension reduction method for text classification BIBAFull-Text 223-224
  Guansong Pang; Huidong Jin; Shengyi Jiang
Motivated by the effectiveness of centroid-based text classification techniques, we propose a classification-oriented class-centroid-based dimension reduction (DR) method, called CentroidDR. Basically, CentroidDR projects high-dimensional documents into a low-dimensional space spanned by class centroids. On this class-centroid-based space, the centroid-based classifier essentially becomes CentroidDR plus a simple linear classifier. Other classification techniques, such as K-Nearest Neighbor (KNN) classifiers, can be used to replace the simple linear classifier to form much more effective text classification algorithms. Though CentroidDR is simple, non-parametric and runs in linear time, preliminary experimental results show that it can improve the accuracy of the classifiers and perform better than general DR methods such as Latent Semantic Indexing (LSI).
Harnessing web page directories for large-scale classification of tweets BIBAFull-Text 225-226
  Arkaitz Zubiaga; Heng Ji
Classification is paramount for an optimal processing of tweets, albeit performance of classifiers is hindered by the need of large sets of training data to encompass the diversity of contents one can find on Twitter. In this paper, we introduce an inexpensive way of labeling large sets of tweets, which can be easily regenerated or updated when needed. We use human-edited web page directories to infer categories from URLs contained in tweets. By experimenting with a large set of more than 5 million tweets categorized accordingly, we show that our proposed model for tweet classification can achieve 82% in accuracy, performing only 12.2% worse than for web page classification.
Scalable k-nearest neighbor graph construction based on greedy filtering BIBAFull-Text 227-228
  Youngki Park; Sungchan Park; Sang-goo Lee; Woosung Jung
K-Nearest Neighbor Graph (K-NNG) construction is a primitive operation in the field of Information Retrieval and Recommender Systems. However, existing approaches to K-NNG construction do not perform well as the number of nodes or dimensions scales up. In this paper, we present greedy filtering, an efficient and scalable algorithm for selecting the candidates for nearest neighbors by matching only the dimensions of large values. The experimental results show that our K-NNG construction scheme, based on greedy filtering, guarantees a high recall while also being 5 to 6 times faster than state-of-the-art algorithms for large, high-dimensional data.
Numeric query ranking approach BIBAFull-Text 229-230
  Jie Wu; Yi Liu; Ji-Rong Wen
We handle a special category of Web queries, queries containing numeric terms. We call them numeric queries. Motivated by some issues in ranking of numeric queries, we detect numeric sensitive queries by mining from retrieved documents using phrase operator. We also propose features based on numeric terms by extracting reliable numeric terms for each document. Finally, a ranking model is trained for numeric sensitive queries, combining proposed numeric-related features and traditional features. Experiments show that our model can significantly improve relevance for numeric sensitive queries.
Collaborative filtering meets next check-in location prediction BIBAFull-Text 231-232
  Defu Lian; Vincent W. Zheng; Xing Xie
With the increasing popularity of Location-based Social Networks, a vast amount of location check-ins have been accumulated. Though location prediction in terms of check-ins has been recently studied, the phenomena that users often check in novel locations has not been addressed. To this end, in this paper, we leveraged collaborative filtering techniques for check-in location prediction and proposed a short- and long-term preference model. We extensively evaluated it on two large-scale check-in datasets from Gowalla and Dianping with 6M and 1M check-ins, respectively, and showed that the proposed model can outperform the competing baselines.
TCRec: product recommendation via exploiting social-trust network and product category information BIBAFull-Text 233-234
  Yu Jiang; Jing Liu; Xi Zhang; Zechao Li; Hanqing Lu
In this paper, we develop a novel product recommendation method called TCRec, which takes advantage of consumer rating history record, social-trust network and product category information simultaneously. Compared experiments are conducted on two real-world datasets and outstanding performance is achieved, which demonstrates the effectiveness of TCRec.
Regional analysis of user interactions on social media in times of disaster BIBAFull-Text 235-236
  Takeshi Sakaki; Fujio Toriumi; Kosuke Shinoda; Kazuhiro Kazama; Satoshi Kurihara; Itsuki Noda; Yutaka Matsuo
Social media attract attention for sharing information, especially Twitter, which is now being used in times of disasters. In this paper, we perform regional analysis of user interactions on Twittter during the Great East Japan Earthquake and arrived at the following two conclusions:People diffused much more information after the earthquake, especially in the heavily-damaged areas; People communicated with nearby users but diffused information posted by distant users. We conclude that social media users changed their behavior to widely diffuse information.
Improving consensus clustering of texts using interactive feature selection BIBAFull-Text 237-238
  Ricardo M. Marcacini; Marcos A. Domingues; Solange O. Rezende
Consensus clustering and interactive feature selection are very useful methods to extract and manage knowledge from texts. While consensus clustering allows the aggregation of different clustering solutions into a single robust clustering solution, the interactive feature selection facilitates the incorporation of the users experience in text clustering tasks by selecting a set of high-level features. In this paper, we propose an approach to improve the robustness of consensus clustering using interactive feature selection. We have reported some experimental results on real-world datasets that show the effectiveness of our approach.

Big data & web applications demonstrations

Live migration of JavaScript web apps BIBAFull-Text 241-244
  James Lo; Eric Wohlstadter; Ali Mesbah
Due to the increasing complexity of web applications and emerging HTML5 standards, a large amount of runtime state is created and managed in the user's browser. While such complexity is desirable for user experience, it makes it hard for developers to implement mechanisms that provide users ubiquitous access to the data they create during application use. This work showcases Imagen, our implemented platform for browser session migration of JavaScript-based web applications. Session migration is the act of transferring a session between browsers at runtime. Without burden to developers, Imagen allows users to create a snapshot image that captures the runtime state needed to resume the session elsewhere. Our approach works completely in the JavaScript layer and we demonstrate that snapshots can be transferred between different browser vendors and hardware devices. The demo will illustrate our system's performance and interoperability using two HTML5 apps, four different browsers and three different devices.
Automated exploration and analysis of Ajax web applications with WebMole BIBAFull-Text 245-248
  Gabriel Le Breton; Fabien Maronnaud; Sylvain Hallé
WebMole is a browser-based tool that automatically and exhaustively explores all pages inside a web application. Contrarily to classical web crawlers, which only explore pages accessible through regular anchors, WebMole can find its way through Ajax applications that use JavaScript-triggered links, and handles state changes that do not involve a page reload. User-defined functions called oracles can be used to bound the range of pages explored by WebMole to specific parts of an application, as well as to evaluate Boolean test conditions on all visited pages. Overall, WebMole can prove a more flexible alternative to automated testing suites such as Selenium WebDriver.
Analyzing the suitability of web applications for a single-user to multi-user transformation BIBAFull-Text 249-252
  Matthias Heinrich; Franz Lehmann; Franz Josef Grüneberger; Thomas Springer; Martin Gaedke
Multi-user web applications like Google Docs or Etherpad are crucial to efficiently support collaborative work (e.g. jointly create texts, graphics, or presentations). Nevertheless, enhancing single-user web applications with multi-user capabilities (i.e. document synchronization and conflict resolution) is a time-consuming and intricate task since traditional approaches adopting concurrency control libraries (e.g. Apache Wave) require numerous scattered source code changes. Therefore, we devised the Generic Collaboration Infrastructure (GCI) [8] that is capable of converting single-user web applications non-invasively into collaborative ones, i.e. no source code changes are required. In this paper, we present a catalog of vital application properties that allows determining if a web application is suitable for a GCI transformation. On the basis of the introduced catalog, we analyze 12 single-user web applications and show that 6 are eligible for a GCI transformation. Moreover, we demonstrate (1) the transformation of one qualified application, namely, the prominent text editor TinyMCE, and (2) showcase the resulting multi-user capabilities. Both demo parts are illustrated in a dedicated screencast that is available at http://vsr.informatik.tu-chemnitz.de/demo/TinyMCE/.
Crowdsourcing MapReduce: JSMapReduce BIBAFull-Text 253-256
  Philipp Langhans; Christoph Wieser; François Bry
JSMapReduce is an implementation of MapReduce which exploits the computing power available in the computers of the users of a web platform by giving tasks to the JavaScript engines of their web browsers. This article describes the implementation of JSMapReduce exploiting HTML 5 features, the heuristics it uses for distributing tasks to workers, and reports on an experimental evaluation of JSMapReduce.
Large-scale social-media analytics on stratosphere BIBAFull-Text 257-260
  Christoph Boden; Marcel Karnstedt; Miriam Fernandez; Volker Markl
The importance of social-media platforms and online communities -- in business as well as public context -- is more and more acknowledged and appreciated by industry and researchers alike. Consequently, a wide range of analytics has been proposed to understand, steer, and exploit the mechanics and laws driving their functionality and creating the resulting benefits. However, analysts usually face significant problems in scaling existing and novel approaches to match the data volume and size of modern online communities. In this work, we propose and demonstrate the usage of the massively parallel data processing system Stratosphere, based on second order functions as an extended notion of the MapReduce paradigm, to provide a new level of scalability to such social-media analytics. Based on the popular example of role analysis, we present and illustrate how this massively parallel approach can be leveraged to scale out complex data-mining tasks, while providing a programming approach that eases the formulation of complete analytical workflows.
Optimizing RDF(S) queries on cloud platforms BIBAFull-Text 261-264
  HyeongSik Kim; Padmashree Ravindra; Kemafor Anyanwu
Scalable processing of Semantic Web queries has become a critical need given the rapid upward trend in availability of Semantic Web data. The MapReduce paradigm is emerging as a platform of choice for large scale data processing and analytics due to its ease of use, cost effectiveness, and potential for unlimited scaling. Processing queries on Semantic Web triple models is a challenge on the mainstream MapReduce platform called Apache Hadoop, and its extensions such as Pig and Hive. This is because such queries require numerous joins which leads to lengthy and expensive MapReduce workflows. Further, in this paradigm, cloud resources are acquired on demand and the traditional join optimization machinery such as statistics and indexes are often absent or not easily supported.
   In this demonstration, we will present RAPID+, an extended Apache Pig system that uses an algebraic approach for optimizing queries on RDF data models including queries involving inferencing. The basic idea is that by using logical and physical operators that are more natural to MapReduce processing, we can reinterpret such queries in a way that leads to more concise execution workflows and small intermediate data footprints that minimize disk I/Os and network transfer overhead. RAPID+ evaluates queries using the Nested TripleGroup Data Model and Algebra (NTGA). The demo will show comparative performance of NTGA query plans vs. relational algebra-like query plans used by Apache Pig and Hive.
TagVisor: extending web pages with interaction events to support presentation in digital signage BIBAFull-Text 265-268
  Marcio dos Santos Galli; Eduardo Pezutti Beletato Santos
New interaction experiences are fundamentally changing the way we interact with the web. Emerging touch-based devices and a variety of web-connected appliances represents challenges that prevents the seamless reach of web resources originally tailored for the standard browser experience. This paper explores how web pages can be re-purposed and become interactive presentations that effectively supports communication in scenarios such as digital signage and other presentation use cases. We will cover the TagVisor project which is a JavaScript run-time that uses modern animation effects and provides an HTML5 extension approach to support the authoring of visual narratives using plain web pages.
Complementary assistance mechanisms for end user mashup composition BIBAFull-Text 269-272
  Soudip Roy Chowdhury; Olexiy Chudnovskyy; Matthias Niederhausen; Stefan Pietschmann; Paul Sharples; Florian Daniel; Martin Gaedke
Despite several efforts for simplifying the composition process, learning efforts required for using existing mashup editors to develop mashups remain still high. In this paper, we describe how this barrier can be lowered by means of an assisted development approach that seamlessly integrates automatic composition and interactive pattern recommendation techniques into existing mashup platforms for supporting easy mashup development by end users. We showcase the use of such an assisted development environment in the context of an open-source mashup platform Apache Rave. Results of our user studies demonstrate the benefits of our approach for end user mashup development.

Social media, crowdsourcing & services demonstrations

uTrack: track yourself! monitoring information on online social media BIBAFull-Text 273-276
  Tiago Rodrigues; Prateek Dewan; Ponnurangam Kumaraguru; Raquel Melo Minardi; Virgílio Almeida
The past one decade has witnessed an astounding outburst in the number of online social media (OSM) services, and a lot of these services have enthralled millions of users across the globe. With such tremendous number of users, the amount of content being generated and shared on OSM services is also enormous. As a result, trying to visualize all this overwhelming amount of content, and gain useful insights from it has become a challenge. In this work, we present uTrack, a personalized web service to analyze and visualize the diffusion of content shared by users across multiple OSM platforms. To the best of our knowledge, there exists no work which concentrates on monitoring information diffusion for personal accounts. Currently, uTrack monitors and supports logging in from Facebook, Twitter, and Google+. Once granted permissions by the user, uTrack monitors all URLs (like videos, photos, news articles) the user has shared in all OSM services supported, and generates useful visualizations and statistics from the collected data.
DFT-extractor: a system to extract domain-specific faceted taxonomies from wikipedia BIBAFull-Text 277-280
  Bifan Wei; Jun Liu; Jian Ma; Qinghua Zheng; Wei Zhang; Boqin Feng
Extracting faceted taxonomies from the Web has received increasing attention in recent years from the web mining community. We demonstrate in this study a novel system called DFT-Extractor, which automatically constructs domain-specific faceted taxonomies from Wikipedia in three steps: 1) It crawls domain terms from Wikipedia by using a modified topical crawler. 2) Then it exploits a classification model to extract hyponym relations with the use of motif-based features. 3) Finally, it constructs a faceted taxonomy by applying a community detection algorithm and a group of heuristic rules. DFT-Extractor also provides a graphical user interface to visualize the learned hyponym relations and the tree structure of taxonomies.
Temporal summarization of event-related updates in wikipedia BIBAFull-Text 281-284
  Mihai Georgescu; Dang Duc Pham; Nattiya Kanhabua; Sergej Zerr; Stefan Siersdorfer; Wolfgang Nejdl
Wikipedia is a free multilingual online encyclopedia covering a wide range of general and specific knowledge. Its content is continuously maintained up-to-date and extended by a supporting community. In many cases, real-world events influence the collaborative editing of Wikipedia articles of the involved or affected entities. In this paper, we present Wikipedia Event Reporter, a web-based system that supports the entity-centric, temporal analytics of event-related information in Wikipedia by analyzing the whole history of article updates. For a given entity, the system first identifies peaks of update activities for the entity using burst detection and automatically extracts event-related updates using a machine-learning approach. Further, the system determines distinct events through the clustering of updates by exploiting different types of information such as update time, textual similarity, and the position of the updates within an article. Finally, the system generates the meaningful temporal summarization of event-related updates and automatically annotates the identified events in a timeline.
Live topic generation from event streams BIBAFull-Text 285-288
  Vuk Milicic; Giuseppe Rizzo; José Luis Redondo Garcia; Raphaël Troncy; Thomas Steiner
Social platforms constantly record streams of heterogeneous data about human's activities, feelings, emotions and conversations opening a window to the world in real-time. Trends can be computed but making sense out of them is an extremely challenging task due to the heterogeneity of the data and its dynamics making often short-lived phenomena. We develop a framework which collects microposts shared on social platforms that contain media items as a result of a query, for example a trending event. It automatically creates different visual storyboards that reflect what users have shared about this particular event. More precisely it leverages on: (i) visual features from media items for near-deduplication, and (ii) textual features from status updates to interpret, cluster, and visualize media items. A screencast showing an example of these functionalities is published at: http://youtu.be/8iRiwz7cDYY while the prototype is publicly available at http://mediafinder.eurecom.fr.
Serefind: a social networking website for classifieds BIBAFull-Text 289-292
  Pramod Verma
This paper presents the design and implementation of a social networking website for classifieds, called Serefind. We designed search interfaces with focus on security, privacy, usability, design, ranking, and communications. We deployed this site at the Johns Hopkins University, and the results show it can be used as a self-sustaining classifieds site for public or private communities.
MASFA: mass-collaborative faceted search for online communities BIBAFull-Text 293-296
  Seth B. Cleveland; Byron J. Gao
Faceted search combines faceted navigation with direct keyword search, providing exploratory search capacities allowing progressive query refinement. It has become the de facto standard for e-commerce and product-related websites such as amazon.com and ebay.com. However, faceted search has not been effectively incorporated into non-commercial online community portals such as craigslist.org. This is mainly because unlike keyword search, faceted search systems require metadata that constantly evolve, making them very costly to build and maintain. In this paper, we propose a framework MASFA that utilizes a set of non-domain-specific techniques to build and maintain effective, portable, and cost-free faceted search systems in a mass-collaborative manner. We have implemented and deployed the framework on selected categories of Craigslist to demonstrate its utility.
ALFRED: crowd assisted data extraction BIBAFull-Text 297-300
  Valter Crescenzi; Paolo Merialdo; Disheng Qiu
The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches, but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks demanded to these platforms should be extremely simple, to be performed by non-expert people, and their number should be minimized, to contain the costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. ALFRED includes several original features: it automatically selects a representative sample set from the input collection of pages; in order to minimize the wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and it adopts an active learning algorithm to select the queries posed to the crowd; it is able to manage inaccurate answers that can be provided by the workers engaged by crowdsourcing platforms.
SHERLOCK: a system for location-based services in wireless environments using semantics BIBAFull-Text 301-304
  Roberto Yus; Eduardo Mena; Sergio Ilarri; Arantza Illarramendi
Nowadays people are exposed to huge amounts of information that are generated continuously. However, current mobile applications, Web pages, and Location-Based Services (LBSs) are designed for specific scenarios and goals. In this demo we show the system SHERLOCK, which searches and shares up-to-date knowledge from nearby devices to relieve the user from knowing and managing such knowledge directly. Besides, the system guides the user in the process of selecting the service that best fits his/her needs in the given context.

Rich media, information extraction, & search demonstrations

Tailored news in the palm of your hand: a multi-perspective transparent approach to news recommendation BIBAFull-Text 305-308
  Mozhgan Tavakolifard; Jon Atle Gulla; Kevin C. Almeroth; Jon Espen Ingvaldesn; Gaute Nygreen; Erik Berg
Mobile news recommender systems help users retrieve news that is relevant in their particular context and can be presented in ways that require minimal user interaction. In spite of the availability of contextual information about mobile users, though, current mobile news applications employ rather simple strategies for news recommendation. Our multi-perspective approach unifies temporal, locational, and preferential information to provide a more fine-grained recommendation strategy. This demo paper presents the implementation of our solution to efficiently recommend specific news articles from a large corpus of newly-published press releases in a way that closely matches a reader's reading preferences.
Connected media experiences: web based interactive video using linked data BIBAFull-Text 309-312
  Lyndon Nixon; Matthias Bauer; Cristian Bara
This demo submission presents a set of tools and an extended framework with API for enabling the semantically empowered enrichment of online video with Web content. As audiovisual media is increasingly transmitted online, new services deriving added value from such material can be imagined. For example, combining it with other material elsewhere on the Web which is related to it or enhances it in a meaningful way, to the benefit of the owner of the original content, the providers of the content enhancing it and the end consumer who can access and interact with these new services. Since the services are built around providing new experiences through connecting different related media together, we consider such services to be Connected Media Experiences (ConnectME). This paper presents a toolset for ConnectME -- an online annotation tool for video and a HTML5-based enriched video player -- as well as the ConnectME framework which enables these media experiences to be generated on the server side with semantic technology.
Radialize: a tool for social listening experience on the web based on radio station programs BIBAFull-Text 313-316
  Álvaro R., Jr. Pereira; Diego Dutra; Milton, Jr. Stiilpen; Alex Amorim Dutra; Felipe Martins Melo; Paulo H. C. Mendonça; Ângelo Magno de Jesus; Kledilson Ferreira
Radialize represents a service for listening to music and radio programs through the web. The service allows the discovery of the content being played by radio stations on the web, either by managing explicit information made available by those stations or by means of our technology for automatic recognition of audio content in a stream. Radialize then offers a service in which the user can search, be recommended, and provide feedback on artists and songs being played in traditional radio stations, either explicitly or implicitly, in order to compose an individual profile. The recommender system utilizes every user interaction as a data source, as well as the similarity abstraction extracted out of the radios' musical programs, making use of the wisdom of crowds implicitly present in the radio programs.
FANS: face annotation by searching large-scale web facial images BIBAFull-Text 317-320
  Steven C. H. Hoi; Dayong Wang; I. Yeu Cheng; Elmer Weijie Lin; Jianke Zhu; Ying He; Chunyan Miao
Auto face annotation is an important technique for many real-world applications, such as online photo album management, new video summarization, and so on. It aims to automatically detect human faces from a photo image and further name the faces with the corresponding human names. Recently, mining web facial images on the internet has emerged as a promising paradigm towards auto face annotation. In this paper, we present a demonstration system of search-based face annotation: FANS -- Face ANnotation by Searching large-scale web facial images. Given a query facial image for annotation, we first retrieve a short list of the most similar facial images from a web facial image database, and then annotate the query facial image by mining the top-ranking facial images and their corresponding labels with sparse representation techniques. Our demo system was built upon a large-scale real-world web facial image database with a total of 6,025 persons and about 1 million facial images. This paper demonstrates the potential of searching and mining web-scale weakly labeled facial images on the internet to tackle the challenging face annotation problem, and addresses some open problems for future exploration by researchers in web community. The live demo of FANS is available online at http://msm.cais.ntu.edu.sg/FANS/.
Search the past with the Portuguese web archive BIBAFull-Text 321-324
  Daniel Gomes; David Cruz; João Miranda; Miguel Costa; Simão Fontes
The web was invented to quickly exchange data between scientists, but it became a crucial communication tool to connect the world. However, the web is extremely ephemeral. Most of the information published online becomes quickly unavailable and is lost forever. There are several initiatives worldwide that struggle to archive information from the web before it vanishes. However, search mechanisms to access this information are still limited and do not satisfy their users who demand performance similar to live-web search engines.
   This demo presents the Portuguese Web Archive, which enables search over 1.2 billion files archived from 1996 to 2012. It is the largest full-text searchable web archive publicly available [17]. The software developed to support this service is also publicly available as a free open source project at Google Code, so that it can be reused and enhanced by other web archivists. A short video about the Portuguese Web Archive is available at vimeo.com/59507267. The service can be tried live at archive.pt.
Inside YOGO2s: a transparent information extraction architecture BIBAFull-Text 325-328
  Joanna Biega; Erdal Kuzey; Fabian M. Suchanek
YAGO [9, 6] is one of the largest public ontologies constructed by information extraction. In a recent refactoring called YAGO2s, the system has been given a modular and completely transparent architecture. In this demo, users can see how more than 30 individual modules of YAGO work in parallel to extract facts, to check facts for their correctness, to deduce facts, and to merge facts from different sources. A GUI allows users to play with different input files, to trace the provenance of individual facts to their sources, to change deduction rules, and to run individual extractors. Users can see step by step how the extractors work together to combine the individual facts to the coherent whole of the YAGO ontology.
SPARQL2NL: verbalizing sparql queries BIBAFull-Text 329-332
  Axel-Cyrille Ngonga Ngomo; Lorenz Bühmann; Christina Unger; Jens Lehmann; Daniel Gerber
Linked Data technologies are now being employed by a large number of applications. While experts can query the backend of these applications using the standard query language SPARQL, most lay users lack the expertise necessary to proficiently interact with these applications. Consequently, non-expert users usually have to rely on forms, query builders, question answering or keyword search tools to access RDF data. Yet, these tools are usually unable to make the meaning of the queries they generate plain to lay users, making it difficult for these users to i) assess the correctness of the query generated out of their input, and ii) to adapt their queries or iii) to choose in an informed manner between possible interpretations of their input.
   We present SPARQL2NL, a generic approach that allows verbalizing SPARQL queries, i.e., converting them into natural language. In addition to generating verbalizations, our approach can also explain the output of queries by providing a natural-language description of the reasons that led to each element of the result set being selected. Our evaluation of SPARQL2NL within a large-scale user survey shows that SPARQL2NL generates complete and easily understandable natural language descriptions. In addition, our results suggest that even SPARQL experts can process the natural language representation of SPARQL queries computed by our approach more efficiently than the corresponding SPARQL queries. Moreover, non-experts are enabled to reliably understand the content of SPARQL queries. Within the demo, we present the results generated by our approach on arbitrary questions to the DBpedia and MusicBrainz datasets. Moreover, we present how our framework can be used to explain results of SPARQL queries in natural language.
G-path: flexible path pattern query on large graphs BIBAFull-Text 333-336
  Yiyuan Bai; Chaokun Wang; Yuanchi Ning; Hanzhao Wu; Hao Wang
With the socialization trend of web sites and applications, the techniques of effective management of graph-structured data have become one of the most important modern web technologies. In this paper, we present a system of path query on large graphs, known as G-Path. Based on Hadoop distributed framework and bulk synchronized parallel model, the system can process generic queries without preprocessing or building indices. To demonstrate the system, we developed a web-based application which allows searching entities and relationships on a large social network, e.g., DBLP publication network or Twitter dataset. With the flexibility of G-Path, the application is able to handle different kinds of queries. For example, a user may want to search for a publication graph of an author while another user may want to search for all publications of the author's co-authors. All these queries can be done by an interactive user interface and the results will be shown in a visual graph.

Doctoral consortium

Mockup driven web development BIBAFull-Text 337-342
  Edward Benson
Dynamic web development still borrows heavily from its origins in CGI scripts: modern web applications are largely designed and developed as programs that happen to output HTML. This thesis proposes to investigate the idea taking a mockup-centric approach instead, in which self-contained, full page web mockups are the central artifact driving the application development process. In some cases, these mockups are sufficient to infer the dynamic application structure completely.
   This approach to mockup driven development is made possible by the development of a language the thesis develops, called Cascading Tree Sheets (CTS), that enables a mockup to be annotated with enough information so that many common web development tasks and workflows can be eliminated or vastly simplified. CTS describes and encapsulates a web page's design structure the same way CSS describes its styles. This enables mockups to serve as the input of a web application rather than simply a design artifact. Using this capability, I will study the feasibility and usability of mockup driven development for a range of novice and expert authorship tasks. The thesis aims to finish by demonstrating that the functionality of a domain-specific content management system can be inferred automatically from site mockups.
Structured summarization for news events BIBAFull-Text 343-348
  Giang Binh Tran
Helping users to understand the news is an acute problem nowadays as the users are struggling to keep up with tremendous amount of information published every day in the Internet. In this research, we focus on modelling the content of news events by their semantic relations with other events, and generating structured summarization.
Multimedia information retrieval on the social web BIBAFull-Text 349-354
  Teresa Bracamonte
Efforts have been made to obtain more accurate results for multimedia searches on the Web. Nevertheless, not all multimedia objects have related text descriptions available. This makes bridging the semantic gap more difficult. Approaches that combine context and content information of multimedia objects are the most popular for indexing and later retrieving these objects. However, scaling these techniques to Web environments is still an open problem. In this thesis, we propose the use of user-generated content (UGC) from the Web and social platforms as well as multimedia content information to describe the context of multimedia objects. We aim to design tag-oriented algorithms to automatically tag multimedia objects, filter irrelevant tags, and cluster tags in semantically-related groups. The novelty of our proposal is centered on the design of Web-scalable algorithms that enrich multimedia context using the social information provided by users as a result of their interaction with multimedia objects. We validate the results of our proposal with a large-scale evaluation in crowdsourcing platforms.
Effective analysis, characterization, and detection of malicious web pages BIBAFull-Text 355-360
  Birhanu Eshete
The steady evolution of the Web has paved the way for miscreants to take advantage of vulnerabilities to embed malicious content into web pages. Up on a visit, malicious web pages steal sensitive data, redirect victims to other malicious targets, or cease control of victim's system to mount future attacks. Approaches to detect malicious web pages have been reactively effective at special classes of attacks like drive-by-downloads. However, the prevalence and complexity of attacks by malicious web pages is still worrisome. The main challenges in this problem domain are (1) fine-grained capturing and characterization of attack payloads (2) evolution of web page artifacts and (3) exibility and scalability of detection techniques with a fast-changing threat landscape. To this end, we proposed a holistic approach that leverages static analysis, dynamic analysis, machine learning, and evolutionary searching and optimization to effectively analyze and detect malicious web pages. We do so by: introducing novel features to capture fine-grained snapshot of malicious web pages, holistic characterization of malicious web pages, and application of evolutionary techniques to fine-tune learning-based detection models pertinent to evolution of attack payloads. In this paper, we present key intuition and details of our approach, results obtained so far, and future work.
Identifying, understanding and detecting recurring, harmful behavior patterns in collaborative Wikipedia editing: doctoral proposal BIBAFull-Text 361-366
  Fabian Flöck
In this doctoral proposal, we describe an approach to identify recurring, collective behavioral mechanisms in the collaborative interactions of Wikipedia editors that have the potential to undermine the ideals of quality, neutrality and completeness of article content. We outline how we plan to parametrize these patterns in order to understand their emergence and evolution and measure their effective impact on content production in Wikipedia. On top of these results we intend to build end-user tools to increase the transparency of the evolution of articles and equip editors with more elaborated quality monitors. We also sketch out our evaluation plans and report on already accomplished tasks.
Ontology based feature level opinion mining for Portuguese reviews BIBAFull-Text 367-370
  Larissa A. Freitas; Renata Vieira
This paper presents a thesis whose goal is to propose and evaluate methods to identify polarity in Portuguese user generated reviews according to features described in domain ontologies (experiments will consider movie and hotel ontologies Movie Ontology1 and Hontology2).
A machine-to-machine architecture to merge semantic sensor measurements BIBAFull-Text 371-376
  Amelie Gyrard
The emerging field Machine-to-Machine (M2M) enables machines to communicate with each other without human intervention. Existing semantic sensor networks are domain-specific and add semantics to the context. We design a Machine-to-Machine (M2M) architecture to merge heterogeneous sensor networks and we propose to add semantics to the measured data rather than to the context. This architecture enables to: (1) get sensor measurements, (2) enrich sensor measurements with semantic web technologies, domain ontologies and the Link Open Data, and (3) reason on these semantic measurements with semantic tools, machine learning algorithms and recommender systems to provide promising applications.
Deep web entity monitoring BIBFull-Text 377-382
  Mohamamdreza Khelghati; Djoerd Hiemstra; Maurice Van Keulen
Context mining and integration into predictive web analytics BIBAFull-Text 383-388
  Julia Kiseleva
Predictive Web Analytics is aimed at understanding behavioural patterns of users of various web-based applications: e-commerce, ubiquitous and mobile computing, and computational advertising. Within these applications business decisions often rely on two types of predictions: an overall or particular user segment demand predictions and individualised recommendations for visitors. Visitor behaviour is inherently sensitive to the context, which can be defined as a collection of external factors. Context-awareness allows integrating external explanatory information into the learning process and adapting user behaviour accordingly. The importance of context-awareness has been recognised by researchers and practitioners in many disciplines, including recommendation systems, information retrieval, personalisation, data mining, and marketing. We focus on studying ways of context discovery and its integration into predictive analytics.
A proximity-based fallback model for hybrid web recommender systems BIBAFull-Text 389-394
  Jaeseok Myung
Although there are numerous websites that provide recommendation services for various items such as movies, music, and books, most of studies on recommender systems only focus on one specific item type. As recommender sites expand to cover several types of items, though, it is important to build a hybrid web recommender system that can handle multiple types of items.
   The switch hybrid recommender model provides a solution to this problem by choosing an appropriate recommender system according to given selection criteria, thereby facilitating cross-domain recommendations supported by individual recommender systems. This paper seeks to answer the question of how to deal with situations where no appropriate recommender system exists to deal with a required type of item. In such cases, the switch model cannot generate recommendation results, leading to the need for a fallback model that can satisfy most users most of the time.
   Our fallback model exploits a graph-based proximity search, ranking every entity on the graph according to a given proximity measure. We study how to incorporate the fallback model into the switch model, and propose a general architecture and simple algorithms for implementing these ideas. Finally, we present the results of our research result and discuss remaining challenges and possibilities for future research.
Analyzing linguistic structure of web search queries BIBAFull-Text 395-400
  Rishiraj Saha Roy
It is believed that Web search queries are becoming more structurally complex over time. However, there has been no systematic study that quantifies such characteristics. In this thesis, we propose that queries are evolving into a unique linguistic system. We demonstrate proof of this hypothesis by examining the structure of Web queries by applying well-established techniques from natural language understanding. Preliminary results of these experiments show quantitative and qualitative proof that queries are not just some form of text between random sequences of words and natural language -- they have distinct properties of their own.
Understanding and analysing microblogs BIBAFull-Text 401-406
  Pinar Yanardag Delul
Microblogging is a form of blogging where posts typically consist of short content such as quick comments, phrases, URLs, or media, like images and videos. Because of the fast and compact nature of microblogs, users have adopted them for novel purposes, including sharing personal updates, spreading breaking news, promoting political views, marketing and tracking real time events. Thus, finding relevant information sources out of the rapidly growing content is an essential task.
   In this paper, we study the problem of understanding and analysing microblogs. We present a novel 2-stage framework to find potentially relevant content by extracting topics from the tweets and by taking advantage of submodularity.LILE2013 Welcome and organization

LILE'13 keynote talk

Linking data in and outside a scientific publishing house BIBAFull-Text 411-412
  Sweitze Roffel
Publishing has undergone many changes since the 1960's, often driven by rapid technological development. Technology impacts the creation and dissemination of knowledge only to a certain extent, and in this talk I'll try to give a publisher's perspective of some technological drivers impacting Academic publishing today, and how the many actors involved are learning to cooperate as well as compete in an increasingly distributed environment to better turn information into knowledge. Technically, organizationally, and with regard to shared standards and infrastructure.
   Publishing has been called many different things by many different people. A simple definition could be that publishing is 'organizing content', so the focus of this talk will be on Elsevier's current use of Linked Data & Semantic technology in organizing scientific content, including some early lessons learned.
   This view from a publisher aims to help the discussion on how we can all contribute to better disseminate and promote the enormous creativity made through core research contributions.

LILE'13 session 1

Exploring student predictive model that relies on institutional databases and open data instead of traditional questionnaires BIBAFull-Text 413-418
  Farhana Sarker; Thanassis Tiropanis; Hugh C. Davis
Research in student retention and progression to completion is traditionally survey-based, where researchers collect data through questionnaires and interviewing students. The major issues with survey-based study are the potentially low response rates and cost. Nevertheless, a large number of datasets that could inform the questions that students are explicitly asked in surveys is commonly available in the external open datasets. This paper describes a new student predictive model for student progression that relies on the data available in institutional internal databases and external open data, without the need for surveys. The results of empirical study for undergraduate students in their first year of study shows that this model can perform as well as or even out-perform traditional survey-based ones.
Towards integration of web data into a coherent educational data graph BIBAFull-Text 419-424
  Davide Taibi; Besnik Fetahu; Stefan Dietze
Personalisation, adaptation and recommendation are central aims of Technology Enhanced Learning (TEL) environments. In this context, information retrieval and clustering techniques are more and more often applied to filter and deliver learning resources according to user preferences and requirements. However, the suitability and scope of possible recommendations is fundamentally dependent on the available data, such as metadata about learning resources as well as users. However, quantity and quality of both is still limited. On the other hand, throughout the last years, the Linked Data (LD) movement has succeeded to provide a vast body of well-interlinked and publicly accessible Web data. This in particular includes Linked Data of explicit or implicit educational nature. In this paper, we propose a large-scale educational dataset which has been generated by exploiting Linked Data methods together with clustering and interlinking techniques to extract import and interlink a wide range of educationally relevant data. We also introduce a set of reusable techniques which were developed to realise scalable integration and alignment of Web data in educational settings.
Finding relevant missing references in learning courses BIBAFull-Text 425-430
  Patrick Siehndel; Ricardo Kawase; Asmelash Teka Hadgu; Eelco Herder
Reference sites play an increasingly important role in learning processes. Teachers use these sites in order to identify topics that should be covered by a course or a lecture. Learners visit online encyclopedias and dictionaries to find alternative explanations of concepts, to learn more about a topic, or to better understand the context of a concept. Ideally, a course or lecture should cover all key concepts of the topic that it encompasses, but often time constraints prevent complete coverage. In this paper, we propose an approach to identify missing references and key concepts in a corpus of educational lectures. For this purpose, we link concepts in educational material to the organizational and linking structure of Wikipedia. Identifying missing resources enables learners to improve their understanding of a topic, and allows teachers to investigate whether their learning material covers all necessary concepts.

LILE'13 session 2

Interactive learning resources and linked data for online scientific experimentation BIBAFull-Text 431-434
  Alexander Mikroyannidis; John Domingue
There is currently a huge potential for eLearning in several new online learning initiatives like Massive Open Online Courses (MOOCs) and Open Educational Resources (OERs). These initiatives enable learners to self-regulate their learning by providing them with an abundant amount of free learning materials of high quality. This paper presents FORGE, a new European initiative for online learning using Future Internet Research and Experimentation (FIRE) facilities. FORGE is a step towards turning FIRE into a pan-European educational platform for Future Internet through Linked Data. This will benefit learners and educators by giving them both access to world class facilities in order to carry out experiments on e.g. new internet protocols. In turn, this supports constructivist and self-regulated learning approaches, through the use of interactive learning resources, such as eBooks.
Learning from quizzes using intelligent learning companions BIBAFull-Text 435-438
  Danica Damljanovic; David Miller; Daniel O'Sullivan
It is widely recognised that engaging games can have a profound impact on learning. Integrating a conversational Artificial Intelligence (AI) into the mix makes the experience of learning even more engaging and enriching. In this paper we describe a conversational agent which is built with the purpose of acting as a personal tutor. The tutor can prompt, question, stimulate and guide a learner and then adapt exercises and challenges to specific needs. We illustrate how automatic generation of quizzes can be used to build learning exercises and activities.
Linked data selectors BIBAFull-Text 439-444
  Kai Michael Höver; Max Mühlhäuser
In the world of Linked Data, HTTP URIs are names. A URI is dereferenced to obtain a copy or description of the referred resource. If only a fragment of a resource should be referred, pointing to the whole resource is not sufficient. Therefore, it is necessary to be able to refer to fragments of resources, and to name them with URIs to interlink them in the Web of Data. This is especially helpful in the educational context where learning processes including discussion and social interaction demand for exact references and granular selections of media. This paper presents the specification of Linked Data Selectors, an OWL ontology for describing dereferenceable fragments of Web resources.
OpenScout: harvesting business and management learning objects from the web of data BIBAFull-Text 445-450
  Ricardo Kawase; Marco Fisichella; Katja Niemann; Vassilis Pitsilis; Aristides Vidalis; Philipp Holtkamp; Bernardo Nunes
Already existing open educational resources in the field of Business and Management have a high potential for enterprises to address the increasing training needs of their employees. However, it is difficult to act on OERs as some data is hidden. In the meanwhile, numerous repositories provide Linked Open Data on this field. Though, users have to search a number of repositories with heterogeneous interfaces in order to retrieve the desired content. In this paper, we present the strategies to gather heterogeneous learning objects from the Web of Data, and we provide an overview of the benefits of the OpenScout platform. Despite the fact that not all data repositories strictly follow Linked Data principles, OpenScout addressed individual variations in order to harvest, align, and provide a single end-point. In the end, OpenScout provides a full-fledged environment that leverages on the Linked Open Data available on the Web and additionally exposes it in an homogeneous format.LiME'13 Welcome and organization

LIME'13 keynote talk

The importance of linked media to the future web: lime 2013 keynote talk -- a proposal for the linked media research agenda BIBAFull-Text 455-456
  Lyndon Nixon
If the future Web will be able to fully leverage the scale and quality of online media, a Web scale layer of structured, interlinked media annotations is needed, which we will call Linked Media, inspired by the Linked Data movement for making structured, interlinked descriptions of resources better available online. Mobile and tablet devices, as well as connected TVs, introduce novel application domains that will benefit from broad understanding and acceptance of Linked Media standards. In the keynote, I will provide an overview of current practices and specification efforts in the domain of video and Web content integration, drawing from the LinkedTV1 and MediaMixer2 projects. From this, I will present a vision for a Linked Media layer on the future Web will can empower new media-centric applications in a world of ubiquitous online multimedia.

LIME'13 technical presentations

Linking inside a video collection: what and how to measure? BIBAFull-Text 457-460
  Robin Aly; Roeland J. F. Ordelman; Maria Eskevich; Gareth J. F. Jones; Shu Chen
Although linking video to additional information sources seems to be a sensible approach to satisfy information needs of user, the perspective of users is not yet analyzed on a fundamental level in real-life scenarios. However, a better understanding of the motivation of users to follow links in video, which anchors users prefer to link from within a video, and what type of link targets users are typically interested in, is important to be able to model automatic linking of audiovisual content appropriately. In this paper we report on our methodology towards eliciting user requirements with respect to video linking in the course of a broader study on user requirements in searching and a series of benchmark evaluations on searching and linking.
Using explicit discourse rules to guide video enrichment BIBAFull-Text 461-464
  Michiel Hildebrand; Lynda Hardman
Video content analysis and named entity extraction are increasingly used to automatically generate content annotations for TV programs. A potential use of these annotations is to provide an entry point to background information that users can consume on a second screen. Automatic enrichments are, however, meaningless when it is unclear to the user what they can do with them and why they would. We propose to contextualize the annotations by an explicit representation of discourse in the form of scene templates. Through content rules these templates are populated with the relevant annotations. We illustrate this idea with an example video and annotations generated in the LinkedTV1 project.
Second screen interaction: an approach to infer tv watcher's interest using 3d head pose estimation BIBAFull-Text 465-468
  Julien Leroy; François Rocca; Matei Mancas; Bernard Gosselin
In this paper, we present our "work-in-progress" approach to implicitly track user interaction and infer the interest a user can have for TV media. The aim is to identify moments of attentive focus, noninvasively and continuously, to dynamically improve the user profile by detecting which annotated media have drawn the user attention. Our method is based on the detection and estimation of face pose in 3D using a consumer depth camera. This allows us to determine when a user is or not looking at his television. This study is realized in the scenario of second screen interaction (tablet, smartphone), a behavior that has become common for spectators. We present our progress on the system and its integration in the LinkedTV project.
Enriching media fragments with named entities for video classification BIBAFull-Text 469-476
  Yunjia Li; Giuseppe Rizzo; José Luis Redondo García; Raphaël Troncy; Mike Wald; Gary Wills
With the steady increase of videos published on media sharing platforms such as Dailymotion and YouTube, more and more efforts are spent to automatically annotate and organize these videos. In this paper, we propose a framework for classifying video items using both textual features such as named entities extracted from subtitles, and temporal features such as the duration of the media fragments where particular entities are spotted. We implement four automatic machine learning algorithms for multiclass classification problems, namely Logistic Regression (LG), K-Nearest Neighbour (KNN), Naive Bayes (NB) and Support Vector Machine (SVM). We study the temporal distribution patterns of named entities extracted from 805 Dailymotion videos. The results show that the best performance using the entity distribution is obtained with KNN (overall accuracy of 46.58%) while the best performance using the temporal distribution of named entities for each type is obtained with SVM (overall accuracy of 43.60%). We conclude that this approach is promising for automatically classifying online videos.

LIME'13 demonstrations

DataConf: enriching conference publications with a mobile mashup application BIBAFull-Text 477-478
  Lionel Médini; Florian Bâcle; Hoang Duy Tan Nguyen
This paper describes a mobile Web application that allows browsing conference publications, their authors, authors' organizations, and even authors' other publications or publications related to the same keywords. It queries a main SPARQL endpoint that serves the conference metadata set, as well as other endpoints to enrich and explore data. It provides extra functions, such as flashing a publication QR code from the Web browser, accessing external resources about the publications, and it can be linked to external Web services. This application exploits the Linked Data paradigm and performs client-side reasoning. It follows recent W3C technical advances and as a mashup, requires few server resources. It can easily be deployed for any conference with available metadata on the Web.
The chrooma+ approach to enrich video content using HTML5 BIBAFull-Text 479-480
  Philipp Oehme; Michael Krug; Fabian Wiedemann; Martin Gaedke
The Internet has become an important source for media content. Content types are not limited to text and pictures but also include video and audio. Currently audiovisual media is presented as it is. However, these media do not integrate the huge amount of related information, which is available on the Web. In this paper we present the Chrooma+ approach to improve the user experience of media consumption by enriching media content with additional information from various sources in the Web. Our approach focuses on the aggregation and combination of this related information with audiovisual media. This approach involves using new HTML5 technologies and with WebVTT a new annotation format to display relevant information at definite times. Some of the advantages of this approach are the usage of a rich annotation format and extensibility to include heterogeneous information sources.
Linking and visualizing television heritage: the EUscreen virtual exhibitions and the linked open data pilot BIBAFull-Text 481-484
  Johan Oomen; Vassilis Tzouvaras; Kati Hyyppaä
The EUscreen initiative represents the European television archives and acts as a domain aggregator for Europeana, Europe's digital library, which provides access to over 20 million digitized cultural objects. The main motivation for the initiative is to provide unified access to a representative collection of television programs, secondary sources and articles, and in this way to allow students, scholars and the general public to study the history of television in its wider context. This paper explores the EUscreen activities related to novel ways to present curated content and publishing EUscreen metadata as Linked Open Data.LSNA'13 Welcome and organization

LSNA'13 keynote talks

Online social networks: beyond popularity BIBAFull-Text 489-490
  Ricardo Baeza-Yates; Diego Saez-Trumper
One of the main differences between traditional Web analysis and online Social Networks (OSNs) studies, is that in the first case the information is organized around content, while in the second case it is organized around people. While search engines have done a good job finding relevant content across billions of pages, nowadays we do not have an equivalent tool to find relevant people in OSNs. Even though an impressive amount of research has been done in this direction, there are still a lot of gaps to cover. Although the first intuition could be (and was!) search for popular people, previous research have shown that users' in-degree (e.g. number of friends or followers) is important but not enough to represent the importance and reputation of a person. Another approach is to study the content of the messages exchanged between users, trying to identify topical experts. However the computational cost of such approach -- including language diversity -- is a big limitation. In our work we take a content-agnostic approach, focusing in frequency, type, and time properties of user actions rather than content, mixing their static characteristics (social graph) and their activities (dynamic graphs). Our goal is to understand the role of popular users in OSNs, and also find "hidden important users": do popular users create new trends and cascades? Do they add value to the network? And, if they don't, who does it? Our research provides preliminary answers for these questions.
Aggregating information from the crowd and the network BIBAFull-Text 491-492
  Anirban Dasgupta
In social systems, information often exists in a dispersed manner, as individual opinions, local insights and preferences. In order to make a global decision however, we need to be able to aggregate such local pieces of information into a global description of the system. Such information aggregation problems are key in setting up crowdsourcing or human computation systems. How do we formally build and analyze such information aggregation systems? In this talk we will discuss three different vignettes based on the particular information aggregation problem and the "social system" that we are extracting the information from.
   In our first result, we will analyze a crowdsourcing system consisting of a set of users and binary choice questions. Each user has a specific reliability that determines the user's error rate in answering the questions. We show how to give an unsupervised algorithm for aggregating the user answers in order to simultaneously derive the user expertise as well as the truth values of the questions.
   Our second result will deal with the case when there is an interacting user community on a question answer forum. User preferences of quality are now expressed in terms of ("best answer" and "thumbs up/down") votes cast on each other's content. We will analyze a set of possible factors that indicate bias in user voting behavior -- these factors encompass different gaming behavior, as well as other eccentricities. We address the problem of aggregating user preferences (votes) using a supervised machine learning framework to calibrate such votes. We will see that this supervised learning method of content-agnostic vote calibration can significantly improve the performance of answer ranking and expert ranking.
   The last part of the talk will describe how it is possible to exploit local insights that users have about their friends in order to improve the efficiency of surveying in a (networked) population. We will describe the notion of "social sampling", where participants in a poll respond with a summary of their friends' putative responses to the poll. The analysis of social sampling leads to novel trade-off questions: the savings in the number of samples (roughly the average size of neighborhood of participants) vs. the systematic bias in the poll due to the network structure. We show bounds on the variances of few such estimators -- experiments on real world networks show this to be a useful paradigm in obtaining accurate information with small number of samples.
The social meanings of social networks: integrating SNA and ethnography of social networking BIBAFull-Text 493-494
  Rogério de Paula
In this talk, I examine the manifest, emic meanings of social networking in the context of social network analysis and it uses this to discuss how the confluence of social science and computational sociology can contribute to a richer understanding of how emerging social technologies shape and are shaped by people's everyday practices.
Detecting malware with graph-based methods: traffic classification, botnets, and Facebook scams BIBAFull-Text 495-496
  Michalis Faloutsos
In this talk, we highlight two topics on security from our lab. First, we address the problem of Internet traffic classification (e.g. web, filesharing, or botnet?). We present a fundamentally different approach to classifying traffic that studies the network wide behavior by modeling the interactions of users as a graph. By contrast, most previous approaches use statistics such as packet sizes and inter-packet delays. We show how our approach gives rise to novel and powerful ways to: (a) visualize the traffic, (b) model the behavior of applications, and (c) detect abnormalities and attacks. Extending this approach, we develop ENTELECHEIA, a botnet-detection method. Tests with real data suggests that our graph-based approach is very promising.
   Second, we present, MyPageKeeper, a security Facebook app, with 13K downloads, which we deployed to: (a) quantify the presence of malware on Facebook, and (b) protect end-users. We designed MyPageKeeper in a way that strikes the balance between accuracy and scalability. Our initial results are scary and interesting: (a) malware is widespread, with 49% of our users are exposed to at least one malicious post from a friend, and (b) roughly 74% of all malicious posts contain links that point back to Facebook, and thus would evade any of the current web-based filtering approaches.
Mining and analyzing the enterprise knowledge graph BIBAFull-Text 497-498
  Ido Guy
Today's enterprises hold ever-growing amounts of public data, stemming from different organizational systems, such as development environments, CRM systems, business intelligence systems, and enterprise social media. This data unlocks rich and diverse information about entities, people, terms, and the relationships among them. A lot of insight can be gained through analyzing this knowledge graph, both by individual employees and by the organization as a whole. In this talk, I will review recent work done by the Social Technologies & Analytics group at IBM Research-Haifa to mine these relationships, represent them in a generalized model, and use the model for different aims within the enterprise, including social search [5], expertise location [1], social recommendation [2, 3], and network analysis [4].
Scaling graph computations at Facebook BIBAFull-Text 499-500
  Johan Ugander
With over a billion nodes and hundreds of billions of edges, scalability is at the forefront of concerns when dealing with the Facebook social graph. This talk will focus on two recent advances in graph computations at Facebook. The first focus concerns the development of a novel graph sharding algorithm -- Balanced Label Propagation -- for load-balancing distributed graph computations. Using Balanced Label Propagation, we were able to reduce by 50% the query time of Facebook's 'People You May Know' service, the realtime distributed system responsible for the feature extraction and ranking of the friends-of-friends of all active Facebook users. The second focus concerns the 2011 computation of the average distance distribution between all active Facebook users. This computation, which produced an average distance of 4.74, was made possible by two recent computational advances: Hyper-ANF, a modern probabilistic algorithm for computing distance distributions, and Layered Label Propagation, a modern compression scheme suited for social graphs. The details of how this computation was coordinated will be described. The talk describes joint work with Lars Backstrom, Paolo Boldi, Marco Rosa, and Sebastiano Vigna.

LSNA'13 technical presentations

Towards highly scalable pregel-based graph processing platform with x10 BIBAFull-Text 501-508
  Nguyen Thien Bao; Toyotaro Suzumura
Many practical computing problems concern large graph. Standard problems include web graph analysis and social networks analysis like Facebook, Twitter. The scale of these graph poses challenge to their efficient processing. To efficiently process large-scale graph, we create X-Pregel, a graph processing system based on Google's Computing Pregel model [1], by using the state-of-the-art PGAS programming language X10. We do not purely implement Google Pregel by using X10 language, but we also introduce two new features that do not exists in the original model to optimize the performance: (1) an optimization to reduce the number of messages which is exchanged among workers, (2) a dynamic re-partitioning scheme that effectively reassign vertices to different workers during the computation. Our performance evaluation demonstrates that our optimization method of sending messages achieves up to 200% speed up on Pagerank by reducing the network I/O to 10 times in comparison with the default method of sending messages when processing SCALE20 Kronecker graph [2](vertices = 1,048,576, edges = 33,554,432). It also demonstrates that our system processes large graph faster than prior implementation of Pregel such as GPS [3](stands for graph processing system) and Giraph [4].
A first view of exedra: a domain-specific language for large graph analytics workflows BIBAFull-Text 509-516
  Miyuru Dayarathna; Toyotaro Suzumura
In recent years, many programming models, software libraries, and middleware have appeared for processing large graphs of various forms. However, there exists a significant usability gap between the graph analysis scientists, and High Performance Computing (HPC) application programmers due to the complexity of HPC graph analysis software. In this paper we provide a basic view of Exedra, a domain-specific language (DSL) for large graph analysis in which we aim to eliminate the aforementioned complexities. Exedra consists of high level language constructs for specifying different graph analysis tasks on distributed environments. We implemented Exedra DSL on a scalable graph analysis platform called Dipper. Dipper uses Igraph/R interface for creating graph analysis workflows which in turn gets translated to Exedra statements. Exedra statements are interpreted by Dipper interpreter, and gets mapped to user specified libraries/middleware. Exedra DSL allows for synthesize of graph algorithms that are more efficient compared to bare use of graph libraries while maintaining a standard interface that could use even future graph analysis software. We evaluated Exedra's feasibility for expressing graph analysis tasks by running Dipper on a cluster of four nodes. We observed that Dipper has the ability of reducing the time taken for graph analysis when the workflow was distributed on all four nodes despite the communication, and data format conversion overhead of the Dipper framework.
Analysis of large scale climate data: how well climate change models and data from real sensor networks agree? BIBAFull-Text 517-526
  Santiago A. Nunes; Luciana A. S. Romani; Ana M. H. Avila; Priscila P. Coltri; Caetano, Jr. Traina; Robson L. F. Cordeiro; Elaine P. M. de Sousa; Agma J. M. Traina
Research on global warming and climate changes has attracted a huge attention of the scientific community and of the media in general, mainly due to the social and economic impacts they pose over the entire planet. Climate change simulation models have been developed and improved to provide reliable data, which are employed to forecast effects of increasing emissions of greenhouse gases on a future global climate. The data generated by each model simulation amount to Terabytes of data, and demand fast and scalable methods to process them. In this context, we propose a new process of analysis aimed at discriminating between the temporal behavior of the data generated by climate models and the real climate observations gathered from ground-based meteorological station networks. Our approach combines fractal data analysis and the monitoring of real and model-generated data streams to detect deviations on the intrinsic correlation among the time series defined by different climate variables. Our measurements were made using series from a regional climate model and the corresponding real data from a network of sensors from meteorological stations existing in the analyzed region. The results show that our approach can correctly discriminate the data either as real or as simulated, even when statistical tests fail. Those results suggest that there is still room for improvement of the state-of-the-art climate change models, and that the fractal-based concepts may contribute for their improvement, besides being a fast, parallelizable, and scalable approach.
Model of complex networks based on citation dynamics BIBAFull-Text 527-530
  Lovro Šubelj; Marko Bajec
Complex networks of real-world systems are believed to be controlled by common phenomena, producing structures far from regular or random. These include scale-free degree distributions, small-world structure and assortative mixing by degree, which are also the properties captured by different random graph models proposed in the literature. However, many (non-social) real-world networks are in fact disassortative by degree. Thus, we here propose a simple evolving model that generates networks with most common properties of real-world networks including degree disassortativity. Furthermore, the model has a natural interpretation for citation networks with different practical applications.
How social network is evolving?: a preliminary study on billion-scale Twitter network BIBAFull-Text 531-534
  Masaru Watanabe; Toyotaro Suzumura
Recently, social network services such as Twitter, Facebook, MySpace, LinkedIn have been remarkably growing. There are various studies about social networks analysis. Haewoon Kwak performed the analysis of the Twitter network on 2009 and shows the degree of separation. However, the number of users on 2009 is about 41.7 million, the graph scale is not very large compared with the current graph. In this paper, we conduct a Twitter network analysis in terms growth by region, scale-free, reciprocity, degree of separation and diameter using Twitter user data with 469.9 million users and 28.7 billion relationships. We report that the value of degree of separation is 4.59 in current Twitter network through our experiments.MABSDA'13 Welcome and organization

MABSDA'13 technical presentations

The web as a laboratory BIBAFull-Text 539-540
  Bebo White
Insights from Web Science and Big Data Analysis have led many researchers to the conclusion that the Web not only represents an almost unlimited data store but also a remarkable multi-disciplinary laboratory environment. A new challenge is how to best leverage the potential of this experimental space. What are the procedures for defining, implementing and evaluating "Web-scale" experiments? What are acceptable measures of robustness and repeatability? What are the opportunities for experimental collaboration? What disciplines are likely to benefit from this new research model? The Web Laboratory model provides an exciting new and fertile model for future research.
Like prediction: modeling like counts by bridging Facebook pages with linked data BIBAFull-Text 541-548
  Shohei Ohsawa; Yutaka Matsuo
Recent growth of social media has produced a new market for branding of people and businesses. Facebook provides Facebook Pages (Pages in short) for public figures and businesses (we call entities) to communicate with their fans through a Like button. Because Like counts sometimes reflect the popularity of entities, techniques to increase the Like count can be a matter of interest, and might be known as social media marketing. From an academic perspective, Like counts of Pages depend not only on the popularity of the entity, but also on the popularity of semantically related entities. For example, Lady Gaga's Page has many Likes; her song "Poker Face" does too. We can infer that her next song will acquire many Likes immediately. Important questions are these: How does the Like count of Lady Gaga affect the Like count of her song? Alternatively, how does the Like count of her song constitute some fraction of the Like count of Lady Gaga herself?
   As described in this paper, we strive to reveal the mutual influences of Like counts among semantically related entities. To measure the influence of related entities, we propose a problem called the Like prediction problem (LPP). It models Like counts of a given entity using information of related entities. The semantic relations among entities, expressed as RDF predicates, are obtained by linking each Page with the most similar DBpedia entity. Using the model learned by support vector regression (SVR) on LPP, we can estimate the Like count of a new entity e.g., Lady Gaga's new song. More importantly, we can analyze which RDF predicates are important to infer Like counts, providing a mutual influence network among entities. Our study comprises three parts: (1) crawling the Pages and their Like counts, (2) linking Pages to DBpedia, and (3) constructing features to solve the LPP. Our study, based on 20 million Pages with 30 billion Likes, is the largest-scale study of Facebook Likes ever reported. This research constitutes a new attempt to integrate unstructured emotional data such as Likes, with Linked data, and to provide new insights for branding with social media.
Tower of babel: a crowdsourcing game building sentiment lexicons for resource-scarce languages BIBAFull-Text 549-556
  Yoonsung Hong; Haewoon Kwak; Youngmin Baek; Sue Moon
With the growing amount of textual data produced by online social media today, the demands for sentiment analysis are also rapidly increasing; and, this is true for worldwide. However, non-English languages often lack sentiment lexicons, a core resource in performing sentiment analysis. Our solution, Tower of Babel (ToB), is a language-independent sentiment-lexicon-generating crowdsourcing game. We conducted an experiment with 135 participants to explore the difference between our solution and a conventional manual annotation method. We evaluated ToB in terms of effectiveness, efficiency, and satisfactions. Based on the result of the evaluation, we conclude that sentiment classification via ToB is accurate, productive and enjoyable.
Rule-based opinion target and aspect extraction to acquire affective knowledge BIBAFull-Text 557-564
  Stefan Gindl; Albert Weichselbraun; Arno Scharl
Opinion holder and opinion target extraction are among the most popular and challenging problems tackled by opinion mining researchers, recognizing the significant business value of such components and their importance for applications such as media monitoring and Web intelligence. This paper describes an approach that combines opinion target extraction with aspect extraction using syntactic patterns. It expands previous work limited by sentence boundaries and includes a heuristic for anaphora resolution to identify targets across sentences. Furthermore, it demonstrates the application of concepts known from research on open information extraction to the identification of relevant opinion aspects. Qualitative analyses performed on a corpus of 100,000 Amazon product reviews show that the approach is promising. The extracted opinion targets and aspects are useful for enriching common knowledge resources and opinion mining ontologies, and support practitioners and researchers to identify opinions in document collections.
A graph-based approach to commonsense concept extraction and semantic similarity detection BIBAFull-Text 565-570
  Dheeraj Rajagopal; Erik Cambria; Daniel Olsher; Kenneth Kwok
Commonsense knowledge representation and reasoning support a wide variety of potential applications in fields such as document auto-categorization, Web search enhancement, topic gisting, social process modeling, and concept-level opinion and sentiment analysis. Solutions to these problems, however, demand robust knowledge bases capable of supporting flexible, nuanced reasoning. Populating such knowledge bases is highly time-consuming, making it necessary to develop techniques for deconstructing natural language texts into commonsense concepts. In this work, we propose an approach for effective multi-word commonsense expression extraction from unrestricted English text, in addition to a semantic similarity detection technique allowing additional matches to be found for specific concepts not already present in knowledge bases.
Spanish knowledge base generation for polarity classification from masses BIBAFull-Text 571-578
  Arturo Montejo-Ráez; Manuel Carlos Díaz-Galiano; José Manuel Perea-Ortega; Luis Alfonso Ureña-López
This work presents a novel method for the generation of a knowledge base oriented to Sentiment Analysis from the continuous stream of published micro-blogs in social media services like Twitter. The method is simple in its approach and has shown to be effective compared to other knowledge based methods for Polarity Classification. Due to independence from language, the method has been tested on different Spanish corpora, with a minimal effort in the lexical resources involved. Although for two of the three studied corpora the obtained results did not improve those officially obtained on the same corpora, it should be noted that this is an unsupervised approach and the accuracy levels achieved were close to those levels obtained with well-known supervised algorithms.
Revised mutual information approach for German text sentiment classification BIBAFull-Text 579-586
  Farag Saad; Brigitte Mathiak
The significant increase in content of online social media such as product reviews, blogs, forums etc., have led to an increasing attention to sentiment analysis tools and approaches that make use of mining this substantially growing content. The aim of this paper is to develop a robust classification approach of customer reviews based on a self-annotated domain-specific corpus by applying a statistical approach i.e., mutual information. First, subjective words in each test sentence are identified. Second, ambiguous adjectives such as high, low, large, many etc., are disambiguated based on their accompanying noun using a conditional mutual information approach. Third, a mutual information approach is applied to find the sentiment orientation (polarity) of the identified subjective words based on analyzing their statistical relationship with the manually annotated sentiment labels within a sizeable sentiment training data. Fourth, since negation plays a significant role in flipping the sentiment polarity of an identified sentiment word, we estimate the role of negation in affecting the classification accuracy. Finally, the identified polarity for each test sentence is evaluated against experts' annotation.MSM'13 Welcome and organization

MSM'13 keynote talk

Urban: crowdsourcing for the good of London BIBAFull-Text 591-592
  Daniele Quercia
For the last few years, we have been studying existing social media sites and created new ones in the context of London. By combining what Twitter users in a variety of London neighborhoods talk about with census data, we showed that neighborhood deprivation was associated (positively and negatively) with use of emotion words (sentiment) [2] and with specific topics [5]. Users in more deprived neighborhoods tweeted about wedding parties, matters expressed in Spanish/Portuguese, and celebrity gossips. By contrast, those in less deprived neighborhoods tweeted about vacations, professional use of social media, environmental issues, sports, and health issues. Also, upon data about 76 million London underground and overground rail journeys, we found that people from deprived areas visited both other deprived areas and prosperous areas, while residents of better-off communities tended to only visit other privileged neighborhoods -- suggesting a geographic segregation effect [1, 6]. More recently, we created and launched two crowdsourcing websites. First, we launched urbanopticon.org, which extracts Londoners' mental images of the city. By testing which places are remarkable and unmistakable and which places represent faceless sprawl, we were able to draw the recognizability map of London. We found that areas with low recognizability did not fare any worse on the economic indicators of income, education, and employment, but they did significantly suffer from social problems of housing deprivation, poor living conditions, and crime [4]. Second, we launched urbangems.org. This crowdsources visual perceptions of quiet, beauty and happiness across the city using Google Street View pictures.
   The aim is to identify the visual cues that are generally associated with concepts difficult to define such beauty, happiness, quietness, or even deprivation. By using state-of-the-art image processing techniques, we determined the visual cues that make a place appear beautiful, quiet, and happy [3]: the amount of greenery was the most positively associated visual cue with each of three qualities; by contrast, broad streets, fortress-like buildings, and council houses tended to be negatively associated. These two sites offer the ability to conduct specific urban sociological experiments at scale. More generally, this line of work is at the crossroad of two emerging themes in computing research -- a crossroad where "web science" meets the "smart city" agenda.

MSM'13 machine learning & statistical analysis

Using topic models for Twitter hashtag recommendation BIBAFull-Text 593-596
  Fréderic Godin; Viktor Slavkovikj; Wesley De Neve; Benjamin Schrauwen; Rik Van de Walle
Since the introduction of microblogging services, there has been a continuous growth of short-text social networking on the Internet. With the generation of large amounts of microposts, there is a need for effective categorization and search of the data. Twitter, one of the largest microblogging sites, allows users to make use of hashtags to categorize their posts. However, the majority of tweets do not contain tags, which hinders the quality of the search results. In this paper, we propose a novel method for unsupervised and content-based hashtag recommendation for tweets. Our approach relies on Latent Dirichlet Allocation (LDA) to model the underlying topic assignment of language classified tweets. The advantage of our approach is the use of a topic distribution to recommend general hashtags.
FS-NER: a lightweight filter-stream approach to named entity recognition on Twitter data BIBAFull-Text 597-604
  Diego Marinho de Oliveira; Alberto H. F. Laender; Adriano Veloso; Altigran S. da Silva
Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messages are terse, poorly worded and posted in many different languages. Also, Twitter follows a streaming paradigm, imposing that entities must be recognized in real-time. In view of these challenges and the inappropriateness of existing tools, we propose a novel approach for Named Entity Recognition on Twitter data called FS-NER (Filter-Stream Named Entity Recognition). FS-NER is characterized by the use of filters that process unlabeled Twitter messages, being much more practical than existing supervised CRF-based approaches. Such filters can be combined either in sequence or in parallel in a flexible way. Moreover, because these filters are not language dependent, FS-NER can be applied to different languages without requiring a laborious adaptation. Through a systematic evaluation using three Twitter collections and considering seven types of entity, we show that FS-NER performs 3% better than a CRF-based baseline, besides being orders of magnitude faster and much more practical.

MSM'13 trend & topic detection in microposts

Nerding out on Twitter: fun, patriotism and #curiosity BIBAFull-Text 605-612
  Victoria Uren; Aba-Sah Dadzie
This paper presents an analysis of tweets collected over six days before, during and after the landing of the Mars Science Laboratory, known as Curiosity, in the Gale Crater on the 6th of August 2012. A sociological application of web science is demonstrated by use of parallel coordinate visualization as part of a mixed methods study. The results show strong, predominantly positive, international interest in the event. Scientific details dominated the stream, but, following the successful landing, other themes emerged such as fun, and national pride.
ET: events from tweets BIBAFull-Text 613-620
  Ruchi Parikh; Kamalakar Karlapalem
Social media sites such as Twitter and Facebook have emerged as popular tools for people to express their opinions on various topics. The large amount of data provided by these media is extremely valuable for mining trending topics and events. In this paper, we build an efficient, scalable system to detect events from tweets (ET). Our approach detects events by exploring their textual and temporal components. ET does not require any target entity or domain knowledge to be specified; it automatically detects events from a set of tweets. The key components of ET are (1) an extraction scheme for event representative keywords, (2) an efficient storage mechanism to store their appearance patterns, and (3) a hierarchical clustering technique based on the common co-occurring features of keywords. The events are determined through the hierarchical clustering process. We evaluate our system on two data-sets; one is provided by VAST challenge 2011, and the other published by US based users in January 2013. Our results show that we are able to detect events of relevance efficiently.

MSM'13 filtering & classification of microposts

Meaning as collective use: predicting semantic hashtag categories on Twitter BIBAFull-Text 621-628
  Lisa Posch; Claudia Wagner; Philipp Singer; Markus Strohmaier
This paper sets out to explore whether data about the usage of hashtags on Twitter contains information about their semantics. Towards that end, we perform initial statistical hypothesis tests to quantify the association between usage patterns and semantics of hashtags. To assess the utility of pragmatic features -- which describe how a hashtag is used over time -- for semantic analysis of hashtags, we conduct various hashtag stream classification experiments and compare their utility with the utility of lexical features. Our results indicate that pragmatic features indeed contain valuable information for classifying hashtags into semantic categories. Although pragmatic features do not outperform lexical features in our experiments, we argue that pragmatic features are important and relevant for settings in which textual information might be sparse or absent (e.g., in social video streams).
Towards linking buyers and sellers: detecting commercial Intent on Twitter BIBAFull-Text 629-632
  Bernd Hollerit; Mark Kröll; Markus Strohmaier
Since more and more people use the micro-blogging platform Twitter to convey their needs and desires, it has become a particularly interesting medium for the task of identifying commercial activities. Potential buyers and sellers can be contacted directly thereby opening up novel perspectives and economic possibilities. By detecting commercial intent in tweets, this work is considered a first step to bring together buyers and sellers. In this work, we present an automatic method for detecting commercial intent in tweets where we achieve reasonable precision 57% and recall 77% scores. In addition, we provide insights into the nature and characteristics of tweets exhibiting commercial intent thereby contributing to our understanding of how people express commercial activities on Twitter.

MSM'13 posters & demonstrations

MicroFilter: real time filtering of microblogging content BIBAFull-Text 633-634
  Ryadh Dahimene; Cédric du Mouza
Microblogging systems have become a major trend over the Web. After only 7 years of existence, Twitter for instance claims more than 500 million users with more than 350 billion delivered update each day. As a consequence the user must today manage possibly extremely large feeds, resulting in poor data readability and loss of valuable information and the system must face a huge network load. In this demonstration, we present and illustrate the features of MicroFilter (MF in the following), an inverted list-based filtering engine that nicely extends existing centralized microblogging systems by adding a real-time filtering feature. The demonstration proposed illustrates how the user experience is improved, the impact on the traffic for the overall system, and how the characteristics of microblogs drove the design of the indexing structures.
Some clues on irony detection in tweets BIBAFull-Text 635-636
  Aline A. Vanin; Larissa A. Freitas; Renata Vieira; Marco Bochernitsan
MSND'13 Welcome

MSND'13 technical presentations

Detection of spam tipping behaviour on foursquare BIBAFull-Text 641-648
  Anupama Aggarwal; Jussara Almeida; Ponnurangam Kumaraguru
In Foursquare, one of the currently most popular online location based social networking sites (LBSNs), users may not only check-in at specific venues but also post comments (or tips), sharing their opinions and previous experiences at the corresponding physical places. Foursquare tips, which are visible to everyone, provide venue owners with valuable user feedback besides helping other users to make an opinion about the specific venue. However, they have been the target of spamming activity by users who exploit this feature to spread tips with unrelated content.
   In this paper, we present what, to our knowledge, is the first effort to identify and analyze different patterns of tip spamming activity in Foursquare, with the goal of developing automatic tools to detect users who post spam tips -- tip spammers. A manual investigation of a real dataset collected from Foursquare led us to identify four categories of spamming behavior, viz. Advertising/Spam, Self-promotion, Abusive and Malicious. We then applied machine learning techniques, jointly with a selected set of user, social and tip's content features associated with each user, to develop automatic detection tools. Our experimental results indicate that we are able to not only correctly distinguish legitimate users from tip spammers with high accuracy (89.76%) but also correctly identify a large fraction (at least 78.88%) of spammers in each identified category.
The role of research leaders on the evolution of scientific communities BIBAFull-Text 649-656
  Bruno Leite Alves; Fabrício Benevenuto; Alberto H. F. Laender
There have been considerable efforts in the literature towards understanding and modeling dynamic aspects of scientific communities. Despite the great interest, little is known about the role that different members play in the formation of the underlying network structure of such communities. In this paper, we provide a wide investigation of the roles that members of the core of scientific communities play in the collaboration network structure formation and evolution. To do that, we define a community core based on an individual metric, core score, which is an h-index derived metric that captures both, the prolificness and the involvement of researchers in a community. Our results provide a number of key observations related to community formation and evolving patterns. Particularly, we show that members of the community core work as bridges that connect smaller clustered research groups. Furthermore, these members are responsible for an increase in the average degree of the whole community underlying network and a decrease on the overall network assortativeness. More important, we note that variations on the members of the community core tend to be strongly correlated with variations on these metrics. We argue that our observations are important for shedding a light on the role of key members on community formation and structure.
Analyzing and predicting viral tweets BIBAFull-Text 657-664
  Maximilian Jenders; Gjergji Kasneci; Felix Naumann
Twitter and other microblogging services have become indispensable sources of information in today's web. Understanding the main factors that make certain pieces of information spread quickly in these platforms can be decisive for the analysis of opinion formation and many other opinion mining tasks.
   This paper addresses important questions concerning the spread of information on Twitter. What makes Twitter users retweet a tweet? Is it possible to predict whether a tweet will become "viral", i.e., will be frequently retweeted? To answer these questions we provide an extensive analysis of a wide range of tweet and user features regarding their influence on the spread of tweets. The most impactful features are chosen to build a learning model that predicts viral tweets with high accuracy. All experiments are performed on a real-world dataset, extracted through a public Twitter API based on user IDs from the TREC 2011 microblog corpus.
Resolving homonymy with correlation clustering in scholarly digital libraries BIBAFull-Text 665-672
  Jeongin Ju; Hosung Park; Sue Moon
As scholarly data increases rapidly, scholarly digital libraries, supplying publication data through convenient online interfaces, become popular and important tools for researchers. Researchers use SDLs for various purposes, including searching the publications of an author, assessing one's impact by the citations, and identifying one's research topics. However, common names among authors cause difficulties in correctly identifying one's works among a large number of scholarly publications. Abbreviated first and middle names make it even harder to identify and distinguish authors with the same representation (i.e. spelling) of names. Several disambiguation methods have solved the problem under their own assumptions. The assumptions are usually that inputs such as the number of same-named authors, training sets, or rich and clear information about papers are given. Considering the size of scholarship records today and their inconsistent formats, we expect their assumptions be very hard to be met. We use common assumption that coauthors are likely to write more than one paper together and propose an unsupervised approach to group papers from the same author only using the most common information, author lists. We represent each paper as a point in an author name space, take dimension reduction to find author names shown frequently together in papers, and cluster papers with vector similarity measure well fitted for name disambiguation task. The main advantage of our approach is to use only coauthor information as input. We evaluate our method using publication records collected from DBLP, and show that our approach results in better disambiguation compared to other five clustering methods in terms of cluster purity and fragmentation.
Examining lists on Twitter to uncover relationships between following, membership and subscription BIBAFull-Text 673-676
  Srikar Velichety; Sudha Ram
We report on an exploratory analysis of pairwise relationships between three different forms of information consumption on Twitter viz., following, listing and subscribing. We develop a systematic framework to examine the relationships between these three forms. Using our framework, we conducted an empirical analysis of a dataset from Twitter. Our results show that people not only consume information by explicitly following others, but also by listing and subscribing to lists and that the people they list or subscribe to are not the same as the ones they follow. Our work has implications for understanding information propagation and diffusion via Twitter and for generating recommendations for adding users to lists, subscribing and merging or splitting them.PHDA'13 Welcome and organization

PHDA'13 technical presentations

A proposal for automatic diagnosis of malaria: extended abstract BIBAFull-Text 681-682
  Allisson D. Oliveira; Giordano Cabral; D. López; Caetano Firmo; F. Zarzuela Serrat; J. Albuquerque
This paper presents a methodology for automatic diagnosis of malaria using computer vision techniques combined with artificial intelligence. We had obtained an accuracy rate of 74% in the detection system.
Vaccine attitude surveillance using semantic analysis: constructing a semantically annotated corpus BIBAFull-Text 683-686
  Stephanie Brien; Nona Naderi; Arash Shaban-Nejad; Luke Mondor; Doerthe Kroemker; David L. Buckeridge
This paper reports work in progress to semantically annotate blog posts about vaccines to use in the Vaccine Attitude Surveillance using Semantic Analysis (VASSA) framework. The VASSA framework combines semantic web and natural language processing (NLP) tools and techniques to provide a coherent semantic layer across online social media for assessment and analysis of vaccination attitudes and beliefs. We describe how the blog posts were sampled and selected, our schema to semantically annotate concepts defined in our ontology, details of the annotation process, and inter-annotator agreement on a sample of blog posts.
A roadmap to integrated digital public health surveillance: the vision and the challenges BIBAFull-Text 687-694
  Patty Kostkova
The exponentially increasing stream of real time big data produced by Web 2.0 Internet and mobile networks created radically new interdisciplinary challenges for public health and computer science. Traditional public health disease surveillance systems have to utilize the potential created by new situation-aware realtime signals from social media, mobile/sensor networks and citizens? participatory surveillance systems providing invaluable free realtime event-based signals for epidemic intelligence. However, rather than improving existing isolated systems, an integrated solution bringing together existing epidemic intelligence systems scanning news media (e.g., GPHIN, MedISys) with real-time social media intelligence (e.g., Twitter, participatory systems) is required to substantially improve and automate early warning, outbreak detection and preparedness operations. However, automatic monitoring and novel verification methods for these multichannel event-based real time signals has to be integrated with traditional case-based surveillance systems from microbiological laboratories and clinical reporting. Finally, the system needs effectively support coordination of epidemiological teams, risk communication with citizens and implementation of prevention measures.
   However, from computational perspective, signal detection, analysis and verification of very high noise realtime big data provide a number of interdisciplinary challenges for computer science. Novel approaches integrating current systems into a digital public health dashboard can enhance signal verification methods and automate the processes assisting public health experts in providing better informed and more timely response. In this paper, we describe the roadmap to such a system, components of an integrated public health surveillance services and computing challenges to be resolved to create an integrated real world solution.
Participatory disease surveillance in Latin America BIBAFull-Text 695-696
  Michael Johansson; Oktawia Wojcik; Rumi Chunara; Mark Smolinski; John Brownstein
Participatory disease surveillance systems are dynamic, sensitive, and accurate. They also offer an opportunity to directly connect the public to public health. Implementing them in Latin America requires targeting multiple acute febrile illnesses, designing a system that is appropriate and scalable, and developing local strategies for encouraging participation.
Crowdsourced risk factors of influenza-like-illness in Mexico BIBAFull-Text 697-698
  Natalia Barbara Mantilla-Beniers; Rocio Rodriguez-Ramirez; Christopher Rhodes Stephens
Monitoring of influenza like illnesses (ILI) using the Internet has become more common since its beginnings nearly a decade ago. The initial project of Der Grote Griep Meting was launched in 2003 in the Netherlands and Belgium. It was designed as a means of engaging people in matters of scientific and public health importance, and indeed attracted participation from over 30,000 people in its first year. Its success thus gathered a wealth of potentially valuable epidemiological data complementary to those obtained through the established disease surveillance networks, and linked to rich background information on each participant. Since then, there has been an accelerated increase in the number of countries hosting similar websites, and many of these have generated rather promising results
   In this talk, an analysis of the data from the Mexican monitoring website, "Reporta" is presented, and the risk factors that are linked to reporting of ILI symptoms among its participants are determined and analyzed. The data base gathered from the launching of Reporta in May 2009 to September 2011 is used for this purpose. The definition of suspect ILI case employed by the Mexican Health Ministry is applied to distinguish a class C of participants; the traits gathered in the background questionnaire are labeled Xi. Risk associated to any given trait Xi is evaluated by considering the difference between the frequency with which C occurs among participants with trait Xi and in the general population. This difference is then normalized to assess its statistical significance
   Interestingly, while some of the results confirm the suspected importance of certain traits indicative of enhanced susceptibility or a large contact network, others are unexpected and must be interpreted within an adequate framework. Thus, a taxonomy of background traits is proposed to aid interpretation, and tested through a new assessment of the associated risks. This work illustrates a way in which Internet-based monitoring can contribute to our understanding of disease spread.
Validating models for disease detection using Twitter BIBAFull-Text 699-702
  Todd Bodnar; Marcel Salathé
Data mining social media has become a valuable resource for infectious disease surveillance. However, there are considerable risks associated with incorrectly predicting an epidemic. The large amount of social media data combined with the small amount of ground truth data and the general dynamics of infectious diseases present unique challenges when evaluating model performance. In this paper, we look at several methods that have been used to assess influenza prevalence using Twitter. We then validate them with tests that are designed to avoid and illustrate issues with the standard k-fold cross validation method. We also find that small modifications to the way that data are partitioned can have major effects on a model's reported performance.
Combining Twitter and media reports on public health events in MedISys BIBAFull-Text 703-718
  Erik van der Goot; Hristo Tanev; Jens P. Linge
We describe the harvesting and subsequent analysis of tweets that are linked to media reports on public health events in order to identify which Internet resources are being referred to in these tweets. The aim was to automatically detect resources that are traditionally not considered mainstream media, but play a role in the discussion of public health events on the Internet. Interestingly, our initial evaluation of the results showed that most references related to public health events lead to traditional news media sites, even though URLs to non-traditional media receive a higher rank. We will briefly describe the Medical Information System (MedISys) and the methodology used to obtain and analyse tweets.PSOM'13 Welcome and organization

PSOM'13 technical presentations

Preserving user privacy from third-party applications in online social networks BIBAFull-Text 723-728
  Yuan Cheng; Jaehong Park; Ravi Sandhu
Online social networks (OSNs) facilitate many third-party applications (TPAs) that offer users additional functionality and services. However, they also pose serious user privacy risk as current OSNs provide little control over disclosure of user data to TPAs. Addressing the privacy and security issues related to TPAs (and the underlying social networking platforms) requires solutions beyond a simple all-or-nothing strategy. In this paper, we outline an access control framework that provides users flexible controls over how TPAs can access user data and activities in OSNs while still retaining the functionality of TPAs. The proposed framework specifically allows TPAs to utilize some private data without actually transmitting this data to TPAs. Our approach determines access from TPAs based on user-specified policies in terms of relationships between the user and the application.
Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy BIBAFull-Text 729-736
  Aditi Gupta; Hemank Lamba; Ponnurangam Kumaraguru; Anupam Joshi
In today's world, online social media plays a vital role during real world events, especially crisis events. There are both positive and negative effects of social media coverage of events, it can be used by authorities for effective disaster management or by malicious entities to spread rumors and fake news. The aim of this paper, is to highlight the role of Twitter, during Hurricane Sandy (2012) to spread fake images about the disaster. We identified 10,350 unique tweets containing fake images that were circulated on Twitter, during Hurricane Sandy. We performed a characterization analysis, to understand the temporal, social reputation and influence patterns for the spread of fake images. Eighty six percent of tweets spreading the fake images were retweets, hence very few were original tweets. Our results showed that top thirty users out of 10,215 users (0.3%) resulted in 90% of the retweets of fake images; also network links such as follower relationships of Twitter, contributed very less (only 11%) to the spread of these fake photos URLs. Next, we used classification models, to distinguish fake images from real images of Hurricane Sandy. Best results were obtained from Decision Tree classifier, we got 97% accuracy in predicting fake images from real. Also, tweet based features were very effective in distinguishing fake images tweets from real, while the performance of user based features was very poor. Our results, showed that, automated techniques can be used in identifying real images from fake images posted on Twitter.
A pilot study of cyber security and privacy related behavior and personality traits BIBAFull-Text 737-744
  Tzipora Halevi; James Lewis; Nasir Memon
Recent research has begun to focus on the factors that cause people to respond to phishing attacks as well as affect user behavior on social networks. This study examines the correlation between the Big Five personality traits and email phishing response. Another aspect examined is how these factors relate to users' tendency to share information and protect their privacy on Facebook (which is one of the most popular social networking sites).
   This research shows that when using a prize phishing email, neuroticism is the factor most correlated to responding to this email, in addition to a gender-based difference in the response. This study also found that people who score high on the openness factor tend to both post more information on Facebook as well as have less strict privacy settings, which may cause them to be susceptible to privacy attacks. In addition, this work detected no correlation between the participants estimate of being vulnerable to phishing attacks and actually being phished, which suggests susceptibility to phishing is not due to lack of awareness of the phishing risks and that real-time response to phishing is hard to predict in advance by online users.
   The goal of this study is to better understand the traits that contribute to online vulnerability, for the purpose of developing customized user interfaces and secure awareness education, designed to increase users' privacy and security in the future.
Twitter (R)evolution: privacy, free speech and disclosure BIBAFull-Text 745-750
  Lilian Edwards; Andrea M. Matwyshyn
Using Twitter as a case study, this paper sets forth the legal tensions faced by social networks that seek to defend privacy interests of users. Recent EC and UN initiatives have begun to suggest an increased role for corporations as protectors of human rights. But, as yet, binding rather than voluntary obligations of this kind under international human rights law seem either non-existent or highly conflicted, and structural limitations to such a shift may currently exist under both U.S. and UK law. Companies do not face decisions regarding disclosure in a vacuum, rather they face them constrained by existing obligations under (sometimes conflicting) legal demands. Yet, companies such as Twitter are well-positioned to be advocates for consumers' interests in these legal debates. Using several recent corporate disclosure decisions regarding user identity as illustration, this paper places questions of privacy, free speech and disclosure in broader legal context. More scholarship is needed on the mechanics of how online intermediaries, especially social media, manage their position as crucial speech platforms in democratic as well as less democratic regimes.
How to hack into Facebook without being a hacker BIBAFull-Text 751-754
  Tarun Parwani; Ramin Kholoussi; Panagiotis Karras
The proliferation of online social networking services has aroused privacy concerns among the general public. The focus of such concerns has typically revolved around providing explicit privacy guarantees to users and letting users take control of the privacy-threatening aspects of their online behavior, so as to ensure that private personal information and materials are not made available to other parties and not used for unintended purposes without the user's consent. As such protective features are usually opt-in, users have to explicitly opt-in for them in order to avoid compromising their privacy. Besides, third-party applications may acquire a user's personal information, but only after they have been granted consent by the user. If we also consider potential network security attacks that intercept or misdirect a user's online communication, it would appear that the discussion of user vulnerability has accurately delimited the ways in which a user may be exposed to privacy threats.
   In this paper, we expose and discuss a previously unconsidered avenue by which a user's privacy can be gravely exposed. Using this exploit, we were able to gain complete access to some popular online social network accounts without using any conventional method like phishing, brute force, or trojans. Our attack merely involves a legitimate exploitation of the vulnerability created by the existence of obsolete web-based email addresses. We present the results of an experimental study on the spread that such an attack can reach, and the ethical dilemmas we faced in the process. Last, we outline our suggestions for defense mechanisms that can be employed to enhance online security and thwart the kind of attacks that we expose.
A cross-cultural framework for protecting user privacy in online social media BIBAFull-Text 755-762
  Blase Ur; Yang Wang
Social media has become truly global in recent years. We argue that support for users' privacy, however, has not been extended equally to all users from around the world. In this paper, we survey existing literature on cross-cultural privacy issues, giving particular weight to work specific to online social networking sites. We then propose a framework for evaluating the extent to which social networking sites' privacy options are offered and communicated in a manner that supports diverse users from around the world. One aspect of our framework focuses on cultural issues, such as norms regarding the use of pseudonyms or posting of photographs. A second aspect of our framework discusses legal issues in cross-cultural privacy, including data-protection requirements and questions of jurisdiction. The final part of our framework delves into user expectations regarding the data-sharing practices and the communication of privacy information. The framework can enable service providers to identify potential gaps in support for user privacy. It can also help researchers, regulators, or consumer advocates reason systematically about cultural differences related to privacy in social media.
Privacy nudges for social media: an exploratory Facebook study BIBAFull-Text 763-770
  Yang Wang; Pedro Giovanni Leon; Kevin Scott; Xiaoxuan Chen; Alessandro Acquisti; Lorrie Faith Cranor
Anecdotal evidence and scholarly research have shown that a significant portion of Internet users experience regrets over their online disclosures. To help individuals avoid regrettable online disclosures, we employed lessons from behavioral decision research and research on soft paternalism to design mechanisms that "nudge" users to consider the content and context of their online disclosures before posting them. We developed three such privacy nudges on Facebook. The first nudge provides visual cues about the audience for a post. The second nudge introduces time delays before a post is published. The third nudge gives users feedback about their posts. We tested the nudges in a three-week exploratory field trial with 21 Facebook users, and conducted 13 follow-up interviews. Our system logs, results from exit surveys, and interviews suggest that privacy nudges could be a promising way to prevent unintended disclosure. We discuss limitations of the current nudge designs and future directions for improvement.RAMSS'13 Welcome and organization

RAMSS'13 keynote talks

Real-time user modeling and prediction: examples from YouTube BIBAFull-Text 775-776
  Ramesh R. Sarukkai
Real-time analysis and modeling of users for improving engagement, and interaction is a burgeoning area of interest with applications to web sites, social networks and mobile applications. Apart from scalability issues, this domain poses a number of modeling and algorithmic challenges. In this talk, as an illustrative example, we present DAL, a system that leverages real-time user activity/signals for dynamic ad loads, and designed to improve the overall user experience on YouTube. This system uses machine learning to optimize for user activity during a visit and helps decide on real-time advertising policies dynamically for the user. We conclude the talk with challenges and opportunities in this important area of real-time user analysis and social modeling.
SAMOA: a platform for mining big data streams BIBAFull-Text 777-778
  Gianmarco De Francisci Morales
Social media and user generated content are causing an ever growing data deluge. The rate at which we produce data is growing steadily, thus creating larger and larger streams of continuously evolving data. Online news, micro-blogs, search queries are just a few examples of these continuous streams of user activities. The value of these streams relies in their freshness and relatedness to ongoing events. However, current (de-facto standard) solutions for big data analysis are not designed to deal with evolving streams.
   In this talk, we offer a sneak preview of SAMOA, an upcoming platform for mining dig data streams. SAMOA is a platform for online mining in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as S4 and Storm. SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. Finally, SAMOA will soon be open sourced in order to foster collaboration and research on big data stream mining.

RAMSS'13 session 1

Towards real-time collaborative filtering for big fast data BIBAFull-Text 779-780
  Ernesto Diaz-Aviles; Wolfgang Nejdl; Lucas Drumond; Lars Schmidt-Thieme
The Web of people is highly dynamic and the life experiences between our on-line and "real-world" interactions are increasingly interconnected. For example, users engaged in the Social Web more and more rely upon continuous social streams for real-time access to information and fresh knowledge about current affairs. However, given the deluge of data items, it is a challenge for individuals to find relevant and appropriately ranked information at the right time. Having Twitter as test bed, we tackle this information overload problem by following an online collaborative approach. That is, we go beyond the general perspective of information finding in Twitter, that asks: "What is happening right now?", towards an individual user perspective, and ask: "What is interesting to me right now within the social media stream?". In this paper, we review our recently proposed online collaborative filtering algorithms and outline potential research directions.
Detecting real-time burst topics in microblog streams: how sentiment can help BIBAFull-Text 781-782
  Lumin Zhang; Yan Jia; Bin Zhou; Yi Han
Microblog has become an increasing valuable resource of up-to-date topics about what is happening in the world. In this paper, we propose a novel approach of detecting real-time events in microblog streams based on bursty sentiments detection. Instead of traditional sentiment orientation like positive, negative and neutral, we use sentiment vector as our sentiment model to abstract subjective messages which are then used to detect bursts and clustered into new events. Experimental evaluations show that our approach could perform effectively for online event detection. Although we worked with Chinese in our research, the technique can be used with any other language.
Sub-event detection during natural hazards using features of social media data BIBAFull-Text 783-788
  Dhekar Abhik; Durga Toshniwal
Social networking sites such as Flickr, YouTube, Facebook, etc. contain a huge amount of user-contributed data for a variety of real-world events. These events can be some natural calamities such as earthquakes, floods, forest fires, etc. or some man-made hazards like riots. This work focuses on getting better knowledge about a natural hazard event using the data available from social networking sites. Rescue and relief activities in emergency situations can be enhanced by identifying sub-events of a particular event. Traditional topic discovery techniques used for event identification in news data cannot be used for social media data because social network data may be unstructured. To address this problem the features or metadata associated with social media data can be exploited. These features can be user-provided annotations (e.g., title, description) and automatically generated information (e.g., content creation time). Considerable improvement in performance is observed by using multiple features of social media data for sub-event detection rather than using individual feature. Proposed here is a two-step process. In the first step, clusters are formed from social network data using relevant features individually. Based on the significance of features weights are assigned to them. And in the second step all the clustering solutions formed in the first step are combined in a principal weighted manner to give the final clustering solution. Each cluster represents a sub-event for a particular natural hazard.

RAMSS'13 session 2

MediaFinder: collect, enrich and visualize media memes shared by the crowd BIBAFull-Text 789-790
  Raphaël Troncy; Vuk Milicic; Giuseppe Rizzo; José Luis Redondo García
Social networks play an increasingly important role for sharing media items related to human's activities, feelings, emotions and conversations opening a window to the world in real-time. However, these images and videos are spread over multiple social networks. In this paper, we first describe a so-called media server that collect recent images and videos which can be potentially attached to an event. These media items can then be used for the automatic generation of visual summaries. However, making sense out of the resulting media galleries is an extremely challenging task. We present a framework that leverages on: (i) visual features from media items for near-deduplication and (ii) textual features from status updates to enrich, cluster and generate storyboards. A prototype is publicly available at http://mediafinder.eurecom.fr.
MJ no more: using concurrent wikipedia edit spikes with social network plausibility checks for breaking news detection BIBAFull-Text 791-794
  Thomas Steiner; Seth van Hooland; Ed Summers
We have developed an application called Wikipedia Live Monitor that monitors article edits on different language versions of Wikipedia -- as they happen in realtime. Wikipedia articles in different languages are highly interlinked. For example, the English article "en:2013_Russian_meteor_event" on the topic of the February 15 meteoroid that exploded over the region of Chelyabinsk Oblast, Russia, is interlinked with "ru:ПaДehne_meteopnta_ha_Ypajie_B_2013_roДy?, the Russian article on the same topic. As we monitor multiple language versions of Wikipedia in parallel, we can exploit this fact to detect concurrent edit spikes of Wikipedia articles covering the same topics, both in only one, and in different languages. We treat such concurrent edit spikes as signals for potential breaking news events, whose plausibility we then check with full-text cross-language searches on multiple social networks. Unlike the reverse approach of monitoring social networks first, and potentially checking plausibility on Wikipedia second, the approach proposed in this paper has the advantage of being less prone to false-positive alerts, while being equally sensitive to true-positive events, however, at only a fraction of the processing cost. A live demo of our application is available online at the URL http://wikipedia-irc.herokuapp.com/, the source code is available under the terms of the Apache 2.0 license at https://github.com/tomayac/wikipedia-irc.
Real time discussion retrieval from Twitter BIBAFull-Text 795-800
  Dmitrijs Milajevs; Gosse Bouma
While social media receive a lot of attention from the scientific community in general, there is little work on high recall retrieval of messages relevant to a discussion. Hash tag based search is widely used for data retrieval from social media. This work shows limitations of this approach, because the majority of the relevant messages do not even contain any hash tag, and unpredictable hash tags are used as the conversation evolves in time. To overcome these limitations, we propose an alternative retrieval method. Given an input stream of messages as an example of the discussion, our method extracts the most relevant words from it and queries the social network for more messages with these words. Our method filters messages that do not belong to the discussion using an LDA topic model. We demonstrate this concept on manually built collections of tweets about major sport and music events.SIMPLEX'13 Welcome and organization

SIMPLEX'13 technical session 1

Characterizing branching processes from sampled data BIBAFull-Text 805-812
  Fabricio Murai; Bruno Ribeiro; Donald Towsley; Krista Gile
Branching processes model the evolution of populations of agents that randomly generate offspring (children). These processes, more patently Galton-Watson processes, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena using a Galton-Watson process is the choice of the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose two methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents and instrumented with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population.
Resilience of dynamic overlays through local interactions BIBAFull-Text 813-820
  Stefano Ferretti
This paper presents a self-organizing protocol for dynamic (unstructured P2P) overlay networks, which allows to react to the variability of node arrivals and departures. Through local interactions, the protocol avoids that the departure of nodes causes a partitioning of the overlay. We show that it is sufficient to have knowledge about 1st and 2nd neighbours, plus a simple interaction P2P protocol, to make unstructured networks resilient to node faults. A simulation assessment over different kinds of overlay networks demonstrates the viability of the proposal.
Fast centrality-driven diffusion in dynamic networks BIBAFull-Text 821-828
  Abraão Guimarães; Alex B. Vieira; Ana Paula Couto Silva; Artur Ziviani
Diffusion processes in complex dynamic networks can arise, for instance, on data search, data routing, and information spreading. Therefore, understanding how to speed up the diffusion process is an important topic in the study of complex dynamic networks. In this paper, we shed light on how centrality measures and node dynamics coupled with simple diffusion models can help on accelerating the cover time in dynamic networks. Using data from systems with different characteristics, we show that if dynamics is disregarded, network cover time is highly underestimated. Moreover, using centrality accelerates the diffusion process over a different set of complex dynamic networks when compared with the random walk approach. For the best case, in order to cover 80% of nodes, fast centrality-driven diffusion reaches an improvement of 60%, i.e. when next-hop nodes are selected by using centrality measures. Additionally, we also propose and present the first results on how link prediction can help on speeding up the diffusion process in dynamic networks.
Unveiling Zeus: automated classification of malware samples BIBAFull-Text 829-832
  Abedelaziz Mohaisen; Omar Alrawi
Malware family classification is an age old problem that many Anti-Virus (AV) companies have tackled. There are two common techniques used for classification, signature based and behavior based. Signature based classification uses a common sequence of bytes that appears in the binary code to identify and detect a family of malware. Behavior based classification uses artifacts created by malware during execution for identification. In this paper we report on a unique dataset we obtained from our operations and classified using several machine learning techniques using the behavior-based approach. Our main class of malware we are interested in classifying is the popular Zeus malware. For its classification we identify 65 features that are unique and robust for identifying malware families. We show that artifacts like file system, registry, and network features can be used to identify distinct malware families with high accuracy -- in some cases as high as 95 percent.

SIMPLEX'13 technical session 2

Using link semantics to recommend collaborations in academic social networks BIBAFull-Text 833-840
  Michele A. Brandão; Mirella M. Moro; Giseli Rabello Lopes; José P. M. Oliveira
Social network analysis (SNA) has been explored in many contexts with different goals. Here, we use concepts from SNA for recommending collaborations in academic networks. Recent work shows that research groups with well connected academic networks tend to be more prolific. Hence, recommending collaborations is useful for increasing a group's connections, then boosting the group research as a collateral advantage. In this work, we propose two new metrics for recommending new collaborations or intensification of existing ones. Each metric considers a social principle (homophily and proximity) that is relevant within the academic context. The focus is to verify how these metrics influence in the resulting recommendations. We also propose new metrics for evaluating the recommendations based on social concepts (novelty, diversity and coverage) that have never been used for such a goal. Our experimental evaluation shows that considering our new metrics improves the quality of the recommendations when compared to the state-of-the-art.
Addressing the privacy management crisis in online social networks BIBAFull-Text 841-842
  Krishna P. Gummadi
The sharing of personal data has emerged as a popular activity over online social networking sites like Facebook. As a result, the issue of online social network privacy has received significant attention in both the research literature and the mainstream media. Our overarching goal is to improve defaults and provide better tools for managing privacy, but we are limited by the fact that the full extent of the privacy problem remains unknown; there is little quantification of the incidence of incorrect privacy settings or the difficulty users face when managing their privacy. In this talk, I will first focus on measuring the disparity between the desired and actual privacy settings, quantifying the magnitude of the problem of managing privacy. Later, I will discuss how social network analysis techniques can be leveraged towards addressing the privacy management crisis.SNOW'13 Welcome and organization

SNOW'13 opening

Social media, journalism and the public BIBAFull-Text 847-848
  Steve Schifferes
This paper draws on the parallels between the current period and other periods of historic change in journalism to examine what is new in today's world of social media and what continuities there are with the past. It examines the changing relationship between the public and the press and how it is being continuously reinterpreted. It addresses the questions of whether we are the beginning or end of a process of revolutionary media change.
Weaving a safe web of news BIBAFull-Text 849-852
  Kanak Kiscuitwala; Willem Bult; Mathias Lécuyer; T. J. Purtell; Madeline K. B. Ross; Augustin Chaintreau; Chris Haseman; Monica S. Lam; Susan E. McGregor
The rise of social media and data-capable mobile devices in recent years has transformed the face of global journalism, supplanting the broadcast news anchor with a new source for breaking news: the citizen reporter. Social media's decentralized networks and instant re-broadcasting mechanisms mean that the reach of a single tweet can easily trump that of the most powerful broadcast satellite. Brief, text-based and easy to translate, social messages allow news audiences to skip the middleman and get news "straight from the source."
   Whether used by "citizen" or professional reporters, however, social media technologies can also pose risks that endanger these individuals and, by extension, the press as a whole. First, social media platforms are usually proprietary, leaving users' data and activities on the system open to scrutiny by collaborating companies and/or governments. Second, the networks upon which social media reporting relies are inherently fragile, consisting of easily targeted devices and relatively centralized message-routing systems that authorities may block or simply shut down. Finally, this same privileged access can be used to flood the network with inaccurate or discrediting messages, drowning the signal of real events in misleading noise.
   A citizen journalist can be anyone who is simply in the right place at the right time. Typically untrained and unevenly tech-savvy, citizen reporters are unaccustomed to thinking of their social media activities as high-risk, and may not consider the need to defend themselves against potential threats. Though often part of a crowd, they may have no formal affiliations; if targeted for retaliation, they may have nowhere to turn for help. The dangers citizen journalists face are personal and physical. They may be targeted in the act of reporting, and/or online through the tracking of their digital communications. Addressing their needs for protection, resilience, and recognition requires a move away from the major assumptions of in vitro communication security. For citizen journalists using social networks, the adversary is already inside, as the network itself may be controlled or influenced by the threatening party, while "outside" nodes, such as public figures, protest organizers, and other journalists can be trusted to handle content appropriately. In these circumstances there can be no seamless, guaranteed solution. Yet the need remains for technologies that improve the security of these journalists who in many cases may constitute a region's only independent press.
   In this paper, we argue that a comprehensive and collaborative effort is required to make publishing and interacting with news websites more secure. Journalists typically enjoy stronger legal protection at least in some countries, such as the United States. However, this protection may prove ineffective, as many online tools compromise source protection. In the remaining sections, we identify a set of discussion topics and challenges to encourage a broader research agenda aiming to address jointly the need for social features and security for citizens journalists and readers alike. We believe communication technologies should embrace the methods and possibilities of social news rather than treating this as a pure security problem. We briefly touch upon a related initiative, Dispatch, that focuses on providing security to citizen journalists for publisihing content.

SNOW'13 breaking the news

Traffic prediction and discovery of news via news crowds BIBFull-Text 853-854
  Carlos Castillo
Who broke the news?: an analysis on first reports of news events BIBAFull-Text 855-862
  Matthias Gallé; Jean-Michel Renders; Eric Karstens
We present a data-driven study on which sources were the first to report on news events. For this, we implemented a news-aggregator that included a large number of established news sources and covered one year of data. We present a novel framework that is able to retrieve a large number of events and not only the most salient ones, while at the same time making sure that they are not exclusively of local impact.
   Our analysis then focuses on different aspects of the news cycle. In particular we analyze which are the sources to break most of the news. By looking when certain events become bursty, we are able to perform a finer analysis on those events and the associated sources that dominate the global news-attention. Finally we study the time it takes news outlet to report on these events and how this reects different strategies of which news to report.
   A general finding of our study is that big news agencies remain an important threshold to cross to bring global attention to particular news, but it also shows the importance of focused (by region or topic) outlets.

SNOW'13 social news

Finding news curators in Twitter BIBAFull-Text 863-870
  Janette Lehmann; Carlos Castillo; Mounia Lalmas; Ethan Zuckerman
Users interact with online news in many ways, one of them being sharing content through online social networking sites such as Twitter. There is a small but important group of users that devote a substantial amount of effort and care to this activity. These users monitor a large variety of sources on a topic or around a story, carefully select interesting material on this topic, and disseminate it to an interested audience ranging from thousands to millions. These users are news curators, and are the main subject of study of this paper. We adopt the perspective of a journalist or news editor who wants to discover news curators among the audience engaged with a news site.
   We look at the users who shared a news story on Twitter and attempt to identify news curators who may provide more information related to that story. In this paper we describe how to find this specific class of curators, which we refer to as news story curators. Hence, we proceed to compute a set of features for each user, and demonstrate that they can be used to automatically find relevant curators among the audience of two large news organizations.
Towards automatic assessment of the social media impact of news content BIBAFull-Text 871-874
  Tom De Nies; Gerald Haesendonck; Fréderic Godin; Wesley De Neve; Erik Mannens; Rik Van de Walle
In this paper, we investigate the possibilities to estimate the impact the content of a news article has on social media, and in particular on Twitter. We propose an approach that makes use of captured and temporarily stored microposts found in social media, and compares their relevance to an arbitrary news article. These results are used to derive key indicators of the social media impact of the specified content. We describe each step of our approach, provide a first implementation, and discuss the most imminent challenges and discussion points.
Verifying news on the social web: challenges and prospects BIBAFull-Text 875-878
  Steve Schifferes; Nic Newman
The problem of verification is the key issue for journalists who use social media. This paper argues for the importance of a user-centered approach in finding solutions to this problem. Because journalists have different needs for different types of stories, there is no one magic bullet that can verify social media. Any tool will need to have a multi-faceted approach to the problem, and will have to be adjustable to suit the particular needs of individual journalists and news organizations.
Newspaper editors vs the crowd: on the appropriateness of front page news selection BIBAFull-Text 879-880
  Arkaitz Zubiaga
The front page is the showcase that might condition whether one buys a newspaper, and so editors carefully select the news of the day that they believe will attract as many readers as possible. Little is known about the extent to which editors' criteria for front page news selection are appropriate so as to matching the actual interests of the crowd. In this paper, we compare the news stories in The New York Times over the period of a year to their popularity on Twitter and Facebook. Our study questions the current news selection criteria, revealing that while editors focus on picking hard news such as politics for the front page, social media users are rather into soft news such as science and fashion.SOCM'13 welcome and organization

SOCM'13 technical presentations

Social machines: a unified paradigm to describe social web-oriented systems BIBAFull-Text 885-890
  Vanilson Buregio; Silvio Meira; Nelson Rosa
Blending computational and social elements into software has gained significant attention in key conferences and journals. In this context, "Social Machines" appears as a promising model for unifying both computational and social processes. However, it is a fresh topic, with concepts and definitions coming from different research fields, making a unified understanding of the concept a somewhat challenging endeavor. This paper aims to investigate efforts related to this topic and build a preliminary classification scheme to structure the science of Social Machines. We provide a preliminary overview of this research area through the identification of the main visions, concepts, and approaches; we additionally examine the result of the convergence of existing contributions. With the field still in its early stage, we believe that this work can collaborate to the process of providing a more common and coherent conceptual basis for understanding Social Machines as a paradigm. Furthermore, this study helps detect important research issues and gaps in the area.
Crime applications and social machines: crowdsourcing sensitive data BIBAFull-Text 891-896
  Maire Byrne Evans; Kieron O'Hara; Thanassis Tiropanis; Craig Webber
The authors explore some issues with the United Kingdom (U.K.) crime reporting and recording systems which currently produce Open Crime Data. The availability of Open Crime Data seems to create a potential data ecosystem which would encourage crowdsourcing, or the creation of social machines, in order to counter some of these issues. While such solutions are enticing, we suggest that in fact the theoretical solution brings to light fairly compelling problems, which highlight some limitations of crowdsourcing as a means of addressing Berners-Lee's "social constraint." The authors present a thought experiment -- a Gendankenexperiment -- in order to explore the implications, both good and bad, of a social machine in such a sensitive space and suggest a Web Science perspective to pick apart the ramifications of this thought experiment as a theoretical approach to the characterisation of social machines.
Pseudonymity in social machines BIBAFull-Text 897-900
  Ben Dalton
This paper describes the potential of systems in which many people collectively control a single constructed identity mediated by socio-technical networks. By looking to examples of identities that have spontaneously emerged from anonymous communities online, a model for pseudonym design in social machines is proposed. A framework of identity dimensions is presented as a means of exploring the functional types of identity encountered in social machines, and design guidelines are outlined that suggest possible approaches to this task.
Observing social machines part 1: what to observe? BIBAFull-Text 901-904
  David De Roure; Clare Hooper; Megan Meredith-Lobay; Kevin Page; Ségolène Tarte; Don Cruickshank; Catherine De Roure
As a scoping exercise in the design of our Social Machines Observatory we consider the observation of Social Machines "in the wild", as illustrated through two scenarios. More than identifying and classifying individual machines, we argue that we need to study interactions between machines and observe them throughout their lifecycle. We suggest that purpose may be a key notion to help identify individual Social Machines in composed systems, and that mixed observation methods will be required. This exercise provides a basis for later work on how we instrument and observe the ecosystem.
Towards a classification framework for social machines BIBAFull-Text 905-912
  Nigel R. Shadbolt; Daniel A. Smith; Elena Simperl; Max Van Kleek; Yang Yang; Wendy Hall
The state of the art in human interaction with computational systems blurs the line between computations performed by machine logic and algorithms, and those that result from input by humans, arising from their own psychological processes and life experience. Current socio-technical systems, known as "social machines" exploit the large-scale interaction of humans with machines. Interactions that are motivated by numerous goals and purposes including financial gain, charitable aid, and simply for fun. In this paper we explore the landscape of social machines, both past and present, with the aim of defining an initial classificatory framework. Through a number of knowledge elicitation and refinement exercises we have identified the polyarchical relationship between infrastructure, social machines, and large-scale social initiatives. Our initial framework describes classification constructs in the areas of contributions, participants, and motivation. We present an initial characterisation of some of the most popular social machines, as demonstration of the use of the identified constructs. We believe that it is important to undertake an analysis of the behaviour and phenomenology of social machines, and of their growth and evolution over time. Our future work will seek to elicit additional opinions, classifications and validation from a wider audience, to produce a comprehensive framework for the description, analysis and comparison of social machines.
Linked data in crowdsourcing purposive social network BIBAFull-Text 913-918
  Priyanka Singh; Nigel Shadbolt
Internet is an easy medium for people to collaborate and crowdsourcing is an efficient feature of social web where people with common interest and expertise come together to solve specific problems by collective thinking and create a community. It can also be used to filter out important information from large data, remove spams, and gamification techniques are used to reward the users for their contribution and keep a sustainable environment for the growth of the community. Semantic web technologies can be used to structure the community data so it can be combined, decentralized and be used across platform. Using such tools knowledge can be enhanced and easily discovered and merged together. This paper discusses the concept of a purposive social network where people with similar interest and varied expertise come together, use crowdsourcing technique to solve a common problem and build tools for common purpose. The StackOverflow website is chosen to study the purposive network, different network ties and roles of user is studied. Linked Data is used for name disambiguation of keywords and topics for easier search and discovery of experts in a field and provide useful information that is otherwise unavailable in the website.
A few thoughts on engineering social machines: extended abstract BIBAFull-Text 919-920
  Markus Strohmaier
Social machines are integrated systems of people and computers. What distinguishes social machines from other types of software systems -- such as software for cars or air planes -- is the unprecedented involvement of data about user behavior, -goals and -motivations into the software system's structure. In social machines, the interaction between a user and the system is mediated by the aggregation of explicit or implicit data from other users. This is the case with systems where, for example, user data is used to suggest search terms (e.g. Google Autosuggest), to recommend products (e.g. Amazon recommendations), to aid navigation (e.g. tag-based navigation) or to filter content (e.g. Digg.com). This makes social machines a novel class of software systems (as opposed to for example safety-related software that is being used in cars) and unique in a sense that potentially essential system properties and functions -- such as navigability -- are dynamically influenced by aggregate user behavior. Such properties can not be satisfied through the implementation of requirements alone, what is needed is regulation, i.e. a dynamic integration of users' goals and behavior into the continuous process of engineering.
   Functional and non-functional properties of software systems have been the subject of software engineering research for decades [1]. The notion of non-functional requirements (softgoals) captures a recognition by the software engineering community that software requirements can be subjective and interdependent, they can lack a clear-cut success criteria, exhibit different priorities and can require decomposition or operationalization. Resulting approaches to analyzing and designing software systems emphasize the role of users (or more general: agents) in this process (such as [1]). i* for example has been used to capture and represent user goals during system design and run time.
   With the emergence of social machines, such as the WWW, and social-focussed applications running on top of the web, such as facebook.com, delicious.com and others, social machines and their emergent properties have become a crucial infrastructure for many aspects of our daily lives. To give an example: the navigability of the web depends on the behavior of web editors who are interlinking documents, or the usefulness of tags for classification depends on the tagging behavior of users [2]. The rise of social machines can be expected to fundamentally change the way in which such properties and functions of software systems are designed and maintained. Rather than planning for certain system properties (such as navigability, usefulness for certain tasks) and functions at design time, the task of engineers is to build a platform which allows to influence and regulate emergent user behavior in such a way that desired system attributes are achieved at run time. It is through the process of social computation, i.e. the combination of social behavior and algorithmic computation, that desired system properties and functions emerge.
   For a science of social machines, specifically understanding the relationship between individual and social behavior on one hand, and desired system properties and functions on the other is crucial. In order to maintain control, research must focus on understanding a wide variety of social machine properties such as semantic, intentional and navigational properties across different systems and applications including -- but not limited to -- social media. Summarizing, the full implications of the genesis of social machines for related domains including software engineering, knowledge acquisition or peer production systems are far from being well understood, and warrant future work. For example, the interactions between the pragmatics of such systems (how they are used) and the semantics emerging in those systems (what the words, symbols, etc mean) is a fundamental issue that deserves greater attention. Equipping engineers of social machines with the right tools to achieve and maintain desirable system properties is a problem of practical relevance that needs to be addressed by future research.
The HTP model: understanding the development of social machines BIBAFull-Text 921-926
  Ramine Tinati; Leslie Carr; Susan Halford; Catherine J. Pope
The Web represents a collection of socio-technical activities inter-operating using a set of common protocols and standards. Online banking, web TV, internet shopping, e-government and social networking are all different kinds of human interaction that have recently leveraged the capabilities of the Web architecture. Activities that have human and computer components are referred to as social machines. This paper introduces HTP, a socio-technical model to understand, describe and analyze the formation and development of social machines and other web activities. HTP comprises three components: heterogeneous networks of actors involved in a social machine; the iterative process of translation of the actors' activities into a temporarily stable and sustainable social machine; and the different phases of this machine's adaptation from one stable state to another as the surrounding networks restructure and global agendas ebb and flow. The HTP components are drawn from an interdisciplinary range of theoretical positions and concepts. HTP provides an analytical framework to explain why different Web activities remain stable and functional, whilst others fail. We illustrate the use of HTP by examining the formation of a classic social machine (Wikipedia), and the stabilization points corresponding to its different phases of development.
"the crowd keeps me in shape": social psychology and the present and future of health social machines BIBAFull-Text 927-932
  Max Van Kleek; Daniel A. Smith; Wendy Hall; Nigel Shadbolt
Can the Web help people live healthier lives? This paper seeks to answer this question through an examination of sites, apps and online communities designed to help people improve their fitness, better manage their disease(s) and conditions, and to solve the often elusive connections between the symptoms they experience, diseases and treatments. These health social machines employ a combination of both simple and complex social and computational processes to provide such support. We first provide a descriptive classification of the kinds of machines currently available, and the support each class offers. We then describe the limitations exhibited by these systems and potential ways around them, towards the design of more effective machines in the future.SRS'13 welcome and organization

SRS'13 keynote talks

How status and reputation shape human evaluations: consequences for recommender systems BIBAFull-Text 937-938
  Jure Leskovec
Recommender systems are inherently driven by evaluations and reviews provided by the users of these systems. Understanding ways in which users form judgments and produce evaluations can provide insights for modern recommendation systems. Many online social applications include mechanisms for users to express evaluations of one another, or of the content they create. In a variety of domains, mechanisms for evaluation allow one user to say whether he or she trusts another user, or likes the content they produced, or wants to confer special levels of authority or responsibility on them. We investigate a number of fundamental ways in which user and item characteristics affect the evaluations in online settings. For example, evaluations are not unidimensional but include multiple aspects that all together contribute to user's overall rating. We investigate methods for modeling attitudes and attributes from online reviews that help us better understand user's individual preferences. We also examine how to create a composite description of evaluations that accurately reflects some type of cumulative opinion of a community. Natural applications of these investigations include predicting the evaluation outcomes based on user characteristics and to estimate the chance of a favorable overall evaluation from a group knowing only the attributes of the group's members, but not their expressed opinions.
Large-scale social recommender systems: challenges and opportunities BIBAFull-Text 939-940
  Mitul Tiwari
Online social networks have become very important for networking, communication, sharing, and content discovery. Recommender systems play a significant role on any online social network for engaging members, recruiting new members, and recommending other members to connect with. This talk presents challenges in recommender systems, graph analysis, social stream relevance and virality on a large-scale social networks such as LinkedIn, the largest professional network with more than 200M members.
   First, social recommender systems for recommending jobs, groups, companies to follow, other members to connect with, are very important part of a professional network like LinkedIn [1, 6, 7, 9]. Each one of these entity recommender systems present novel challenges to use social and member generated data. Second, various problems, such as, link prediction, visualizing connection network, finding the strength of each connection, and the best path among members, require large-scale social graph analysis, and present unique research opportunities [2, 5]. Third, social stream relevance and capturing virality in social products are crucial for engaging users on any online social network [4]. Final, systems challenges must be addressed in scaling recommender systems on a large-scale social networks [3, 8, 10]. This talk presents challenges and interesting problems in large-scale social recommender systems, and describes some of the solutions.

SRS'13 technical presentations

Signal-based user recommendation on Twitter BIBAFull-Text 941-944
  Giuliano Arru; Davide Feltoni Gurini; Fabio Gasparetti; Alessandro Micarelli; Giuseppe Sansonetti
In recent years, social networks have become one of the best ways to access information. The ease with which users connect to each other and the opportunity provided by Twitter and other social tools in order to follow person activities are increasing the use of such platforms for gathering information. The amount of available digital data is the core of the new challenges we now face. Social recommender systems can suggest both relevant content and users with common social interests. Our approach relies on a signal-based model, which explicitly includes a time dimension in the representation of the user interests. Specifically, this model takes advantage of a signal processing technique, namely, the wavelet transform, for defining an efficient pattern-based similarity function among users. Experimental comparisons with other approaches show the benefits of the proposed approach.
Generation of coalition structures to provide proper groups' formation in group recommender systems BIBAFull-Text 945-950
  Lucas Augusto M. C. Carvalho; Hendrik T. Macedo
Group recommender systems usually provide recommendations to a fixed and predetermined set of members. In some situations, however, there is a set of people (N) that should be organized into smaller and cohesive groups, so it is possible to provide more effective recommendations to each of them. This is not a trivial task. In this paper we propose an innovative approach for grouping people within the recommendation problem context. The problem is modeled as a coalitional game from Game Theory. The goal is to group people into exhaustive and disjoint coalitions so as to maximize the social welfare function of the group. The optimal coalition structure is that with highest summation over all social welfare values. Similarities between recommendation system users are used to define the social welfare function. We compare our approach with K-Means clustering for a dataset from Movielens. Results have shown that the proposed approach performs better than K-Means for both average group satisfaction and Davies-Bouldin index metrics when the number of coalitions found is not greater than 4 (K N = 12).
Users' satisfaction in recommendation systems for groups: an approach based on noncooperative games BIBAFull-Text 951-958
  Lucas Augusto Montalvão Costa Carvalho; Hendrik Teixeira Macedo
A major difficulty in a recommendation system for groups is to use a group aggregation strategy to ensure, among other things, the maximization of the average satisfaction of group members. This paper presents an approach based on the theory of noncooperative games to solve this problem. While group members can be seen as game players, the items for potential recommendation for the group comprise the set of possible actions. Achieving group satisfaction as a whole becomes, then, a problem of finding the Nash equilibrium. Experiments with a MovieLens dataset and a function of arithmetic mean to compute the prediction of group satisfaction for the generated recommendation have shown statistically significant results when compared to state-of-the-art aggregation strategies, in particular, when evaluation among group members are more heterogeneous. The feasibility of this unique approach is shown by the development of an application for Facebook, which recommends movies to groups of friends.
Recommending collaborators using keywords BIBAFull-Text 959-962
  Sara Cohen; Lior Ebel
This paper studies the problem of recommending collaborators in a social network, given a set of keywords. Formally, given a query q, consisting of a researcher s (who is a member of a social network) and a set of keywords k (e.g., an article name or topic of future work), the collaborator recommendation problem is to return a high-quality ranked list of possible collaborators for s on the topic k. Extensive effort was expended to define ranking functions that take into consideration a variety of properties, including structural proximity to s, textual relevance to k, and importance. The effectiveness of our methods have been experimentally proven over two large subsets of the social network determined by DBLP co-authorship data. The results show that the ranking methods developed in this paper work well in practice.
A recommender system for job seeking and recruiting website BIBAFull-Text 963-966
  Yao Lu; Sandy El Helou; Denis Gillet
In this paper, a hybrid recommender system for job seeking and recruiting websites is presented. The various interaction features designed on the website help the users organize the resources they need as well as express their interest. The hybrid recommender system exploits the job and user profiles and the actions undertaken by users in order to generate personalized recommendations of candidates and jobs. The data collected from the website is modeled using a directed, weighted, and multi-relational graph, and the 3A ranking algorithm is exploited to rank items according to their relevance to the target user. A preliminary evaluation is conducted based on simulated data and production data from a job hunting website in Switzerland.
Weighted slope one predictors revisited BIBAFull-Text 967-972
  Danilo Menezes; Anisio Lacerda; Leila Silva; Adriano Veloso; Nivio Ziviani
Recommender systems are used to help people in specific life choices, like what items to buy, what news to read or what movies to watch. A relevant work in this context is the Slope One algorithm, which is based on the concept of differential popularity between items (i.e., how much better one item is liked than another). This paper proposes new approaches to extend Slope One based predictors for collaborative filtering, in which the predictions are weighted based on the number of users that co-rated items. We propose to improve collaborative filtering by exploiting the web of trust concept, as well as an item utility measure based on the error of predictions based on specific items to specific users. We performed experiments using three application scenarios, namely Movielens, Epinions, and Flixter. Our results demonstrate that, in most cases, exploiting the web of trust is benefitial to prediction performance, and improvements are reported when comparing the proposed approaches against the original Weighted Slope One algorithm.
Profile diversity in search and recommendation BIBAFull-Text 973-980
  Maximilien Servajean; Esther Pacitti; Sihem Amer-Yahia; Pascal Neveu
We investigate profile diversity, a novel idea in searching scientific documents. Combining keyword relevance with popularity in a scoring function has been the subject of different forms of social relevance [2, 6, 9]. Content diversity has been thoroughly studied in search and advertising [4, 11], database queries [16, 5, 8], and recommendations [17, 10, 18]. We believe our work is the first to investigate profile diversity to address the problem of returning highly popular but too-focused documents. We show how to adapt Fagin's threshold-based algorithms to return the most relevant and most popular documents that satisfy content and profile diversities and run preliminary experiments on two benchmarks to validate our scoring function.
Does social contact matter?: modelling the hidden web of trust underlying Twitter BIBAFull-Text 981-988
  Mozhgan Tavakolifard; Kevin C. Almeroth; Jon Atle Gulla
Social recommender systems aim to alleviate the information overload problem on social network sites. The social network structure is often an important input to these recommender systems. Typically, this structure cannot be inferred directly from declared relationships among users. The goal of our work is to extract an underlying hidden and sparse network which more strongly represents the actual interactions among users. We study how to leverage Twitter activities like micro-blogging and the network structure to find a simple, efficient, but accurate method to infer and expand this hidden network. We measure and compare the performance of several different modeling strategies using a crawled data set from Twitter. Our results reveal that the structural similarity in the network generated by users' retweeting behavior outweighs the other discussed methods.
Understanding user spatial behaviors for location-based recommendations BIBAFull-Text 989-992
  Jun Zhang; Chun-yuen Teng; Yan Qu
In this paper, we introduce a network-based method to study user spatial behaviors based on check-in histories. The results of this study have direct implications for location-based recommendation systems.SWDM'13 welcome and organization

SWDM'13 keynote

Disasters response using social life networks BIBAFull-Text 997-998
  Ramesh C. Jain
Connecting people to required resources efficiently, effectively and promptly is one of the most important challenges for our society. Disasters make it the challenge for life and death. During disasters many normal sources of information to assess situations as well as distributing vital information to individuals break down. Unfortunately, during disastrous situations, most current practices are forced to follow bureaucratic processes and procedures that may delay help in critical life and death moments. Social media brings together different media as well as modes of distribution -- focused, narrowcast, and broadcast -- and has revolutionized communication among people. Mobile phones, equipped with myriads of sensors are bringing the next generation of social networks not only to connect people with other people, but also to connect people with other people and essential life resources based on the disaster situation and personal context. We believe that such Social Life Networks (SLN) may play very important role for solving some essential human problems, including providing vital help to people during disasters. We will present early design of such systems and use a few examples of such systems explored in our group during disasters. Focused Micro Blogs (FMBs) will be discussed as an alternative to less noisy and more direct versions of current microblogs, such as Tweets and Status Updates. An important part of our discussion will be to list challenges and opportunities in this area.

SWDM'13 twitter in action

A sensitive Twitter earthquake detector BIBAFull-Text 999-1002
  Bella Robinson; Robert Power; Mark Cameron
This paper describes early work at developing an earthquake detector for Australia and New Zealand using Twitter. The system is based on the Emergency Situation Awareness (ESA) platform which provides all-hazard information captured, filtered and analysed from Twitter. The detector sends email notifications of evidence of earthquakes from Tweets to the Joint Australian Tsunami Warning Centre.
   The earthquake detector uses the ESA platform to monitor Tweets and checks for specific earthquake related alerts. The Tweets that contribute to an alert are then examined to determine their locations: when the Tweets are identified as being geographically close and the retweet percentage is low an email notification is generated.
   The earthquake detector has been in operation since December 2012 with 31 notifications generated where 17 corresponded with real, although minor, earthquake events. The remaining 14 were a result of discussions about earthquakes but not prompted by an event. A simple modification to our algorithm results in 20 notifications identifying the same 17 real events and reducing the false positives to 3. Our detector is sensitive in that it can generate alerts from only a few Tweets when they are determined to be geographically close.
Text vs. images: on the viability of social media to assess earthquake damage BIBAFull-Text 1003-1006
  Yuan Liang; James Caverlee; John Mander
In this paper, we investigate the potential of social media to provide rapid insights into the location and extent of damage associated with two recent earthquakes -- the 2011 Tohoku earthquake in Japan and the 2011 Christchurch earthquake in New Zealand. Concretely, we (i) assess and model the spatial coverage of social media; and (ii) study the density and dynamics of social media in the aftermath of these two earthquakes. We examine the difference between text tweets and media tweets (containing links to images and videos), and investigate tweet density, re-tweet density, and user tweeting count to estimate the epicenter and to model the intensity attenuation of each earthquake. We find that media tweets provide more valuable location information, and that the relationship between social media activity vs. loss/damage attenuation suggests that social media following a catastrophic event can provide rapid insight into the extent of damage.
Comparing web feeds and tweets for emergency management BIBAFull-Text 1007-1010
  Robert Power; Bella Robinson; Catherine Wise
This paper describes ongoing work with the Australian Government to assemble information from a collection of web feeds describing emergency incidents of interest for emergency managers. The developed system, the Emergency Response Intelligence Capability (ERIC) tool, has been used to gather information about emergency events during the Australian summer of 2012/13. The web feeds are an authoritative source of structured information summarising incidents that includes links to emergency services web sites containing further details about the events underway.
   The intelligence obtained using ERIC for a specific fire event has been compared with information that was available in Twitter using the Emergency Situation Awareness (ESA) platform. This information would have been useful as a new source of intelligence: it was reported faster than via the web feed, contained more specific event information, included details of impact to the community, was updated more frequently, included information from the public and remains available as a source of information long after the web feed contents have been removed.

SWDM'13 keynote 2

Leveraging on social media to support the global building resilient cities campaign BIBAFull-Text 1011-1012
  David Stevens
This paper presents a summary of the main points put forward during the presentation delivered at the 2nd International Workshop on Social Web for Disaster Management which was held in conjunction with WWW 2013 on May 14th 2013 in Rio de Janeiro, Brazil.

SWDM'13 insights from social web

Location-based insights from the social web BIBAFull-Text 1013-1016
  Yohei Ikawa; Maja Vukovic; Jakob Rogstadius; Akiko Murakami
Citizens, news reporters, relief organizations, and governments are increasingly relying on the Social Web to report on and respond to disasters as they occur. The capability to rapidly react to important events, which can be identified from high-volume streams even when the sources are unknown, still requires precise localization of the events and verification of the reports. In this paper, we propose a framework for classifying location elements and a method for their extraction from Social Web data. We describe the framework in the context of existing Social Web systems used for disaster management. We present a new location-inferencing architecture and evaluate its performance with a data set from a real-world disaster.
Location extraction from disaster-related microblogs BIBAFull-Text 1017-1020
  John Lingad; Sarvnaz Karimi; Jie Yin
Location information is critical to understanding the impact of a disaster, including where the damage is, where people need assistance and where help is available. We investigate the feasibility of applying Named Entity Recognizers to extract locations from microblogs, at the level of both geo-location and point-of-interest. Our experimental results show that such tools once retrained on microblog data have great potential to detect the where information, even at the granularity of point-of-interest.
Practical extraction of disaster-relevant information from social media BIBAFull-Text 1021-1024
  Muhammad Imran; Shady Elbassuoni; Carlos Castillo; Fernando Diaz; Patrick Meier
During times of disasters online users generate a significant amount of data, some of which are extremely valuable for relief efforts. In this paper, we study the nature of social-media content generated during two different natural disasters. We also train a model based on conditional random fields to extract valuable information from such content. We evaluate our techniques over our two datasets through a set of carefully designed experiments. We also test our methods over a non-disaster dataset to show that our extraction model is useful for extracting information from socially-generated content in general.
Information sharing on Twitter during the 2011 catastrophic earthquake BIBAFull-Text 1025-1028
  Fujio Toriumi; Takeshi Sakaki; Kosuke Shinoda; Kazuhiro Kazama; Satoshi Kurihara; Itsuki Noda
Such large disasters as earthquakes and hurricanes are very unpredictable. During a disaster, we must collect information to save lives. However, in time disaster, it is difficult to collect information which is useful for ourselves from such traditional mass media as TV and newspapers that contain information for the general public. Social media attract attention for sharing information, especially Twitter, which is a hugely popular social medium that is now being used during disasters. In this paper, we focus on the information sharing behaviors on Twitter during disasters. We collected data before and during the Great East Japan Earthquake and arrived at the following conclusions: Many users with little experience with such specific functions as reply and retweet did not continuously use them after the disaster. Retweets were well used to share information on Twitter. Retweets were used not only for sharing the information provided by general users but used for relaying the information from the mass media.
   We conclude that social media users changed their behavior to widely diffuse important information and decreased non-emergency tweets to avoid interrupting critical information.
Information verification during natural disasters BIBAFull-Text 1029-1032
  Abdulfatai Popoola; Dmytro Krasnoshtan; Attila-Peter Toth; Victor Naroditskiy; Carlos Castillo; Patrick Meier; Iyad Rahwan
Large amounts of unverified and at times contradictory information often appear on social media following natural disasters. Timely verification of this information can be crucial to saving lives and for coordinating relief efforts. Our goal is to enable this verification by developing an online platform that involves ordinary citizens in the evidence gathering and evaluation process. The output of this platform will provide reliable information to humanitarian organizations, journalists, and decision makers involved in relief efforts.TempWeb'13 welcome and organization

TEMPWEB'13 keynote talk

Timelines as summaries of popular scheduled events BIBAFull-Text 1037-1044
  Omar Alonso; Kyle Shiells
Known events that are scheduled in advance, such as popular sports games, usually get a lot of attention from the public. Communications media like TV, radio, and newspapers will report the salient aspects of such events live or post-hoc for general consumption. However, certain actions, facts, and opinions would likely be omitted from those objective summaries. Our approach is to construct a particular game's timeline in such a way that it can be used as a quick summary of the main events that happened along with popular subjective and opinionated items that the public inject. Peaks in the volume of posts discussing the event reflect both objectively recognizable events in the game -- in the sports example, a change in score -- and subjective events such as a referee making a call fans disagree with. In this work, we introduce a novel timeline design that captures a more complete story of the event by placing the volume of Twitter posts alongside keywords that are driving the additional traffic. We demonstrate our approach using events of major international social impact from the World Cup 2010 and evaluate against professional liveblog coverage of the same events.

TEMPWEB'13 web archiving

A survey of web archive search architectures BIBAFull-Text 1045-1050
  Miguel Costa; Daniel Gomes; Francisco Couto; Mário Silva
Web archives already hold more than 282 billion documents and users demand full-text search to explore this historical information. This survey provides an overview of web archive search architectures designed for time-travel search, i.e. full-text search on the web within a user-specified time interval. Performance, scalability and ease of management are important aspects to take in consideration when choosing a system architecture. We compare these aspects and initialize the discussion of which search architecture is more suitable for a large-scale web archive.
Archival HTTP redirection retrieval policies BIBAFull-Text 1051-1058
  Ahmed AlSum; Michael L. Nelson; Robert Sanderson; Herbert Van de Sompel
When retrieving archived copies of web resources (mementos) from web archives, the original resource's URI-R is typically used as the lookup key in the web archive. This is straightforward until the resource on the live web issues a redirect: R ->R. Then it is not clear if R or R should be used as the lookup key to the web archive. In this paper, we report on a quantitative study to evaluate a set of policies to help the client discover the correct memento when faced with redirection. We studied the stability of 10,000 resources and found that 48% of the sample URIs tested were not stable, with respect to their status and redirection location. 27% of the resources were not perfectly reliable in terms of the number of mementos of successful responses over the total number of mementos, and 2% had a reliability score of less than 0.5. We tested two retrieval policies. The first policy covered the resources which currently issue redirects and successfully resolved 17 out of 77 URIs that did not have mementos of the original URI, but did of the resource that was being redirected to. The second policy covered archived copies with HTTP redirection and helped the client in 58% of the cases tested to discover the nearest memento to the requested datetime.
Creating a billion-scale searchable web archive BIBAFull-Text 1059-1066
  Daniel Gomes; Miguel Costa; David Cruz; João Miranda; Simão Fontes
Web information is ephemeral. Several organizations around the world are struggling to archive information from the web before it vanishes. However, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. It supports search over 1.2 billion files archived from the web since 1996. This study contributes with an overview of the lessons learned while developing the Portuguese Web Archive, focusing on web data acquisition, ranking search results and user interface design. The developed software is freely available as an open source project. We believe that sharing our experience obtained while developing and operating a running service will enable other organizations to start or improve their web archives.

TEMPWEB'13 identifying and leveraging time information

Discovering temporal hidden contexts in web sessions for user trail prediction BIBAFull-Text 1067-1074
  Julia Kiseleva; Hoang Thanh Lam; Mykola Pechenizkiy; Toon Calders
In many web information systems such as e-shops and information portals, predictive modeling is used to understand user's intentions based on their browsing behaviour. User behavior is inherently sensitive to various hidden contexts. It has been shown in different experimental studies that exploitation of contextual information can help in improving prediction performance significantly. It is reasonable to assume that users may change their intents during one web session and that changes are influenced by some external factors such as switch in temporal context e.g. 'users want to find information about a specific product' and after a while 'they want to buy this product'. A web session can be represented as a sequence of user's actions where actions are ordered by time. The generation of a web session might be influenced by several hidden temporal contexts. Each session can be represented as a concatenation of independent segments, each of which is influenced by one corresponding context. We show how to learn how to apply different predictive models for each segment in this work. We define the problem of discovering temporal hidden contexts in such way that we optimize directly the accuracy of predictive models (e.g. users' trails prediction) during the process of context acquisition. Our empirical study on a real dataset demonstrates the effectiveness of our method.
Carbon dating the web: estimating the age of web resources BIBAFull-Text 1075-1082
  Hany M. SalahEldeen; Michael L. Nelson
In the course of web research it is often necessary to estimate the creation datetime for web resources (in the general case, this value can only be estimated). While it is feasible to manually establish likely datetime values for small numbers of resources, this becomes infeasible if the collection is large. We present "carbon date", a simple web application that estimates the creation date for a URI by polling a number of sources of evidence and returning a machine-readable structure with their respective values. To establish a likely datetime, we poll bitly for the first time someone shortened the URI, topsy for the first time someone tweeted the URI, a Memento aggregator for the first time it appeared in a public web archive, Google's time of last crawl, and the Last-Modified HTTP response header of the resource itself. We also examine the backlinks of the URI as reported by Google and apply the same techniques for the resources that link to the URI. We evaluated our tool on a gold standard data set of 1200 URIs in which the creation date was manually verified. We were able to estimate a creation date for 75.90% of the resources, with 32.78% having the correct value. Given the different nature of the URIs, the union of the various methods produces the best results. While the Google last crawl date and topsy account for nearly 66% of the closest answers, eliminating the web archives or Last-Modified from the results produces the largest overall negative impact on the results. The carbon date application is available for download or use via a web API.
Stuff happens continuously: exploring web contents with temporal information BIBAFull-Text 1083-1084
  Omar Alonso
In the last few years there has been an increased interest from researchers and practitioners in exploring time as a dimension that can benefit several information retrieval tasks. There is exciting work in analyzing and exploiting temporal information embedded in documents as relevance cues for the presentation, organization, and the exploration of search results in a temporal context.
   Most of the current approaches focus on leveraging the temporal information available in document sources like web pages or news articles. However, the Web keeps evolving beyond simple web pages and new information sources and services are adopted very rapidly. For example, the incredible amount of content that is generated by users in social networks offers another aspect to examine how people produce and consume content over time.
   We review the current activities centered on identifying and extracting time information from document collections and the applications to the information seeking process. We outline the potential of new sources for studying temporal information by presenting new problems. Finally, we discuss a number of scenarios where a temporal perspective can provide insights when exploring Web contents.

TEMPWEB'13 studies and experience sharing

Characterizing video access patterns in mainstream media portals BIBAFull-Text 1085-1092
  Lucas C. O. Miranda; Rodrygo L. T. Santos; Alberto H. F. Laender
Watching online videos is part of the daily routine of a considerable fraction of Internet users nowadays. Understanding the patterns of access to these videos is paramount for improving the capacity planning for video providers, the conversion rate for advertisers, and the relevance of the whole online video watching experience for end users. While much research has been conducted to analyze video access patterns in user-generated content (UGC), little is known of how such patterns manifest in mainstream media (MSM) portals. In this paper, we perform the first large-scale analysis of video access patterns in MSM portals. As a case study, we analyze interaction logs across a total of 38 Brazilian MSM portals, including six of the largest portals in the country, over a period of eight weeks. Our analysis reveals interesting static and temporal video access patterns in MSM portals, which we compare and contrast to the access patterns reported for UGC websites. Overall, our analysis provides several insights for an improved understanding of video access on the Internet beyond UGC websites.
Adaptive crowdsourcing for temporal crowds BIBAFull-Text 1093-1100
  L. Elisa Celis; Koustuv Dasgupta; Vaibhav Rajan
Crowdsourcing is rapidly emerging as a computing paradigm that can employ the collective intelligence of a distributed human population to solve a wide variety of tasks. However, unlike organizational environments where workers have set work hours, known skill sets and performance indicators that can be monitored and controlled, most crowdsourcing platforms leverage the capabilities of fleeting workers who exhibit changing work patterns, expertise, and quality of work. Consequently, platforms exhibit significant variability in terms of performance characteristics (like response time, accuracy, and completion rate). While this variability has been folklore in the crowdsourcing community, we are the first to show data that displays this kind of changing behavior. Notably, these changes are not due to a distribution with high variance; rather, the distribution itself is changing over time.
   Deciding which platform is most suitable given the requirements of a task is of critical importance in order to optimize performance; further, making the decision(s) adaptively to accommodate the dynamically changing crowd characteristics is a problem that has largely been ignored. In this paper, we address the changing crowds problem and, specifically, propose a multi-armed bandit based framework. We introduce the simple epsilon-smart algorithm that performs robustly. Counterfactual results based on real-life data from two popular crowd platforms demonstrate the efficacy of the proposed approach. Further simulations using a random-walk model for crowd performance demonstrate its scalability and adaptability to more general scenarios.
A survey of temporal web search experience BIBAFull-Text 1101-1108
  Hideo Joho; Adam Jatowt; Blanco Roi
Temporal aspects of web search have gained a great level of attention in the recent years. However, many of the research attempts either focused on technical development of various tools or behavioral analysis based on log data. This paper presents the results of user survey carried out to investigate the practice and experience of temporal web search. A total of 110 people was recruited and answered 18 questions regarding their recent experience of web search. Our results suggest that an interplay of seasonal interests, technicality of information needs, target time of information, re-finding behaviour, and freshness of information can be important factors for the application of temporal search. These findings should be complementary to log analyses for further development of temporally aware search engines.WebQuality'13 welcome and organization

WEBQUALITY'13 keynote talk

Measuring web quality BIBAFull-Text 1113-1114
  Ricardo Baeza-Yates
Measuring the quality of web content, either at page level or website level, is at the heart of several key challenges in the Web. Without doubt, the main one is web search, to be able to rank results. However, there are other important problems such as web reputation or trust, and web spam detection and filtering. However, measuring intrinsic web quality is a hard problem, because of our limited (automatic) understanding of text semantics, which is even worse for other media. Hence, similarly to human trust assessing, where we use past actions, face expressions, body language, etc; in the Web we need to use indirect signals that serve as surrogates for web quality. In this keynote we attempt to present the most important signals as well as new signals that are or can be used to measure quality in the Web. We divide them using the traditional web content, structure, and usage trilogy. We also characterize them according to how easy is to measure these signals, who can measure them, and how well they scale to the whole Web.

WEBQUALITY'13 web content quality session

Defending imitating attacks in web credibility evaluation systems BIBAFull-Text 1115-1122
  Xin Liu; Radoslaw Nielek; Adam Wierzbicki; Karl Aberer
Unlike traditional media such as television and newspapers, web contents are relatively easy to be published without being rigorously fact-checked. This seriously influences people's daily life if non-credible web contents are utilized for decision making. Recently, web credibility evaluation systems have emerged where web credibility is derived by aggregating ratings from the community (e.g., MyWOT). In this paper, We focus on the robustness of such systems by identifying a new type of attack scenario where an attacker imitates the behavior of trustworthy experts by copying system's credibility ratings to quickly build high reputation and then attack certain web contents. In order to defend this attack, we propose a two-stage defence algorithm. At stage 1, our algorithm applies supervised learning algorithm to predict the credibility of a web content and compare it with a user's rating to estimate whether this user is malicious or not. In case the user's maliciousness can not be determined with high confidence, the algorithm goes to stage 2 where we investigate users' past rating patterns and detect the malicious one by applying hierarchical clustering algorithm. Evaluation using real datasets demonstrates the efficacy of our approach.
Trustworthiness criteria for supporting users to assess the credibility of web information BIBAFull-Text 1123-1130
  Jarutas Pattanaphanchai; Kieron O'Hara; Wendy Hall
Assessing the quality of information on the Web is a challenging issue for at least two reasons. First, as a decentralized data publishing platform in which anyone can share nearly anything, the Web has no inherent quality control mechanisms to ensure that content published is valid, legitimate, or even just interesting. Second, when assessing the trustworthiness of web pages, users tend to base their judgments upon descriptive criteria such as the visual presentation of the website rather than more robust normative criteria such as the author's reputation and the source's review process. As a result, Web users are liable to make incorrect assessments, particularly when making quick judgments on a large scale. Therefore, Web users need credibility criteria and tools to help them assess the trustworthiness of Web information in order to place trust in it. In this paper, we investigate the criteria that can be used to collect supportive data about a piece of information in order to improve a person's ability to quickly judge the trustworthiness of the information. We propose the normative trustworthiness criteria namely, authority, currency, accuracy and relevance which can be used to support users' assessments of the trustworthiness of Web information. In addition, we validate these criteria using an expert panel. The results show that the proposed criteria are helpful. Moreover, we obtain weighting scores for criteria which can be used to calculate the trustworthiness of information and suggest a piece of information that is more likely to be trustworthy to Web users.
On the subjectivity and bias of web content credibility evaluations BIBAFull-Text 1131-1136
  Michal Kakol; Michal Jankowski-Lorek; Katarzyna Abramczuk; Adam Wierzbicki; Michele Catasta
In this paper we describe the initial outcomes of the Reconcile1 study concerning Web content credibility evaluations. The study was run with a balanced sample of 1503 respondents who independently evaluated 154 web pages from several thematic categories. Users taking part in the study not only evaluated credibility, but also filled a questionnaire covering additional respondents' traits. Using the gathered information about socio-economic status and psychological features of the users, we studied the influence of subjectivity and bias in the credibility ratings. Subjectivity and bias, in fact, represent a key design issue for Web Credibility systems, to the extent that they could jeopardize the system performance if not taken into account.
   We found out that evaluations of Web content credibility are slightly subjective. On the other hand, the evaluations exhibit a strong acquiescence bias.

WEBQUALITY'13 industry experience session

Russian web spam evolution: yandex experience BIBAFull-Text 1137-1140
  Sergey Pevtsov; Sergey Volkov
Web spam has a negative impact on the search quality and users' satisfaction and forces search engines to waste resources to crawl, index, and rank it. Thus search engines are compelled to make significant efforts in order to fight web spam. Traffic from search engines plays a great role in online economics. It causes a tough competition for high positions in search results and increases the motivation of spammers to invent new spam techniques. At the same time, ranking algorithms become more complicated, as well as web spam detection methods. So, web spam constantly evolves which makes the problem of web spam detection always relevant and challenging.
   As the most popular search engine in Russia Yandex faces the problem of web spam and has some expertise in this matter. This article describes our experience in detection different types of web spam based on content, links, clicks, and user behavior. We also review aggressive advertising and fraud because they affect the user experience. Besides, we demonstrate the connection between classic web spam and modern social engineering approaches in fraud.
Graph-based malware distributors detection BIBAFull-Text 1141-1144
  Andrei Venzhega; Polina Zhinalieva; Nikolay Suboch
Search engines are currently facing a problem of websites that distribute malware. In this paper we present a novel efficient algorithm that learns to detect such kind of spam. We have used a bipartite graph with two types of nodes, each representing a layer in the graph: web-sites and file hostings (FH), connected with edges representing the fact that a file can be downloaded from the hosting via a link on the web-site. The performance of this spam detection method was verified using two set of ground truth labels: manual assessments of antivirus analysts and automatically generated assessments obtained from antivirus companies. We demonstrate that the proposed method is able to detect new types of malware even before the best known antivirus solutions are able to detect them.
Quality-biased ranking for queries with commercial intent BIBAFull-Text 1145-1148
  Alexander Shishkin; Polina Zhinalieva; Kirill Nikolaev
Modern search engines are good enough to answer popular commercial queries with mainly highly relevant documents. However, our experiments show that users behavior on such relevant commercial sites may differ from one to another web-site with the same relevance label. Thus search engines face the challenge of ranking results that are equally relevant from the perspective of the traditional relevance grading approach. To solve this problem we propose to consider additional facets of relevance, such as trustability, usability, design quality and the quality of service. In order to let a ranking algorithm take these facets in account, we proposed a number of features, capturing the quality of a web page along the proposed dimensions. We aggregated new facets into the single label, commercial relevance, that represents cumulative quality of the site. We extrapolated commercial relevance labels for the entire learning-to-rank dataset and used weighted sum of commercial and topical relevance instead of default relevance labels. For evaluating our method we created new DCG-like metrics and conducted off-line evaluation as well as on-line interleaving experiments demonstrating that a ranking algorithm taking the proposed facets of relevance into account is better aligned with user preferences.

WEBQUALITY'13 web spam detection session

Cross-lingual web spam classification BIBAFull-Text 1149-1156
  András Garzó; Bálint Daróczy; Tamás Kiss; Dávid Siklósi; András A. Benczúr
While Web spam training data exists in English, we face an expensive human labeling procedure if we want to filter a Web domain in a different language. In this paper we overview how existing content and link based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and independent methods combine. In particular we show that simple bag-of-words translation works very well and in this procedure we may also rely on mixed language Web hosts, i.e. those that contain an English translation of part of the local language text. Our experiments are conducted on the ClueWeb09 corpus as the training English collection and a large Portuguese crawl of the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies, content and link based features for both ClueWeb09 and the Portuguese data.
Automatically generated spam detection based on sentence-level topic information BIBAFull-Text 1157-1160
  Yoshihiko Suhara; Hiroyuki Toda; Shuichi Nishioka; Seiji Susaki
Spammers use a wide range of content generation techniques with low quality pages known as content spam to achieve their goals. We argue that content spam must be tackled using a wide range of content quality features. In this paper, we propose novel sentence-level diversity features based on the probabilistic topic model. We combine them with other content features to build a content spam classifier. Our experiments show that our method outperforms the conventional methods.WI&C'13 welcome and organization

WI&C'13 keynote talk

Exploring very large data sets from online social networks BIBAFull-Text 1165-1166
  Virgilio Almeida
The explosion in the volume of digital data currently available in social networks has created new opportunities for scientific discoveries in the realm of social media. In particular, I show our recent progress in user preference understanding, data mining, summarization and explorative analysis of very large data sets. In information networks where users send messages to one another, the issue of information overload naturally arises: which are the most important messages? Based on a very large dataset with more 54 million user accounts and with all tweets ever posted by the collected users -- more than 1.7 billion tweets, I discuss the problem of understanding the importance of messages in Twitter.
   In another work based on large-scale crawls of over 27 million user profiles that represented nearly 50% of the entire network in 2011, I show a detailed analysis of the Google+ social network. I discuss the key differences and similarities with other popular networks like Facebook and Twitter, in order to determine whether Google+ is a new paradigm or yet another social network.

WI&C'13 session 1

Animated CAPTCHAs and games for advertising BIBAFull-Text 1167-1174
  Suhas Aggarwal
In this paper, we discuss animated captcha systems which can be very useful for advertising. They are hardly any Animated Advertisement Captchas, available these days which are more secure than single image based Captchas and more fun as well. Some solutions are available such as yo!Captcha, NLPcaptcha. Solve Media TYPE-IN Captchas. These Captchas are single image based Captchas which ask users to type in Brand message to solve Captcha for brand recall. In this paper, we discuss some more appealing media which can be used for Captcha advertising. We also present Interactive Environment/Game Captcha which provide a more powerful medium for advertising. Finally, we showcase a Game with a purpose, named 'Pick Brands' which promote advertising and can be used to obtain feedback/reviews, collect user questions concerning products/Advertisements.
A large-scale, longitudinal study of user profiles in world of warcraft BIBAFull-Text 1175-1184
  Jonathan Bell; Swapneel Sheth; Gail Kaiser
We present a survey of usage of the popular Massively Multiplayer Online Role Playing Game, World of Warcraft. Players within this game often self-organize into communities with similar interests and/or styles of play. By mining publicly available data, we collected a dataset consisting of the complete player history for approximately six million characters, with partial data for another six million characters. The paper provides a thorough description of the distributed approach used to collect this massive community data set, and then focuses on an analysis of player achievement data in particular, exposing trends in play from this highly successful game. From this data, we present several findings regarding player profiles. We correlate achievements with motivations based upon a previously-defined motivation model, and then classify players based on the categories of achievements that they pursued. Experiments show players who fall within each of these buckets can play differently, and that as players progress through game content, their play style evolves as well.
Ranking factors of team success BIBAFull-Text 1185-1194
  Nataliia Pobiedina; Julia Neidhardt; Maria del Carmen Calatrava Moreno; Hannes Werthner
As an increasing number of human activities are moving to the Web, more and more teams are predominantly virtual. Therefore, formation and success of virtual teams is an important issue in a wide range of fields. In this paper we model social behavior patterns of team work using data from virtual communities. In particular, we use data about the Web community of the multiplayer online game Dota 2 to study cooperation within teams. By applying statistical analysis we investigate how and to which extent different factors of the team in the game, such as role distribution, experience, number of friends and national diversity, have an influence on the team's success. In order to complete the picture we also rank the factors according to their influence. The results of our study imply that cooperation within the team is better than competition.

WI&C'13 session 2

Autonomously reviewing and validating the knowledge base of a never-ending learning system BIBAFull-Text 1195-1204
  Saulo D. S. Pedro; Ana Paula Appel; Estevam R., Jr. Hruschka
The amount of information available on the Web has been increasing daily. However, how one might know what is right or wrong? Does the Web itself can be used as a source for verification of information? NELL (Never-Ending Language Learner) is a computer system that gathers knowledge from Web. Prophet is a link prediction component on NELL that has been successfully used to help populate its knowledge database. However, during link prediction task performance Prophet classify some edges as misplaced edges, that is, edges that we can not assure if they are right or not. In this paper we use the Web itself, using question answer (QA) systems, as a Prophet extension to validate these edges. This is an important issue when working with a self-supervised system where inserted errors might be propagate and generate dangerous concept drifting.
End-user creation of social apps by utilizing web-based social components and visual app composition BIBAFull-Text 1205-1214
  Juwel Rana; Sarwar Morshed; Kåre Synnes
This paper presents a social component framework for the SatinII App Development Environment. The environment provides a systematic way of designing, developing and deploying personalized apps and enables end-users to develop their own apps without requiring prior knowledge of programming. A wide range of social components based on the framework have been deployed in the SatinII Editor, including components that utilize aggregated social graphs to automatically create groups or recommending/filtering information. The resulting social apps are web-based and target primarily mobile clients such as smartphones. The paper also presents a classification of social components and provides an initial user-evaluation with a small group of users. Initial results indicate that social apps can be built and deployed by end-users within 17 minutes on average after 20 to 30 minutes of being introduced to the SatinII Editor.
A personalized recommender system based on users' information in folksonomies BIBAFull-Text 1215-1224
  Mohamed Nader Jelassi; Sadok Ben Yahia; Engelbert Mephu Nguifo
Thanks to the high popularity and simplicity of folksonomies, many users tend to share objects (movies, songs, bookmarks, etc.) by annotating them with a set of tags of their own choice. Users represent the core of the system since they are both the contributors and the creators of the information. Yet, each user has its own profile and its own ideas making thereby the strength as well as the weakness of folksonomies. Indeed, it would be helpful to take account of users' profile when suggesting a list of tags and resources or even a list of friends, in order to make a more personal recommandation. The goal is to suggest tags (or resources) which may correspond to a user's vocabulary or interests rather than a list of most used and popular tags in folksonomies. In this paper, we consider users' profile as a new dimension of a folksonomy classically composed of three dimensions "users, tags, ressources" and we propose an approach to group users with equivalent profiles and equivalent interests as quadratic concepts. Then, we use quadratic concepts in order to propose our personalized recommendation system of users, tags and resources according to each user's profile. Carried out experiments on the large-scale real-world filmography dataset MovieLens highlight encouraging results in terms of precision.WOLE'13 welcome and organization

WOLE'13 keynote talk

Entity search on the web BIBAFull-Text 1231-1232
  Peter Mika
More than the half of queries in the logs of a web search engine refer directly to a single named entity or a named set of entities [1]. To support entity search queries, search engines have begun developing targeted functionality, such as rich displays of factual information, question-answering and related entity recommendations. In this talk, we will provide an overview of recent work in the field of entity search, illustrated by the example of the Spark system, a large-scale system currently in use at Yahoo! for related entity recommendations in web search. Spark combines various knowledge bases and collects evidence from query logs and social media to provide the most relevant related entities for every web query with an entity intent. We discuss the methods used in Spark as well as how the system is evaluated in daily use.

WOLE'13 technical presentations

Visually extracting data records from the deep web BIBAFull-Text 1233-1238
  Neil Anderson; Jun Hong
Web sites that rely on databases for their content are now ubiquitous. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches. Furthermore, it establishes the case for use of vision-based algorithms in the context of data extraction from web sites.
Can we use linked data semantic annotators for the extraction of domain-relevant expressions? BIBAFull-Text 1239-1246
  Michel Gagnon; Amal Zouaq; Ludovic Jean-Louis
Semantic annotation is the process of identifying expressions in texts and linking them to some semantic structure. In particular, Linked data-based Semantic Annotators are now becoming the new Holy Grail for meaning extraction from unstructured documents. This paper presents an evaluation of the main linked data-based annotators available with a focus on domain topics and named entities. In particular, we compare the ability of each tool to annotate relevant domain expressions in text. The paper also proposes a combination of annotators through voting methods and machine learning. Our results show that some linked-data annotators, especially Alchemy, can be considered as a useful resource for topic extraction. They also show that a substantial increase in recall can be achieved by combining the annotators with a weighted voting scheme. Finally, an interesting result is that by removing Alchemy from the combination, or by combining only the more precise annotators, we get a significant increase in precision, at the cost of a lower recall.
Course-specific search engines: semi-automated methods for identifying high quality topic-specific corpora BIBAFull-Text 1247-1252
  Neel Guha; Matt Wytock
Web search is an important research tool for many high school courses. However, generic search engines have a number of problems that arise out of not understanding the context of search (the high school course), leading to results that are off-topic or inappropriate as reference material. In this paper, we introduce the concept of a course-specific search engine and build such a search engine for the Advanced Placement US History (APUSH) course; the results of which are preferred by subject matter experts (high school teachers) over existing search engines. This reference search engine for APUSH relies on a hand-curated set of sites picked specifically for this educational context. In order to automate this expensive process, we describe two algorithms for indentifying high quality topical sites using an authoritative source such as a textbook: one based on textual similarity and another using structured data from knowledge bases. Initial experimental results indicate that these algorithms can successfully classify high quality documents leading to the automatic creation of topic-specific corpora for any course.
Using SKOS vocabularies for improving web search BIBAFull-Text 1253-1258
  Bernhard Haslhofer; Flávio Martins; João Magalhães
Knowledge organization systems such as thesauri or taxonomies are increasingly being expressed using the Simple Knowledge Organization System (SKOS) and published as structured data on the Web. Search engines can exploit these vocabularies and improve search by expanding terms at query or document indexing time. We propose a SKOS-based term expansion and scoring technique that leverages labels and semantic relationships of SKOS concept definitions. We also implemented this technique for Apache Lucene and Solr. Experiments with the Medical Subject Headings vocabulary and an early evaluation with Library of Congress Subject Headings indicated gains in precision when using SKOS-based expansion compared to pseudo relevance feedback and no expansion. Our findings are important for publishers and consumer of Web vocabularies who want to use them for improving search over Web documents.
@i seek 'fb.me': identifying users across multiple online social networks BIBAFull-Text 1259-1268
  Paridhi Jain; Ponnurangam Kumaraguru; Anupam Joshi
An online user joins multiple social networks in order to enjoy different services. On each joined social network, she creates an identity and constitutes its three major dimensions namely profile, content and connection network. She largely governs her identity formulation on any social network and therefore can manipulate multiple aspects of it. With no global identifier to mark her presence uniquely in the online domain, her online identities remain unlinked, isolated and difficult to search. Literature has proposed identity search methods on the basis of profile attributes, but has left the other identity dimensions e.g. content and network, unexplored. In this work, we introduce two novel identity search algorithms based on content and network attributes and improve on traditional identity search algorithm based on profile attributes of a user. We apply proposed identity search algorithms to find a user's identity on Facebook, given her identity on Twitter. We report that a combination of proposed identity search algorithms found Facebook identity for 39% of Twitter users searched while traditional method based on profile attributes found Facebook identity for only 27.4%. Each proposed identity search algorithm access publicly accessible attributes of a user on any social network. We deploy an identity resolution system, Finding Nemo, which uses proposed identity search methods to find a Twitter user's identity on Facebook. We conclude that inclusion of more than one identity search algorithm, each exploiting distinct dimensional attributes of an identity, helps in improving the accuracy of an identity resolution process.
Search result presentation: supporting post-search navigation by integration of taxonomy data BIBAFull-Text 1269-1274
  Matthias Keller; Patrick Mühlschlegel; Hannes Hartenstein
As a result of additional semantic annotations and novel mining methods, Web site taxonomies are more and more available to machines, including search engines. Recent research shows that after a search result is clicked, users often continue navigating on the destination site because in many cases a single document cannot satisfy the information need. The role Web site taxonomies play in this post-search navigation phase has not yet been researched. In this paper we analyze in an empirical study of three highly-frequented Web sites how Web site taxonomies influence the next browsing steps of users arriving from a search engine. The study reveals that users not randomly explore the destination site, but proceed to the direct child nodes of the landing page with significantly higher frequency compared to the other linked pages. We conclude that the common post-search navigation strategy in taxonomies is to descend towards more specific results. The study has interesting implications for the presentation of search results. Current search engines focus on summarizing the linked document only. In doing so, search engines ignore the fact the linked documents are in many cases just the starting point for further navigation. Based on the observed post-search navigation strategy, we propose to include information about child nodes of linked documents in the presentation of search results. Users would benefit by saving clicks, because they could not only estimate whether the linked document provides useful information, but also whether post-search navigation is promising.
RESLVE: leveraging user interest to improve entity disambiguation on short text BIBAFull-Text 1275-1284
  Elizabeth L. Murnane; Bernhard Haslhofer; Carl Lagoze
We address the Named Entity Disambiguation (NED) problem for short, user-generated texts on the social Web. In such settings, the lack of linguistic features and sparse lexical context result in a high degree of ambiguity and sharp performance drops of nearly 50% in the accuracy of conventional NED systems. We handle these challenges by developing a model of user-interest with respect to a personal knowledge context; and Wikipedia, a particularly well-established and reliable knowledge base, is used to instantiate the procedure. We conduct systematic evaluations using individuals' posts from Twitter, YouTube, and Flickr and demonstrate that our novel technique is able to achieve substantial performance gains beyond state-of-the-art NED methods.
SEED: a framework for extracting social events from press news BIBAFull-Text 1285-1294
  Salvatore Orlando; Francesco Pizzolon; Gabriele Tolomei
Everyday people are exchanging a huge amount of data through the Internet. Mostly, such data consist of unstructured texts, which often contain references to structured information (e.g., person names, contact records, etc.). In this work, we propose a novel solution to discover social events from actual press news edited by humans. Concretely, our method is divided in two steps, each one addressing a specific Information Extraction (IE) task: first, we use a technique to automatically recognize four classes of named-entities from press news: DATE, LOCATION, PLACE, and ARTIST. Furthermore, we detect social events by extracting ternary relations between such entities, also exploiting evidence from external sources (i.e., the Web). Finally, we evaluate both stages of our proposed solution on a real-world dataset. Experimental results highlight the quality of our first-step Named-Entity Recognition (NER) approach, which indeed performs consistently with state-of-the-art solutions. Eventually, we show how to precisely select true events from the list of all candidate events (i.e., all the ternary relations), which result from our second-step Relation Extraction (RE) method. Indeed, we discover that true social events can be detected if enough evidence of those is found in the result list of Web search engines.
Classifying YouTube channels: a practical system BIBAFull-Text 1295-1304
  Vincent Simonet
This paper presents a framework for categorizing channels of videos in a thematic taxonomy with high precision and coverage. The proposed approach consists of three main steps.First, videos are annotated by semantic entities describing their central topics. Second, semantic entities are mapped to categories using a combination of classifiers.Last, the categorization of channels is obtained by combining the results of both previous steps.
   This framework has been deployed on the whole corpus of YouTube, in 8 languages, and used to build several user facing products. Beyond the description of the framework, this paper gives insight into practical aspects and experience: rationale from product requirements to the choice of the solution, spam filtering, human-based evaluations of the quality of the results, and measured metrics on the live site.WOW'13 welcome and organization

WOW'13 technical presentations

The economics of data: quality, value & exchange in web observatories BIBAFull-Text 1309-1316
  Paul Booth; Paul Gaskell; Chris Hughes
The aim of this paper is to present a requirement for assessing the quality of data and the development of efficient methods of valuing and exchanging data among Web Observatories. Using economic and business theory a range of concepts are explored which include a brief review of existing business structures related to the exchange of goods, data or otherwise. The paper calls for a wider discussion by the Web Observatory community to begin to define relevant criteria by which data can be assessed and improved over time. The economic incentives are addressed as part of a price by proxy framework we introduce, which is supported by the need to strive for clear pricing signals and the reduction of information asymmetries. What is presented here is a way of establishing and improving data quality with a view to valuing data exchanges that does not require the presence of money in the transaction, yet it remains tied to revenue generation models as they exist online.
From search to observation BIBAFull-Text 1317-1320
  Ian Brown; Wendy Hall; Lisa Harris
In this paper, we propose a set of concepts underlying the process and requirements of observation: that is, the process of employing web observatories for research. We refer to observation as a new concept, distinct from search, which we believe is worthy of study in its own right and note that the process of observation moves the focus of information retrieval away from universal coverage and towards improved quality of results and thus has many potential facets not necessarily present in traditional search.
Living analytics methods for the web observatory BIBAFull-Text 1321-1324
  Ernesto Diaz-Aviles
The collective effervescence of social media production has been enjoying a great deal of success in recent years. The hundred of millions of users who are actively participating in the Social Web are exposed to ever-growing amounts of sites, relationships, and information.
   In this paper, we report part of the efforts towards the realization of a Web Observatory at the L3S Research Center (www.L3S.de). In particular, we present our approach based on Living Analytics methods, whose main goal is to capture people interactions in real-time and to analyze multidimensional relationships, metadata, and other data becoming ubiquitous in the social web, in order to discover the most relevant and attractive information to support observation, understanding and analysis of the Web. We center the discussion on two areas: (i) Recommender Systems for Big Fast Data and (ii) Collective Intelligence, both key components towards an analytics toolbox for our Web Observatory.
Exploration in web science: instruments for web observatories BIBAFull-Text 1325-1328
  Marie Joan Kristine Gloria; Deborah L. McGuinness; Joanne S. Luciano; Qingpeng Zhang
The following contribution highlights selected work conducted by Rensselaer Polytechnic Institute's Web Science Research Center. (RPI WSRC). Specifically, it brings to light four different themed Web Observatories -- Science Data, Health and Life Sciences, Open Government, and Social Spaces. Each of these observatories serves as a repository of data, tools, and methods that help answer complicated questions in each of these research areas. We present six case studies featuring tools and methods developed by RPI WSRC to aide in the exploration, discovery, and analysis of large data sets. These case studies along with our web observatory developments are aimed to increase our understanding of web science in general and to serve as test beds for our research.
From health-persona to societal health BIBAFull-Text 1329-1334
  Ramesh Jain; Laleh Jalali; Mingming Fan
In this position paper, we propose an approach for Web Observatories that builds on using social media, personal data, and sensors to build Persona for an individual, but also use this data and the concept of Focused Micro Blogs (FMB) for situation detection, helping individual using situation action rules, and finally gaining insights for obtaining insights about society. We demonstrate this in a concrete use case of fitness and health care related sensors for building health persona and using this for understanding societal health issues.
Understanding the diversity of tweets in the time of outbreaks BIBAFull-Text 1335-1342
  Nattiya Kanhabua; Wolfgang Nejdl
A microblogging service like Twitter continues to surge in importance as a means of sharing information in social networks. In the medical domain, several works have shown the potential of detecting public health events (i.e., infectious disease outbreaks) using Twitter messages or tweets. Given its real-time nature, Twitter can enhance early outbreak warning for public health authorities in order that a rapid response can take place. Most of previous works on detecting outbreaks in Twitter simply analyze tweets matched disease names and/or locations of interests. However, the effectiveness of such method is limited for two main reasons. First, disease names are highly ambiguous, i.e., referring slangs or non health-related contexts. Second, the characteristics of infectious diseases are highly dynamic in time and place, namely, strongly time-dependent and vary greatly among different regions. In this paper, we propose to analyze the temporal diversity of tweets during the known periods of real-world outbreaks in order to gain insight into a temporary focus on specific events. More precisely, our objective is to understand whether the temporal diversity of tweets can be used as indicators of outbreak events, and to which extent. We employ an efficient algorithm based on sampling to compute the diversity statistics of tweets at particular time. To this end, we conduct experiments by correlating temporal diversity with the estimated event magnitude of 14 real-world outbreak events manually created as ground truth. Our analysis shows that correlation results are diverse among different outbreaks, which can reflect the characteristics (severity and duration) of outbreaks.
KONECT: the Koblenz network collection BIBAFull-Text 1343-1350
  Jérôme Kunegis
We present the Koblenz Network Collection (KONECT), a project to collect network datasets in the areas of web science, network science and related areas, as well as provide tools for their analysis. In the cited areas, a surprisingly large number of very heterogeneous data can be modeled as networks and consequently, a unified representation of networks can be used to gain insight into many kinds of problems. Due to the emergence of the World Wide Web in the last decades many such datasets are now openly available. The KONECT project thus has the goal of collecting many diverse network datasets from the Web, and providing a way for their systematic study. The main parts of KONECT are (1) a collection of over 160 network datasets, consisting of directed, undirected, unipartite, bipartite, weighted, unweighted, signed and temporal networks collected from the Web, (2) a Matlab toolbox for network analysis and (3) a website giving a compact overview the various computed statistics and plots. In this paper, we describe KONECT's taxonomy of networks datasets, give an overview of the datasets included, review the supported statistics and plots, and briefly discuss KONECT's role in the area of web science and network science.
Design and prototyping of a social media observatory BIBAFull-Text 1351-1358
  Karissa McKelvey; Filippo Menczer
The broad adoption of online social networking platforms has made it possible to study communication networks at an unprecedented scale. With social media and micro-blogging platforms such as Twitter, we can observe high-volume data streams of online discourse. However, it is a challenge to collect, manage, analyze, visualize, and deliver large amounts of data, even by experts in the computational sciences. In this paper, we describe our recent extensions to Truthy, a social media observatory that collects and analyzes discourse on Twitter dating from August 2010. We introduce several interactive visualizations and analytical tools with the goal of enabling researchers to study online social networks with mixed methods at multiple scales. We present design considerations and a prototype for integrating social media observatories as important components of a web observatory framework.
EventShop: recognizing situations in web data streams BIBAFull-Text 1359-1368
  Siripen Pongpaichet; Vivek K. Singh; Mingyan Gao; Ramesh Jain
Web Observatories must address fundamental societal challenges using enormous volumes of data being created due to the significant progress in technology. The proliferation of heterogeneous data streams generated by social media, sensor networks, internet of things, and digitalization of transactions in all aspect of humans? life presents an opportunity to establish a new era of networks called Social Life Networks (SLN). The main goal of SLN is to connect People to Resources effectively, efficiently, and promptly in given Situations. Towards this goal, we present a computing framework, called EventShop, to recognize evolving situations from massive web streams in real-time. These web streams can be fundamentally considered as spatio-temporal-thematic streams and can be combined using a set of generic spatio-temporal analysis operators to recognize evolving situations. Based on the detected situations, the relevant information and alerts can be provided to both individuals and organizations. Several examples from the real world problems have been developed to test the efficacy of EventShop framework.
SemantEco: a next-generation web observatory BIBAFull-Text 1369-1372
  A. Patrice Seyed; Tim Lebo; Evan Patton; Jim McCusker; Deborah McGuinness
A web observatory for empirical research of Web data benefits from software frameworks that are modular, has a clear underlying semantic model, and that includes metadata enabling a trace and inspection of the source data and justifications for derived datasets. We present SemantEco as an architecture that can serve as an exemplar abstraction for infrastructure design and metadata based on best practices in Semantic Web, Provenance, and Software Engineering, that can be employed in any Web Observatory, that may grow out of a community. We will describe how the SemantEco framework allows for searching, visualizing, and tracing a wide variety of data.
An approach for using Wikipedia to measure the flow of trends across countries BIBAFull-Text 1373-1378
  Ramine Tinati; Thanassis Tiropanis; Lesie Carr
Wikipedia has grown to become the most successful online encyclopedia on the Web, containing over 24 million articles, offered in over 240 languages. In just over 10 years Wikipedia has transformed from being just an encyclopedia of knowledge, to a wealth of facts and information, from articles discussing trivia, political issues, geographies and demographics, to popular culture, news articles, and social events. In this paper we explore the use of Wikipedia for identifying the flow of information and trends across the world. We start with the hypothesis that, given that Wikipedia is a resource that is globally available in different languages across countries, access to its articles could be a reflection human activity. To explore this hypothesis we try to establish metrics on the use of Wikipedia in order to identify potential trends and to establish whether or how those trends flow from one county to another. We subsequently compare the outcome of this analysis to that of more established methods that are based on online social media or traditional media. We explore this hypothesis by applying our approach to a subset of Wikipedia articles and also a specific worldwide social phenomenon that occurred during 2012; we investigate whether access to relevant Wikipedia articles correlates to the viral success of the South Korean pop song, "Gangnam Style" and the associated artist "PSY" as evidenced by traditional and online social media. Our analysis demonstrates that Wikipedia can indeed provide a useful measure for detecting social trends and events, and in the case that we studied; it could have been possible to identify the specific trend quicker in comparison to other established trend identification services such as Google Trends.
A glance at an overlooked part of the world wide web BIBAFull-Text 1379-1386
  Ionut Trestian; Chunjing Xiao; Aleksandar Kuzmanovic
Although according to surveys related to internet user activity it is considered one of the most popular aspects, few studies are actually concerned with internet pornography. This paper is aimed at rectifying that overlook. In particular, we study user activity related to internet pornography by looking at two main behaviors: (i) watching pornography, and (ii) providing feedback on pornography items in the form of ratings and comments.
   By using appropriate datasets that we collect, we make contributions related to the study of both behaviors pointed out above. With regards to viewing, we observe that views are highly dependent on pornography category and video size. By studying the feedback system of pornography video websites, we observe differences in the way users rate items across websites popular in different parts of the world. Finally, we employ sentiment analysis to study the comments that users leave on pornography websites and we find surprising similarities across the analyzed websites. Our results pave the way to understanding more about human behavior related to internet pornography and can impact, among others, fields such as content personalization, video content delivery, recommender systems.

WS-REST'13 technical presentations

A concept for generating simplified RESTful interfaces BIBAFull-Text 1391-1398
  Markus Gulden; Stefan Kugele
Today, innovative companies are forced to evolve their software systems faster and faster, either for providing customer services and products or for supporting internal processes. At the same time, already existing, maybe even legacy systems are crucial for different reasons and by that cannot be abolished easily. While integrating legacy software into new systems in general is considered by well-known approaches like SOA (service-oriented architecture), at the best of our knowledge, it lacks of ways to make legacy systems available for remote clients like smart phones or embedded devices.
   In this paper, we propose an approach to leverage heterogeneous (legacy) applications by adding RESTful web-based interfaces in a model-driven way. We introduce an additional application layer, which encapsulates services of one or several existing applications, and provides a unified, web-based, and seamless interface. This interface is modelled in our own DSL (domain-specific language), the belonging code generator produces productive Java code. Finally, we report on an case study proving our concept by means of an e-bike sharing service.
Distributed affordance: an open-world assumption for hypermedia BIBAFull-Text 1399-1406
  Ruben Verborgh; Michael Hausenblas; Thomas Steiner; Erik Mannens; Rik Van de Walle
Hypermedia links and controls drive the Web by transforming information into affordances through which users can choose actions. However, publishers of information cannot predict all actions their users might want to perform and therefore, hypermedia can only serve as the engine of application state to the extent the user's intentions align with those envisioned by the publisher. In this paper, we introduce distributed affordance, a concept and architecture that extends application state to the entire Web. It combines information inside the representation with knowledge of action providers to generate affordance from the user's perspective. Unlike similar approaches such as Web Intents, distributed affordance scales both in the number of actions and the number of action providers, because it is resource-oriented instead of action-oriented. A proof-of-concept shows that distributed affordance is a feasible strategy on today's Web.
A framework for self-descriptive RESTful services BIBAFull-Text 1407-1414
  Luca Panziera; Flavio De Paoli
REST principles define services as resources that can be manipulated by a set of well-known methods. The same approach is suitable to define service descriptions as resources. In this paper, we try to unify the two concepts (services and their descriptions) by proposing a set of best practices to build self-descriptive RESTful services accessible by both humans and machines. Moreover, to make those practices usable with little manual effort, we provide a software framework that extracts compliant descriptions from documents published on the Web, and makes them available to clients as resources.
Model your application domain, not your JSON structures BIBAFull-Text 1415-1420
  Markus Lanthaler; Christian Gütl
Creating truly RESTful Web APIs is still more an art than a science. Developers have to struggle with a number of complex design decisions because concrete guidelines and processes are missing. Consequently, often it is decided to implement the simplest solution which is, most of the time, to rely on out-of-band contracts between the client and the server. Instead of properly modeling the application domain, all the effort is put in the design of proprietary JSON structures and URLs. This then forms the base for the contract which is communicated in natural-language (with all its ambiguity) to client developers. Since it is the server who owns the contract it may be changed at any point, which, more often than not, results in broken clients. In this position paper, we discuss some of the challenges and choices that need to be made when designing RESTful Web APIs. In particular, we compare how contracts are supposed to be established and how they are defined in practice. We illustrate the problems that are the cause of these divergences. As a first step to address these issues we describe and motivate an alternative, domain-driven approach to design Web APIs.