
Proceedings of the 2012 International Conference on the World Wide Web

Fullname: Proceedings of the 21st International Conference on World Wide Web
Editors: Alain Mille; Fabien Gandon; Jacques Misselis; Michael Rabinovich; Steffen Staab
Location: Lyon, France
Dates: 2012-Apr-16 to 2012-Apr-20
Standard No: ISBN 1-4503-1229-2, 978-1-4503-1229-5; ACM DL: Table of Contents hcibib: WWW12-1
Links: Conference Home Page
Summary: It is our great pleasure to present to you the technical programme of WWW 2012.
    The call for papers attracted a breath-taking 885 full-paper submissions. 521 PC members and 235 supporting reviewers evaluated these papers. Based on 3,247 reviews and meta-reviews, as well as many discussions held online, the 27 track chairs and deputy chairs met with us for two full days in order to discuss and decide upon the final acceptance of full papers. In the end, 108 papers -- only 12% of the submissions -- could be accommodated in the technical programme.
    These papers represent a well-balanced mix over the 13 tracks listed in the call for papers. Though the different tracks received very different numbers of submissions, our principal selection criterion was quality; hence, we neither enforced a proportional share of acceptances among the tracks nor did we favor tracks with lower numbers of submissions.
  1. WWW 2012-04-16 Volume 1
    1. Recommender systems
    2. Mobile Web Performance
    3. Security and fraud in social networks
    4. Advertising on the web 1
    5. Semantic web approaches in search
    6. Web performance
    7. Fraud and bias in user ratings
    8. Mobile and file-sharing users
    9. Behavioral analysis and content characterization in social media
    10. Creating and using links between data objects
    11. Security 1
    12. Community detection in social networks
    13. Advertising on the web 2
    14. Search
    15. Obtaining and leveraging user comments
    16. Entity linking
    17. Search: evaluation and ranking
    18. Information diffusion in social networks
    19. Game theory and the web
    20. Leveraging user actions in search
    21. Web user behavioral analysis and modeling
    22. Ontology representation and querying: RDF and SPARQL
    23. Security 2
    24. Social interactions and the web
    25. Entity and taxonomy extraction
    26. Leveraging user-generated content
    27. Data and content management 1
    28. Web engineering 1
    29. Collaboration in social networks
    30. Information extraction
    31. Web mining
    32. Data and content management 2
    33. Web engineering 2
    34. Crowdsourcing
    35. Social networks
    36. User interfaces and human factors
  2. WWW 2012-04-16 Volume 2
    1. Industry track presentations
    2. PhD Symposium
    3. European track presentations
    4. Demonstrations
    5. Poster presentations
    6. SWDM'12 workshop 1
    7. XperienceWeb'12 Workshop 2
    8. CQA'12 workshop 3
    9. EMAIL'12 workshop 4
    10. AdMIRe'12 workshop 6
    11. MultiAPro'12 workshop 7
    12. LSNA'12 workshop 8
    13. SWCS'12 workshop 9
    14. MSND'12 workshop 10

WWW 2012-04-16 Volume 1

Recommender systems

Build your own music recommender by modeling internet radio streams BIBAFull-Text 1-10
  Natalie Aizenberg; Yehuda Koren; Oren Somekh
In the Internet music scene, where recommendation technology is key for navigating huge collections, large market players enjoy a considerable advantage. Accessing a wider pool of user feedback leads to an increasingly more accurate analysis of user tastes, effectively creating a "rich get richer" effect. This work aims at significantly lowering the entry barrier for creating music recommenders, through a paradigm coupling a public data source and a new collaborative filtering (CF) model. We claim that Internet radio stations form a readily available resource of abundant fresh human signals on music through their playlists, which are essentially cohesive sets of related tracks. In a way, our models rely on the knowledge of a diverse group of experts in lieu of the commonly used wisdom of crowds. Over several weeks, we aggregated publicly available playlists of thousands of Internet radio stations, resulting in a dataset encompassing millions of plays, and hundreds of thousands of tracks and artists. This provides the large scale ground data necessary to mitigate the cold start problem of new items at both mature and emerging services.
   Furthermore, we developed a new probabilistic CF model, tailored to the Internet radio resource. The success of the model was empirically validated on the collected dataset. Moreover, we tested the model in a cross-source transfer-learning manner -- the same model trained on the Internet radio data was used to predict behavior of Yahoo! Music users. This demonstrates the ability to tap the Internet radio signals in other music recommendation setups. Based on encouraging empirical results, our hope is that the proposed paradigm will make quality music recommendation accessible to all interested parties in the community.
Using control theory for stable and efficient recommender systems BIBAFull-Text 11-20
  Tamas Jambor; Jun Wang; Neal Lathia
The aim of a web-based recommender system is to provide highly accurate and up-to-date recommendations to its users; in practice, it will hope to retain its users over time. However, this raises unique challenges. To achieve complex goals such as keeping the recommender model up-to-date over time, we need to consider a number of external requirements. Generally, these requirements arise from the physical nature of the system, for instance the available computational resources. Ideally, we would like to design a system that does not deviate from the required outcome. Modeling such a system over time requires describing the internal dynamics as a combination of the underlying recommender model and its users' behavior. We propose to solve this problem by applying the principles of modern control theory -- a powerful set of tools to deal with dynamical systems -- to construct and maintain a stable and robust recommender system for dynamically evolving environments. In particular, we introduce a design principle by focusing on the dynamic relationship between the recommender system's performance and the number of new training samples the system requires. This enables us to automate the control of other external factors such as the system's update frequency. We show that, by using a Proportional-Integral-Derivative controller, a recommender system is able to automatically and accurately estimate the required input to keep the output close to pre-defined requirements. Our experiments on a standard rating dataset show that, by using a feedback loop between system performance and training, the trade-off between the effectiveness and efficiency of the system can be well maintained. We close by discussing the widespread applicability of our approach to a variety of scenarios that recommender systems face.
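The Proportional-Integral-Derivative loop described in this abstract can be sketched as follows. This is a minimal illustration of a generic PID controller, not the paper's implementation; the class name, gains, and setpoint are all assumptions chosen for the example.

```python
# A generic PID controller: given a measured performance metric, it emits a
# control signal (here interpreted as "how many new training samples to
# request") that drives the metric toward a target setpoint.

class PIDController:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured):
        error = self.setpoint - measured          # proportional term input
        self.integral += error                    # accumulated past error
        derivative = error - self.prev_error      # rate of change of error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Below the target accuracy, the controller asks for more training data;
# above it, the (negative) output would throttle intake.
pid = PIDController(kp=0.6, ki=0.1, kd=0.05, setpoint=0.85)
assert pid.update(0.5) > 0    # underperforming: request more samples

pid2 = PIDController(kp=0.6, ki=0.1, kd=0.05, setpoint=0.85)
assert pid2.update(0.95) < 0  # overperforming: reduce update effort
```

The feedback loop in the paper closes this circuit: the controller's output sets the training intake, and the resulting model accuracy is fed back as the next measurement.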
An exploration of improving collaborative recommender systems via user-item subgroups BIBAFull-Text 21-30
  Bin Xu; Jiajun Bu; Chun Chen; Deng Cai
Collaborative filtering (CF) is one of the most successful recommendation approaches. It typically associates a user with a group of like-minded users based on their preferences over all the items, and recommends to the user those items enjoyed by others in the group. However, we find that two users with similar tastes on one item subset may have totally different tastes on another set. In other words, there exist many user-item subgroups, each consisting of a subset of items and a group of like-minded users on these items. It is more natural to make preference predictions for a user via the correlated subgroups than the entire user-item matrix. In this paper, to find meaningful subgroups, we formulate the Multiclass Co-Clustering (MCoC) problem and propose an effective solution to it. Then we propose a unified framework to extend the traditional CF algorithms by utilizing the subgroup information to improve their top-N recommendation performance. Our approach can be seen as an extension of traditional clustering CF models. Systematic experiments on three real-world data sets have demonstrated the effectiveness of our proposed approach.

Mobile Web Performance

How far can client-only solutions go for mobile browser speed? BIBAFull-Text 31-40
  Zhen Wang; Felix Xiaozhu Lin; Lin Zhong; Mansoor Chishtie
The mobile browser is known to be slow because of the bottleneck in resource loading. Client-only solutions to improve resource loading are attractive because they are immediately deployable, scalable, and secure. We present the first publicly known treatment of client-only solutions to understand how much they can improve mobile browser speed without infrastructure support. Leveraging an unprecedented set of web usage data collected from 24 iPhone users continuously over one year, we examine the three fundamental, orthogonal approaches a client-only solution can take: caching, prefetching, and speculative loading. Speculative loading, first proposed and studied in this work, predicts and speculatively loads the subresources needed to open a webpage once its URL is given. We show that while caching and prefetching are highly limited for mobile browsing, speculative loading can be significantly more effective. Empirically, we show that client-only solutions can improve the browser speed by about 1.4 seconds on average for websites visited by the 24 iPhone users. We also report the design, realization, and evaluation of speculative loading in a WebKit-based browser called Tempo. On average, Tempo can reduce browser delay by 1 second (~20%).
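The core of speculative loading as described here can be sketched in a few lines: remember which subresources each page URL needed on previous visits, and predict them the next time that URL is requested so they can be fetched in parallel with the main document. This is an illustrative sketch, not Tempo's actual predictor; the class and method names are invented.

```python
from collections import defaultdict

class SpeculativeLoader:
    """Learns page -> subresource associations from past loads and
    predicts subresources to prefetch on future navigations."""

    def __init__(self):
        self.subresources = defaultdict(set)  # page URL -> seen subresource URLs

    def record(self, page_url, loaded_subresources):
        # Called after a page finishes loading.
        self.subresources[page_url].update(loaded_subresources)

    def predict(self, page_url):
        # Called as soon as the user navigates; returns URLs to fetch speculatively.
        return sorted(self.subresources[page_url])

loader = SpeculativeLoader()
loader.record("http://example.com/", ["/app.js", "/style.css"])
loader.record("http://example.com/", ["/app.js", "/logo.png"])
assert loader.predict("http://example.com/") == ["/app.js", "/logo.png", "/style.css"]
assert loader.predict("http://example.com/unseen") == []
```

The hard part the paper evaluates is accuracy on real traces: a prediction that is wrong wastes cellular bandwidth, while one that is right hides subresource latency behind the main document fetch.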
Who killed my battery?: analyzing mobile browser energy consumption BIBAFull-Text 41-50
  Narendran Thiagarajan; Gaurav Aggarwal; Angela Nicoara; Dan Boneh; Jatinder Pal Singh
Despite the growing popularity of mobile web browsing, the energy consumed by a phone browser while surfing the web is poorly understood. We present an infrastructure for measuring the precise energy used by a mobile browser to render web pages. We then measure the energy needed to render financial, e-commerce, email, blogging, news and social networking sites. Our tools are sufficiently precise to measure the energy needed to render individual web elements, such as cascade style sheets (CSS), Javascript, images, and plug-in objects. Our results show that for popular sites, downloading and parsing cascade style sheets and Javascript consumes a significant fraction of the total energy needed to render the page. Using the data we collected we make concrete recommendations on how to design web pages so as to minimize the energy needed to render the page. As an example, by modifying scripts on the Wikipedia mobile site we reduced by 30% the energy needed to download and render Wikipedia pages with no change to the user experience. We conclude by estimating the point at which offloading browser computations to a remote proxy can save energy on the phone.
Periodic transfers in mobile applications: network-wide origin, impact, and optimization BIBAFull-Text 51-60
  Feng Qian; Zhaoguang Wang; Yudong Gao; Junxian Huang; Alexandre Gerber; Zhuoqing Mao; Subhabrata Sen; Oliver Spatscheck
Cellular networks employ a specific radio resource management policy distinguishing them from wired and Wi-Fi networks. A lack of awareness of this important mechanism potentially leads to resource-inefficient mobile applications. We perform the first network-wide, large-scale investigation of a particular type of application traffic pattern called periodic transfers where a handset periodically exchanges some data with a remote server every t seconds. Using packet traces containing 1.5 billion packets collected from a commercial cellular carrier, we found that periodic transfers are very prevalent in today's smartphone traffic. However, they are extremely resource-inefficient for both the network and end-user devices even though they predominantly generate very little traffic. This somewhat counter-intuitive behavior is a direct consequence of the adverse interaction between such periodic transfer patterns and the cellular network radio resource management policy. For example, for popular smartphone applications such as Facebook, periodic transfers account for only 1.7% of the overall traffic volume but contribute to 30% of the total handset radio energy consumption. We found periodic transfers are generated for various reasons such as keep-alive, polling, and user behavior measurements. We further investigate the potential of various traffic shaping and resource control algorithms. Depending on their traffic patterns, applications exhibit disparate responses to optimization strategies. Jointly using several strategies with moderate aggressiveness can eliminate almost all energy impact of periodic transfers for popular applications such as Facebook and Pandora.
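One of the traffic-shaping strategies this abstract alludes to is batching: delaying periodic transfers so that several fire within the same radio-on window, reducing the number of costly radio state promotions. The sketch below only illustrates that idea; the function name, window size, and event times are assumptions, not values from the paper.

```python
def batch_timers(fire_times, window):
    """Shift each requested transfer time to the end of the window it
    falls in, so transfers within the same window share one radio wakeup."""
    return [((t // window) + 1) * window for t in fire_times]

# Seconds at which different apps want to send keep-alives or polls:
events = [3, 5, 17, 21, 24]
batched = batch_timers(events, window=10)
assert batched == [10, 10, 20, 30, 30]
# Five independent wakeups collapse into three shared ones:
assert len(set(batched)) < len(set(events))
```

The trade-off, which the paper quantifies per application, is between energy saved by fewer radio promotions and the added delay each transfer tolerates.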

Security and fraud in social networks

Understanding and combating link farming in the twitter social network BIBAFull-Text 61-70
  Saptarshi Ghosh; Bimal Viswanath; Farshad Kooti; Naveen Kumar Sharma; Gautam Korlam; Fabricio Benevenuto; Niloy Ganguly; Krishna Phani Gummadi
Recently, Twitter has emerged as a popular platform for discovering real-time information on the Web, such as news stories and people's reaction to them. Like the Web, Twitter has become a target for link farming, where users, especially spammers, try to acquire large numbers of follower links in the social network. Acquiring followers not only increases the size of a user's direct audience, but also contributes to the perceived influence of the user, which in turn impacts the ranking of the user's tweets by search engines.
   In this paper, we first investigate link farming in the Twitter network and then explore mechanisms to discourage the activity. To this end, we conducted a detailed analysis of links acquired by over 40,000 spammer accounts suspended by Twitter. We find that link farming is widespread and that a majority of spammers' links are farmed from a small fraction of Twitter users, the social capitalists, who are themselves seeking to amass social capital and links by following back anyone who follows them. Our findings shed light on the social dynamics that are at the root of the link farming problem in the Twitter network, and they have important implications for future designs of link spam defenses. In particular, we show that a simple user ranking scheme that penalizes users for connecting to spammers can effectively address the problem by disincentivizing users from linking with other users simply to gain influence.
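A ranking that penalizes users for connecting to spammers can be sketched as a simple propagation over the follow graph. The paper's concrete scheme is based on a personalized-PageRank-style computation and is more elaborate; the function below, with its names, damping factor, and toy graph all invented, only illustrates how a penalty flows from known spammers to the "social capitalists" who follow them.

```python
def penalty_scores(following, spammers, hops=2):
    """following: user -> list of users they follow.
    Returns a penalty in [0, 1]; higher means closer to known spammers."""
    score = {u: (1.0 if u in spammers else 0.0) for u in following}
    for _ in range(hops):
        new = {}
        for u, outs in following.items():
            base = 1.0 if u in spammers else 0.0
            nbr = sum(score.get(v, 0.0) for v in outs) / len(outs) if outs else 0.0
            new[u] = max(base, 0.5 * nbr)  # inherit half of followees' penalty
        score = new
    return score

following = {
    "spammer1":   [],
    "spammer2":   [],
    "capitalist": ["spammer1", "spammer2", "honest"],  # follows back anyone
    "honest":     [],
}
s = penalty_scores(following, spammers={"spammer1", "spammer2"})
assert s["capitalist"] > s["honest"]   # following spammers is penalized
```

Ranking tweets by (influence minus penalty) then removes the incentive to follow back indiscriminately, which is the disincentive effect the abstract describes.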
Analyzing spammers' social networks for fun and profit: a case study of cyber criminal ecosystem on twitter BIBAFull-Text 71-80
  Chao Yang; Robert Harkreader; Jialong Zhang; Seungwon Shin; Guofei Gu
In this paper, we perform an empirical analysis of the cyber criminal ecosystem on Twitter. Essentially, through analyzing inner social relationships in the criminal account community, we find that criminal accounts tend to be socially connected, forming a small-world network. We also find that criminal hubs, sitting in the center of the social graph, are more inclined to follow criminal accounts. Through analyzing outer social relationships between criminal accounts and their social friends outside the criminal account community, we reveal three categories of accounts that have close friendships with criminal accounts. Through these analyses, we provide a novel and effective criminal account inference algorithm by exploiting criminal accounts' social relationships and semantic coordinations.
Branded with a scarlet "C": cheaters in a gaming social network BIBAFull-Text 81-90
  Jeremy Blackburn; Ramanuja Simha; Nicolas Kourtellis; Xiang Zuo; Matei Ripeanu; John Skvoretz; Adriana Iamnitchi
Online gaming is a multi-billion dollar industry that entertains a large, global population. One unfortunate phenomenon, however, poisons the competition and the fun: cheating. The costs of cheating span from industry-supported expenditures to detect and limit cheating, to victims' monetary losses due to cyber crime. This paper studies cheaters in the Steam Community, an online social network built on top of the world's dominant digital game delivery platform. We collected information about more than 12 million gamers connected in a global social network, of which more than 700 thousand have their profiles flagged as cheaters. We also collected in-game interaction data of over 10 thousand players from a popular multiplayer gaming server. We show that cheaters are well embedded in the social and interaction networks: their network position is largely indistinguishable from that of fair players. We observe that the cheating behavior appears to spread through a social mechanism: the presence and the number of cheater friends of a fair player is correlated with the likelihood of her becoming a cheater in the future. Also, we observe that there is a social penalty involved with being labeled as a cheater: cheaters are likely to switch to more restrictive privacy settings once they are tagged and they lose more friends than fair players. Finally, we observe that the number of cheaters is not correlated with the geographical, real-world population density, or with the local popularity of the Steam Community.

Advertising on the web 1

Risk-aware revenue maximization in display advertising BIBAFull-Text 91-100
  Ana Radovanovic; William D. Heavlin
Display advertising is the graphical advertising on the World Wide Web (WWW) that appears next to content on web pages, instant messaging (IM) applications, email, etc. Over the past decade, display ads have evolved from simple banner and pop-up ads to include various combinations of text, images, audio, video, and animations. As a market segment, display continues to show substantial growth potential, as evidenced by companies such as Microsoft, Yahoo, and Google actively vying for market share. As a sales process, display ads are typically sold in packages, the result of negotiations between sales and advertising agents. A key component of any successful business model in display advertising is sound pricing. The main objectives for on-line publishers (e.g. Amazon, YouTube, CNN) are maximizing revenue while managing their available inventory appropriately, and pricing must reflect these considerations.
   This paper addresses the problem of maximizing revenue by adjusting prices of display inventory. We cast this as an inventory allocation problem. Our formal objective (a) maximizes revenue using (b) iterative price adjustments in the direction of the gradient of an appropriately constructed Lagrangian relaxation. We show that our optimization approach drives the revenue towards a local maximum under mild conditions on the properties of the (unknown) demand curve.
   The major unknown for optimizing revenue in the display environment is how the demand for display ads changes with prices, the classical demand curve. This we address directly, by way of a factorial pricing experiment. This enables us to estimate the gradient of the revenue function with respect to inventory prices. Overall, the result is a principled, risk-aware, and empirically efficient methodology.
   This paper is based on research undertaken on behalf of one of Google's clients.
Targeting converters for new campaigns through factor models BIBAFull-Text 101-110
  Deepak Agarwal; Sandeep Pandey; Vanja Josifovski
In performance-based display advertising, campaign effectiveness is often measured in terms of conversions that represent some desired user actions, like purchases and product information requests on advertisers' websites. Hence, identifying and targeting potential converters is of vital importance to boost campaign performance. This is often accomplished by marketers who define the user base of campaigns based on behavioral, demographic, search, social, purchase, and other characteristics. Such a process is manual and subjective, and it often fails to utilize the full potential of targeting. In this paper we show that by using past converted users of campaigns and campaign meta-data (e.g., ad creatives, landing pages), we can combine disparate user information in a principled way to effectively and automatically target converters for new/existing campaigns. At the heart of our approach is a factor model that estimates the affinity of each user feature to a campaign using historical conversion data. In fact, our approach allows building a conversion model for a brand new campaign through campaign meta-data alone, and hence targets potential converters even before the campaign is run. Through extensive experiments, we show the superiority of our factor model approach relative to several other baselines. Moreover, we show that the performance of our approach at the beginning of a campaign's life is typically better than the other models even when they are trained using all conversion data after the campaign has completed. This clearly shows the importance and value of using historical campaign data in constructing an effective audience selection strategy for display advertising.
How effective is targeted advertising? BIBAFull-Text 111-120
  Ayman Farahat; Michael C. Bailey
Advertisers are demanding more accurate estimates of the impact of targeted advertisements, yet no study proposes an appropriate methodology to analyze the effectiveness of a targeted advertising campaign, and there is a dearth of empirical evidence on the effectiveness of targeted advertising as a whole. The targeted population is more likely to convert from advertising, so the response lift between the targeted and untargeted groups is likely an overestimate of the impact of targeted advertising. We propose a difference-in-differences estimator to account for this selection bias by decomposing the impact of targeting into selection bias and treatment effect components. Using several large-scale online advertising campaigns, we test the effectiveness of targeted advertising on brand-related searches and clickthrough rates. We find that the treatment effect on the targeted group is about twice as large for brand-related searches, but naively estimating this effect without taking into account selection bias leads to an overestimation of the lift from targeting on brand-related searches by almost 1,000%.
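The difference-in-differences decomposition can be made concrete with toy numbers. The conversion rates below are invented for illustration; only the estimator's structure follows the abstract: subtract the untargeted group's lift (what exposure does to a random audience) from the targeted group's lift, leaving the incremental effect of targeting itself.

```python
def diff_in_diff(targeted_exposed, targeted_control,
                 untargeted_exposed, untargeted_control):
    """Each argument is a group's conversion rate. Returns the estimated
    treatment effect of targeting, net of selection bias."""
    lift_targeted = targeted_exposed - targeted_control
    lift_untargeted = untargeted_exposed - untargeted_control
    return lift_targeted - lift_untargeted

# Hypothetical rates: targeted users convert more even without ads
# (selection), so a naive exposed-vs-exposed comparison overstates targeting.
naive_lift = 0.060 - 0.020
did = diff_in_diff(0.060, 0.030, 0.020, 0.010)
assert did < naive_lift  # the DiD estimate strips out the selection bias
```

In the naive comparison above, half of the apparent lift comes from who was targeted rather than from the ad being shown, which is exactly the overestimation the paper quantifies at scale.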

Semantic web approaches in search

Compressed data structures for annotated web search BIBAFull-Text 121-130
  Soumen Chakrabarti; Sasidhar Kasturi; Bharath Balakrishnan; Ganesh Ramakrishnan; Rohit Saraf
Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.
The SemSets model for ad-hoc semantic list search BIBAFull-Text 131-140
  Marek Ciglan; Kjetil Nørvåg; Ladislav Hluchý
The amount of semantic data on the web has been growing rapidly in recent years. One of the key challenges triggered by this growth is the ad-hoc querying, i.e., the ability to retrieve answers from semantic resources using natural language queries. This facilitates interaction with semantic resources for the users so they can benefit from the knowledge covered by semantic data without the complexities of semantic query languages. In this paper, we focus on semantic queries, where the aim is to retrieve objects belonging to a set of semantically related entities. An example of such an ad-hoc type query is "Apollo astronauts who walked on the Moon". In order to address the task, we propose the SemSets retrieval model that exploits and combines traditional document-based information retrieval, link structure of the semantic data and entity membership in semantic sets, in order to provide the answers. The novelty of the approach lies in the utilization of semantic sets, i.e., groups of semantically related entities. We propose two approaches to identify such semantic sets from the knowledge bases; the first one requires involvement of an expert user knowledgeable of the data set structure, the second one is fully automatic and provides results that are comparable with those delivered by the expert users. As demonstrated in the experimental evaluation, the proposed model has the state-of-the-art performance on the SemSearch2011 data set, which has been designed especially for the semantic list search evaluation.
Heterogeneous web data search using relevance-based on the fly data integration BIBAFull-Text 141-150
  Daniel M. Herzig; Thanh Tran
Searching over heterogeneous structured data on the Web is challenging due to vocabulary and structure mismatches among different data sources. In this paper, we study two existing strategies and present a new approach to integrate additional data sources into the search process. The first strategy relies on data integration to mediate mismatches through upfront computation of mappings, based on which queries are rewritten to fit individual sources. The other extreme is keyword search, which does not require any up-front investment, but ignores structure information. Building on these strategies, we present a hybrid approach, which combines the advantages of both. Our approach does not require any upfront data integration, but also leverages the fine grained structure of the underlying data. For a structured query adhering to the vocabulary of just one source, the so-called seed query, we construct an entity relevance model (ERM), which captures the content and the structure of the seed query results. This ERM is then aligned on the fly with keyword search results retrieved from other sources and also used to rank these results. The outcome of our experiments using large-scale real-world data sets suggests that data integration leads to higher search effectiveness compared to keyword search and that our new hybrid approach consistently exceeds both strategies.

Web performance

TailGate: handling long-tail content with a little help from friends BIBAFull-Text 151-160
  Stefano Traverso; Kévin Huguenin; Ionut Triestan; Vijay Erramilli; Nikolaos Laoutaris; Kostantina Papagiannaki
Distributing long-tail content is an inherently difficult task due to the low amortization of bandwidth transfer costs, as such content has a limited number of views. Two recent trends are making this problem harder. First, the increasing popularity of user-generated content (UGC) and online social networks (OSNs) creates and reinforces such popularity distributions. Second, the recent trend of geo-replicating content across multiple PoPs spread around the world, done for improving quality of experience (QoE) for users and for redundancy reasons, can lead to unnecessary bandwidth costs. We build TailGate, a system that exploits social relationships, regularities in read access patterns, and time-zone differences to efficiently and selectively distribute long-tail content across PoPs. We evaluate TailGate using large traces from an OSN and show that it can decrease WAN bandwidth costs by as much as 80% as well as reduce latency, improving QoE. We deploy TailGate on PlanetLab and show that even in the case when imprecise social information is available, TailGate can still decrease the latency for accessing long-tail YouTube videos by a factor of 2.
DOHA: scalable real-time web applications through adaptive concurrent execution BIBAFull-Text 161-170
  Aiman Erbad; Norman C. Hutchinson; Charles Krasic
Browsers have become mature execution platforms enabling web applications to rival their desktop counterparts. An important class of such applications is interactive multimedia: games, animations, and interactive visualizations. Unlike many early web applications, these applications are latency sensitive and processing (CPU and graphics) intensive. When demands exceed available resources, application quality (e.g., frame rate) diminishes because it is hard to balance timeliness and utilization. The quality of ambitious web applications is also limited by single-threaded execution prevalent in the Web. Applications need to scale their quality, and thereby scale processing load, based on the resources that are available. We refer to this as scalable quality.
   DOHA is an execution layer written entirely in JavaScript to enable scalable quality in web applications. DOHA favors important computations with more influence over quality based on hints from application-specific adaptation policies. To utilize widely available multi-core resources, DOHA augments HTML5 web workers with mechanisms to facilitate state management and load-balancing. We evaluate DOHA with an award-winning web-based game. When resources are limited, the modified game has better timing and overall quality. More importantly, quality scales linearly with a small number of cores and the game is playable in challenging scenarios that are beyond the scope of the original game.
Surviving a search engine overload BIBAFull-Text 171-180
  Aaron Koehl; Haining Wang
Search engines are an essential component of the web, but their web crawling agents can impose a significant burden on heavily loaded web servers. Unfortunately, blocking or deferring web crawler requests is not a viable solution due to economic consequences. We conduct a quantitative measurement study on the impact and cost of web crawling agents, seeking optimization points for this class of request. Based on our measurements, we present a practical caching approach for mitigating search engine overload, and implement the two-level cache scheme on a very busy web server. Our experimental results show that the proposed caching framework can effectively reduce the impact of search engine overload on service quality.
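The caching idea in this abstract can be sketched as follows. The paper describes a two-level scheme; this single-level sketch, with invented names and an assumed TTL, only illustrates the key observation: crawler requests tolerate slightly stale content, so they can be served from cached renderings instead of triggering dynamic page generation.

```python
class CrawlerCache:
    """Serve crawler requests from cached renderings; fall back to
    dynamic generation on a miss or for human visitors."""

    def __init__(self, render, ttl=3600):
        self.render, self.ttl = render, ttl
        self.store = {}  # url -> (timestamp, html)

    def get(self, url, now, is_crawler):
        if is_crawler and url in self.store:
            ts, html = self.store[url]
            if now - ts < self.ttl:
                return html, True       # cache hit: no dynamic work done
        html = self.render(url)         # expensive dynamic generation
        self.store[url] = (now, html)
        return html, False

cache = CrawlerCache(render=lambda u: f"<html>{u}</html>")
page, hit = cache.get("/a", now=0, is_crawler=True)
assert not hit                          # first request renders and stores
page, hit = cache.get("/a", now=100, is_crawler=True)
assert hit and page == "<html>/a</html>"  # repeat crawl served from cache
```

Because crawler traffic is identifiable (user agent, IP) and heavily repetitive, even a coarse URL-keyed cache like this removes most of its load from the dynamic backend, which is the effect the paper measures.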

Fraud and bias in user ratings

Semi-supervised correction of biased comment ratings BIBAFull-Text 181-190
  Abhinav Mishra; Rajeev Rastogi
In many instances, offensive comments on the internet attract a disproportionate number of positive ratings from highly biased users. This results in an undesirable scenario where these offensive comments are the top rated ones. In this paper, we develop semi-supervised learning techniques to correct the bias in user ratings of comments. Our scheme uses a small number of comment labels in conjunction with user rating information to iteratively compute user bias and unbiased ratings for unlabeled comments. We show that the running time of each iteration is linear in the number of ratings, and the system converges to a unique fixed point. To select the comments to label, we devise an active learning algorithm based on empirical risk minimization. Our active learning method incrementally updates the risk for neighboring comments each time a comment is labeled, and thus can easily scale to large comment datasets. On real-life comments from Yahoo! News, our semi-supervised and active learning algorithms achieve higher accuracy than simple baselines, with few labeled examples.
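The alternating fixed-point computation described in this abstract can be sketched as follows: comment scores are the bias-corrected average of their user ratings, a user's bias is how far their ratings sit from current comment scores, and labeled comments keep their known scores fixed. This is an illustrative sketch, not the paper's algorithm; the names, scales, and toy data are assumptions.

```python
def correct_ratings(ratings, labels, iters=50):
    """ratings: list of (user, comment, value in [-1, 1]).
    labels: comment -> known true score, held fixed.
    Returns (user bias, corrected comment score) dictionaries."""
    users = {u for u, _, _ in ratings}
    comments = {c for _, c, _ in ratings}
    bias = {u: 0.0 for u in users}
    score = {c: labels.get(c, 0.0) for c in comments}
    for _ in range(iters):
        for c in comments:
            if c in labels:
                continue  # labeled comments anchor the system
            rs = [(u, v) for u, cc, v in ratings if cc == c]
            score[c] = sum(v - bias[u] for u, v in rs) / len(rs)
        for u in users:
            rs = [(c, v) for uu, c, v in ratings if uu == u]
            bias[u] = sum(v - score[c] for c, v in rs) / len(rs)
    return bias, score

# A "troll" upvotes everything including a comment labeled offensive (-1.0):
ratings = [("troll", "offensive", 1.0), ("troll", "ok", 1.0),
           ("fair", "ok", 1.0), ("fair", "offensive", -1.0)]
bias, score = correct_ratings(ratings, labels={"offensive": -1.0})
assert bias["troll"] > bias["fair"]  # the label exposes the troll's bias
```

Each iteration here is linear in the number of ratings once ratings are indexed by user and by comment, matching the complexity the abstract claims for the paper's scheme.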
Spotting fake reviewer groups in consumer reviews BIBAFull-Text 191-200
  Arjun Mukherjee; Bing Liu; Natalie Glance
Opinionated social media such as product reviews are now widely used by individuals and organizations for their decision making. However, driven by profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or demote target products. For reviews to reflect genuine user experiences and opinions, such spam reviews should be detected. Prior work on opinion spam focused on detecting fake reviews and individual fake reviewers. However, a fake reviewer group (a group of reviewers who work collaboratively to write fake reviews) is even more damaging, as its size can let it take total control of the sentiment on the target product. This paper studies spam detection in the collaborative setting, i.e., discovering fake reviewer groups. The proposed method first uses frequent itemset mining to find a set of candidate groups. It then uses several behavioral models derived from the collusion phenomenon among fake reviewers, and relation models based on the relationships among groups, individual reviewers, and the products they reviewed, to detect fake reviewer groups. Additionally, we built a labeled dataset of fake reviewer groups. Although labeling individual fake reviews and reviewers is very hard, to our surprise labeling fake reviewer groups is much easier. We also note that the proposed technique departs from the traditional supervised learning approach to spam detection, because the inherent nature of our problem makes classic supervised learning less effective. Experimental results show that the proposed method outperforms multiple strong baselines, including state-of-the-art supervised classification, regression, and learning-to-rank algorithms.
Estimating the prevalence of deception in online review communities BIBAFull-Text 201-210
  Myle Ott; Claire Cardie; Jeff Hancock
Consumers' purchase decisions are increasingly influenced by user-generated online reviews. Accordingly, there has been growing concern about the potential for posting deceptive opinion spam -- fictitious reviews that have been deliberately written to sound authentic, to deceive the reader. But while this practice has received considerable public attention and concern, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still about the factors that influence it.
   We propose a generative model of deception which, in conjunction with a deception classifier, we use to explore the prevalence of deception in six popular online review communities: Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, and Yelp. We additionally propose a theoretical model of online reviews based on economic signaling theory, in which consumer reviews diminish the inherent information asymmetry between consumers and producers, by acting as a signal to a product's true, unknown quality. We find that deceptive opinion spam is a growing problem overall, but with different growth rates across communities. These rates, we argue, are driven by the different signaling costs associated with deception for each review community, e.g., posting requirements. When measures are taken to increase signaling cost, e.g., filtering reviews written by first-time reviewers, deception prevalence is effectively reduced.

Mobile and file-sharing users

Musubi: disintermediated interactive social feeds for mobile devices BIBAFull-Text 211-220
  Ben Dodson; Ian Vo; T. J. Purtell; Aemon Cannon; Monica Lam
This paper presents Musubi, a mobile social application platform that enables users to share any data type in real-time feeds created by any application on the phone. Musubi is unique in providing a disintermediated service to end users; all communication is supported using public key encryption, thus leaking no user information to a third party. Despite the heavy use of cryptography to provide user authentication and access control, users found Musubi simple to use. We embed key exchange within familiar friending actions, and allow users to interact with any friend in their address books without requiring them to join a common network a priori. Our feed abstraction allows users to easily exercise access control. All data reside on the phone, granting users the freedom to use applications of their choice.
   In addition to disintermediating personal messaging, we have created an application platform to support multi-party software with the same respect for personal data. The SocialKit library we created on top of Musubi's trusted communication protocol facilitates the development of multi-party applications and integrates with Musubi to provide a compelling group application experience. SocialKit allows developers to make social, interactive, privacy-honoring applications without needing to host their own servers.
Economics of BitTorrent communities BIBAFull-Text 221-230
  Ian A. Kash; John K. Lai; Haoqi Zhang; Aviv Zohar
Over the years, private file-sharing communities built on the BitTorrent protocol have developed their own policies and mechanisms for motivating members to share content and contribute resources. By requiring members to maintain a minimum ratio between uploads and downloads, private communities effectively establish credit systems, and with them full-fledged economies. We report on a half-year-long measurement study of DIME -- a community for sharing live concert recordings -- that sheds light on the economic forces affecting users in such communities. A key observation is that while the download of files is priced only according to the size of the file, the rate of return for seeding new files is significantly greater than for seeding old files. We find via a natural experiment that users react to such differences in resale value by preferentially consuming older files during a 'free leech' period. We consider the implications of these findings for a user's ability to earn credits and meet ratio enforcement, focusing in particular on the relationship between visitation frequency and wealth, and on low-bandwidth users. We then share details from an interview with DIME moderators, which highlights the goals of the community and informs our suggestions for possible improvements.
A habit mining approach for discovering similar mobile users BIBAFull-Text 231-240
  Haiping Ma; Huanhuan Cao; Qiang Yang; Enhong Chen; Jilei Tian
Discovering users with similar habits plays an important role in a wide range of applications, such as collaborative filtering for recommendation and user segmentation for market analysis. Recently, the growing ability of smart mobile devices to sense user context has made it possible to discover mobile users with similar habits by mining those habits from their devices. However, though researchers have proposed effective methods for mining user habits, such as behavior pattern mining, how to leverage the mined results for discovering similar users remains less explored. To this end, we propose a novel approach for overcoming the sparseness of the behavior pattern space, making it possible to discover mobile users with similar habits by leveraging behavior pattern mining. To be specific, we first normalize the raw context log of each user by transforming the location-based context data and user interaction records into more general representations. Second, we take advantage of a constraint-based Bayesian matrix factorization model to extract the latent common habits among behavior patterns, transforming behavior pattern vectors into vectors of mined common habits that lie in a much denser space. Experiments conducted on real datasets show that our approach outperforms three baselines in the effectiveness of discovering mobile users with similar habits.

Behavioral analysis and content characterization in social media

YouTube around the world: geographic popularity of videos BIBAFull-Text 241-250
  Anders Brodersen; Salvatore Scellato; Mirjam Wattenhofer
One of the most popular user activities on the Web is watching videos. Services like YouTube, Vimeo, and Hulu host and stream millions of videos, providing content that is on par with TV. While some of this content is popular all over the globe, some videos might be only watched in a confined, local region.
   In this work we study the relationship between popularity and locality of online YouTube videos. We investigate whether YouTube videos exhibit geographic locality of interest, with views arising from a confined spatial area rather than from a global one. Our analysis is done on a corpus of more than 20 million YouTube videos, uploaded over one year from different regions. We find that about 50% of the videos have more than 70% of their views in a single region. By relating locality to viralness we show that social sharing generally widens the geographic reach of a video. If, however, a video cannot carry its social impulse over to other means of discovery, it gets stuck in a more confined geographic region. Finally, we analyze how the geographic properties of a video's views evolve on a daily basis during its lifetime, providing new insights on how the geographic reach of a video changes as its popularity peaks and then fades away.
   Our results demonstrate how, despite the global nature of the Web, online video consumption appears constrained by geographic locality of interest: this has a potential impact on a wide range of systems and applications, spanning from delivery networks to recommendation and discovery engines, providing new directions for future research.
Dynamical classes of collective attention in twitter BIBAFull-Text 251-260
  Janette Lehmann; Bruno Gonçalves; José J. Ramasco; Ciro Cattuto
Micro-blogging systems such as Twitter expose digital traces of social discourse with an unprecedented degree of resolution of individual behaviors. They offer an opportunity to investigate how a large-scale social system responds to exogenous or endogenous stimuli, and to disentangle the temporal, spatial and topical aspects of users' activity. Here we focus on spikes of collective attention in Twitter, and specifically on peaks in the popularity of hashtags. Users employ hashtags as a form of social annotation, to define a shared context for a specific event, topic, or meme. We analyze a large-scale record of Twitter activity and find that the evolution of hashtag popularity over time defines discrete classes of hashtags. We link these dynamical classes to the events the hashtags represent and use text mining techniques to provide a semantic characterization of the hashtag classes. Moreover, we track the propagation of hashtags in the Twitter social network and find that epidemic spreading plays a minor role in hashtag popularity, which is mostly driven by exogenous factors.
We know what @you #tag: does the dual role affect hashtag adoption? BIBAFull-Text 261-270
  Lei Yang; Tao Sun; Ming Zhang; Qiaozhu Mei
Researchers and social observers alike believe that hashtags, as a new type of organizational object for information, play a dual role in online microblogging communities (e.g., Twitter). On one hand, a hashtag serves as a bookmark of content, which links tweets with similar topics; on the other hand, a hashtag serves as the symbol of a community membership, which bridges a virtual community of users. Are real users aware of this dual role of hashtags? Does the dual role affect their adoption of a hashtag? Is hashtag adoption predictable? We take the initiative to investigate and quantify the effects of the dual role on hashtag adoption. We propose comprehensive measures to quantify the major factors of how a user selects content tags as well as joins communities. Experiments using large-scale Twitter datasets demonstrate the effect of the dual role: both the content measures and the community measures correlate significantly with hashtag adoption on Twitter. With these measures as features, a machine learning model can effectively predict a user's future adoption of hashtags they have never used before.

Creating and using links between data objects

Factorizing YAGO: scalable machine learning for linked data BIBAFull-Text 271-280
  Maximilian Nickel; Volker Tresp; Hans-Peter Kriegel
Vast amounts of structured information have been published in the Semantic Web's Linked Open Data (LOD) cloud and their size is still growing rapidly. Yet, access to this information via reasoning and querying is sometimes difficult, due to LOD's size, partial data inconsistencies and inherent noisiness. Machine Learning offers an alternative approach to exploiting LOD's data with the advantages that Machine Learning algorithms are typically robust to both noise and data inconsistencies and are able to efficiently utilize non-deterministic dependencies in the data. From a Machine Learning point of view, LOD is challenging due to its relational nature and its scale. Here, we present an efficient approach to relational learning on LOD data, based on the factorization of a sparse tensor that scales to data consisting of millions of entities, hundreds of relations and billions of known facts. Furthermore, we show how ontological knowledge can be incorporated in the factorization to improve learning results and how computation can be distributed across multiple nodes. We demonstrate that our approach is able to factorize the YAGO 2 core ontology and globally predict statements for this large knowledge base using a single dual-core desktop computer. Furthermore, we show experimentally that our approach achieves good results in several relational learning tasks that are relevant to Linked Data. Once a factorization has been computed, our model is able to predict efficiently, and without any additional training, the likelihood of any of the 4.3 * 10^14 possible triples in the YAGO 2 core ontology.
Semantic navigation on the web of data: specification of routes, web fragments and actions BIBAFull-Text 281-290
  Valeria Fionda; Claudio Gutierrez; Giuseppe Pirró
The massive semantic data sources linked in the Web of Data give new meaning to old features like navigation; introduce new challenges like semantic specification of Web fragments; and make it possible to specify actions relying on semantic data. In this paper we introduce a declarative language to face these challenges. Based on navigational features, it is designed to specify fragments of the Web of Data and actions to be performed based on these data. We implement it in a centralized fashion, and show its power and performance. Finally, we explore the same ideas in a distributed setting, showing their feasibility, potentialities and challenges.
Understanding web images by object relation network BIBAFull-Text 291-300
  Na Chen; Qian-Yi Zhou; Viktor Prasanna
This paper presents an automatic method for understanding and interpreting the semantics of unannotated web images. We observe that the relations between objects in an image carry important semantics about the image. To capture and describe such semantics, we propose Object Relation Network (ORN), a graph model representing the most probable meaning of the objects and their relations in an image. Guided and constrained by an ontology, ORN transfers the rich semantics in the ontology to image objects and the relations between them, while maintaining semantic consistency (e.g., a soccer player can kick a soccer ball, but cannot ride it). We present an automatic system which takes a raw image as input and creates an ORN based on image visual appearance and the guide ontology. We demonstrate various useful web applications enabled by ORNs, such as automatic image tagging, automatic image description generation, and image search by image.

Security 1

Investigating the distribution of password choices BIBAFull-Text 301-310
  David Malone; Kevin Maher
The distribution of passwords chosen by users has implications for site security, password-handling algorithms and even how users are permitted to select passwords. Using password lists from four different web sites, we investigate if Zipf's law is a good description of the frequency with which passwords are chosen. We use a number of standard statistics, which measure the security of password distributions, to see if modelling the data using a simple distribution is effective. We then consider how much the password distributions from each site have in common, using password cracking as a metric. This shows that these distributions have enough high-frequency passwords in common to provide effective speed-ups for cracking passwords. Finally, as an alternative to a deterministic banned list, we will show how to stochastically shape the distribution of passwords, by occasionally asking users to choose a different password.
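Testing whether password frequencies follow Zipf's law typically begins with a rank-frequency fit. The following is a minimal sketch of such a fit, not the statistical machinery used in the paper: it estimates the exponent s in frequency ∝ rank^(-s) by least squares in log-log space over a list of password frequency counts:

```python
import math

def zipf_exponent(counts):
    """Estimate the Zipf exponent s from password frequency counts
    via a log-log least-squares fit of frequency against rank."""
    freqs = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Ordinary least-squares slope of log(freq) vs. log(rank).
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope  # frequency ~ rank^(-s)
```

On synthetic counts generated from an exact Zipf distribution with s = 1, the fit recovers the exponent; on real password lists, goodness-of-fit diagnostics beyond the slope would be needed, as the paper discusses.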
Is this app safe?: a large scale study on application permissions and risk signals BIBAFull-Text 311-320
  Pern Hui Chia; Yusuke Yamamoto; N. Asokan
Third-party applications (apps) drive the attractiveness of web and mobile application platforms. Many of these platforms adopt a decentralized control strategy, relying on explicit user consent for granting the permissions that apps request. Users have to rely primarily on community ratings as signals to identify potentially harmful and inappropriate apps, even though community ratings typically reflect opinions about perceived functionality or performance rather than about risks. With the arrival of HTML5 web apps, such user-consent permission systems will become more widespread. We study the effectiveness of user-consent permission systems through a large-scale data collection of Facebook apps, Chrome extensions and Android apps. Our analysis confirms that the current forms of community ratings used in app markets today are not reliable indicators of the privacy risks of an app. We find some evidence of attempts to mislead or entice users into granting permissions: free applications and applications with mature content request more permissions than is typical, and 'look-alike' applications, which have names similar to popular applications, also request more permissions than is typical. We also find that across all three platforms popular applications request more permissions than average.
SessionJuggler: secure web login from an untrusted terminal using session hijacking BIBAFull-Text 321-330
  Elie Bursztein; Chinmay Soman; Dan Boneh; John C. Mitchell
We use modern features of web browsers to develop a secure login system from an untrusted terminal. The system, called Session Juggler, requires no server-side changes and no special software on the terminal beyond a modern web browser. This important property makes adoption much easier than with previous proposals. With Session Juggler users never enter their long term credential on the untrusted terminal. Instead, users log in to a web site using a smartphone app and then transfer the entire session, including cookies and all other session state, to the untrusted terminal. We show that Session Juggler works on all the Alexa top 100 sites except eight. Of those eight, five failures were due to the site enforcing IP session binding. We also show that Session Juggler works flawlessly with Facebook connect. Beyond login, Session Juggler also provides a secure logout mechanism where the trusted phone is used to kill the session. To validate the session juggling concept we conducted a number of web site surveys that are of independent interest. First, we survey how web sites bind a session token to a specific device and show that most use fairly basic techniques that are easily defeated. Second, we survey how web sites handle logout and show that many popular sites surprisingly do not properly handle logout requests.

Community detection in social networks

Using content and interactions for discovering communities in social networks BIBAFull-Text 331-340
  Mrinmaya Sachan; Danish Contractor; Tanveer A. Faruquie; L. Venkata Subramaniam
In recent years, social networking sites have not only enabled people to connect with each other using social links but have also allowed them to share, communicate and interact across diverse geographical regions. Social networks provide a rich source of heterogeneous data which can be exploited to discover previously unknown relationships and interests among groups of people. In this paper, we address the problem of discovering topically meaningful communities in a social network. We assume that a person's membership in a community is conditioned on their social relationships, the type of interaction, and the information communicated with other members of that community. We propose generative models that can discover communities based on the discussed topics, interaction types and the social connections among people. In our models a person can belong to multiple communities and a community can participate in multiple topics. This allows us to discover both community interests and user interests based on the information and linked associations. We demonstrate the effectiveness of our model on two real-world datasets and show that it performs better than existing community discovery models.
Community detection in incomplete information networks BIBAFull-Text 341-350
  Wangqun Lin; Xiangnan Kong; Philip S. Yu; Quanyuan Wu; Yan Jia; Chuan Li
With the recent advances in information networks, the problem of community detection has attracted much attention in the last decade. While work on network community detection has been ubiquitous, the task of collecting complete network data remains challenging in many real-world applications. Usually the collected network is incomplete, with most of the edges missing. Commonly, in such networks, all nodes with attributes are available, while only the edges within a few local regions of the network can be observed. In this paper, we study the problem of detecting communities in incomplete information networks with missing edges. We first learn a distance metric to reproduce the link-based distance between nodes from the observed edges in the local information regions. We then use the learned distance metric to estimate the distance between any pair of nodes in the network. A hierarchical clustering approach is proposed to detect communities within the incomplete information networks. Empirical studies on real-world information networks demonstrate that our proposed method can effectively detect community structures within incomplete information networks.
QUBE: a quick algorithm for updating betweenness centrality BIBAFull-Text 351-360
  Min-Joong Lee; Jungmin Lee; Jaimie Yejean Park; Ryan Hyun Choi; Chin-Wan Chung
The betweenness centrality of a vertex in a graph measures the participation of the vertex in the shortest paths of the graph. Betweenness centrality is widely used in network analyses. Especially in a social network, the betweenness centralities of vertices are recomputed repeatedly for community detection and for finding influential users in the network. Since a social network graph is frequently updated, it is necessary to update the betweenness centrality efficiently; when a graph changes, the betweenness centralities of all the vertices would otherwise be recomputed from scratch using all the vertices in the graph. To the best of our knowledge, this is the first work to propose an efficient algorithm for updating the betweenness centralities of vertices in a graph. In this paper, we propose a method that efficiently reduces the search space by finding a candidate set of vertices whose betweenness centralities can change, and computes their betweenness centralities using candidate vertices only. As the cost of calculating betweenness centrality mainly depends on the number of vertices considered, the proposed algorithm significantly reduces the cost of calculation. The proposed algorithm can also be layered on top of an existing algorithm that does not consider graph updates. Experimental results on large real datasets show that the proposed algorithm speeds up the existing algorithm by 2 to 2418 times, depending on the dataset.

Advertising on the web 2

On revenue in the generalized second price auction BIBAFull-Text 361-370
  Brendan Lucier; Renato Paes Leme; Eva Tardos
The Generalized Second Price (GSP) auction is the primary auction used for selling sponsored search advertisements. In this paper we consider the revenue of this auction at equilibrium. We prove that if agent values are drawn from identical regular distributions, then the GSP auction paired with an appropriate reserve price generates a constant fraction (1/6th) of the optimal revenue. In the full-information game, we show that any Nash equilibrium of the GSP auction obtains at least half of the revenue of the VCG mechanism, excluding the payment of a single participant. This bound also holds with any reserve price, and is tight.
   Finally, we consider the tradeoff between maximizing revenue and social welfare. We introduce a natural convexity assumption on the click-through rates and show that it implies that the revenue-maximizing equilibrium of GSP in the full information model will necessarily be envy-free. In particular, it is always possible to maximize revenue and social welfare simultaneously when click-through rates are convex. Without this convexity assumption, however, we demonstrate that revenue may be maximized at a non-envy-free equilibrium that generates a socially inefficient allocation.
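The GSP-versus-VCG revenue comparison discussed above can be made concrete with a small per-impression revenue calculation. The sketch below uses the standard textbook payment rules for a slot auction with click-through rates; it is illustrative only and not tied to the paper's equilibrium analysis:

```python
def gsp_revenue(bids, ctrs):
    """Per-impression GSP revenue: the winner of slot i pays the
    next-highest bid per click, so slot i earns ctr_i * b_{i+1}."""
    b = sorted(bids, reverse=True)
    return sum(ctr * b[i + 1] for i, ctr in enumerate(ctrs) if i + 1 < len(b))

def vcg_revenue(bids, ctrs):
    """Per-impression VCG revenue: each winner pays the externality
    it imposes on the bidders ranked below it."""
    b = sorted(bids, reverse=True)
    ctr = list(ctrs) + [0.0]  # clicks beyond the last slot are zero
    rev = 0.0
    for i in range(min(len(ctrs), len(b))):
        rev += sum(b[k] * (ctr[k - 1] - ctr[k])
                   for k in range(i + 1, min(len(b), len(ctr))))
    return rev
```

For bids (10, 6, 2) and click-through rates (1.0, 0.5), GSP collects 7.0 per impression while VCG collects 5.0, illustrating how the two mechanisms' revenues can differ at truthful bids.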
Handling forecast errors while bidding for display advertising BIBAFull-Text 371-380
  Kevin J. Lang; Benjamin Moseley; Sergei Vassilvitskii
Most of the online advertising today is sold via an auction, which requires the advertiser to respond with a valid bid within a fraction of a second. As such, most advertisers employ bidding agents to submit bids on their behalf. The architecture of such agents typically has (1) an offline optimization phase which incorporates the bidder's knowledge about the market and (2) an online bidding strategy which simply executes the offline strategy. The online strategy is typically highly dependent on both supply and expected price distributions, both of which are forecast using traditional machine learning methods. In this work we investigate the optimum strategy of the bidding agent when faced with incorrect forecasts. At a high level, the agent can invest resources in improving the forecasts, or can tighten the loop between successive offline optimization cycles in order to detect errors more quickly. We show analytically that the latter strategy, while simple, is extremely effective in dealing with forecast errors, and confirm this finding with experimental evaluations.
Optimizing budget allocation among channels and influencers BIBAFull-Text 381-388
  Noga Alon; Iftah Gamzu; Moshe Tennenholtz
Brands and agencies use marketing as a tool to influence customers. One of the major decisions in a marketing plan deals with the allocation of a given budget among media channels in order to maximize the impact on a set of potential customers. A similar situation occurs in a social network, where a marketing budget needs to be distributed among a set of potential influencers in a way that provides high impact.
   We introduce several probabilistic models to capture the above scenarios. The common setting of these models consists of a bipartite graph of source and target nodes. The objective is to allocate a fixed budget among the source nodes to maximize the expected number of influenced target nodes. The concrete way in which source nodes influence target nodes depends on the underlying model. We primarily consider two models: a source-side influence model, in which a source node that is allocated a budget of k makes k independent trials to influence each of its neighboring target nodes, and a target-side influence model, in which a target node becomes influenced according to a specified rule that depends on the overall budget allocated to its neighbors. Our main results are an optimal (1-1/e)-approximation algorithm for the source-side model, and several inapproximability results for the target-side model, establishing that influence maximization in the latter model is provably harder.
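The source-side model above is amenable to the classic greedy algorithm for monotone submodular maximization, which is the standard route to a (1-1/e) guarantee. The unoptimized sketch below (data layout and names are hypothetical, and the paper's actual algorithm may differ) allocates one budget unit at a time to the source with the largest marginal gain in expected influenced targets:

```python
def greedy_allocation(edges, probs, budget):
    """edges: {source: [neighboring targets]}.
    probs: {source: per-trial success probability}.
    A source with budget k makes k independent trials per neighbor;
    a target is influenced if any trial on it succeeds."""
    alloc = {s: 0 for s in edges}
    targets = {t for ts in edges.values() for t in ts}

    def expected(alloc):
        # E[# influenced] = sum over targets of 1 - P(all trials fail).
        total = 0.0
        for t in targets:
            fail = 1.0
            for s, ts in edges.items():
                if t in ts:
                    fail *= (1.0 - probs[s]) ** alloc[s]
            total += 1.0 - fail
        return total

    for _ in range(budget):
        base = expected(alloc)
        best, gain = None, -1.0
        for s in edges:  # pick the unit with the largest marginal gain
            alloc[s] += 1
            g = expected(alloc) - base
            alloc[s] -= 1
            if g > gain:
                best, gain = s, g
        alloc[best] += 1
    return alloc
```

Because the objective is monotone and submodular in the allocation, this greedy rule is exactly the kind of procedure for which the (1-1/e) bound cited in the abstract applies.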

Structured query suggestion for specialization and parallel movement: effect on search behaviors BIBAFull-Text 389-398
  Makoto P. Kato; Tetsuya Sakai; Katsumi Tanaka
Query suggestion, which enables the user to revise a query with a single click, has become one of the most fundamental features of Web search engines. However, it is often difficult for the user to choose from a list of query suggestions, and to understand the relation between an input query and suggested ones. In this paper, we propose a new method to present query suggestions to the user, which has been designed to help two popular query reformulation actions, namely, specialization (e.g. from "nikon" to "nikon camera") and parallel movement (e.g. from "nikon camera" to "canon camera"). Using a query log collected from a popular commercial Web search engine, our prototype called SParQS classifies query suggestions into automatically generated categories and generates a label for each category. Moreover, SParQS presents some new entities as alternatives to the original query (e.g. "canon" in response to the query "nikon"), together with their query suggestions classified in the same way as the original query's suggestions. We conducted a task-based user study to compare SParQS with a traditional "flat list" query suggestion interface. Our results show that the SParQS interface enables subjects to search more successfully than the flat list case, even though query suggestions presented were exactly the same in the two interfaces. In addition, the subjects found the query suggestions more helpful when they were presented in the SParQS interface rather than in a flat list.
Partitioned multi-indexing: bringing order to social search BIBAFull-Text 399-408
  Bahman Bahmani; Ashish Goel
To answer search queries on a social network rich with user-generated content, it is desirable to give a higher ranking to content that is closer to the individual issuing the query. Queries occur at nodes in the network, documents are also created by nodes in the same network, and the goal is to find the document that matches the query and is closest in network distance to the node issuing the query. In this paper, we present the "Partitioned Multi-Indexing" scheme, which provides an approximate solution to this problem. With m links in the network, after an offline O(m) pre-processing time, our scheme allows for social index operations (i.e., social search queries, as well as insertion and deletion of words into and from a document at any node), all in time O(1). Further, our scheme can be implemented on open source distributed streaming systems such as Yahoo! S4 or Twitter's Storm so that every social index operation takes O(1) processing time and network queries in the worst case, and just two network queries in the common case where the reverse index corresponding to the query keyword is much smaller than the memory available at any distributed compute node. Building on Das Sarma et al.'s approximate distance oracle, the worst-case approximation ratio of our scheme is O(1) for undirected networks. Our simulations on the social network Twitter as well as synthetic networks show that in practice, the approximation ratio is actually close to 1 for both directed and undirected networks. We believe that this work is the first demonstration of the feasibility of social search with real-time text updates at large scales.
Unsupervised extraction of template structure in web search queries BIBAFull-Text 409-418
  Sandeep Pandey; Kunal Punera
Web search queries are an encoding of the user's search intent and extracting structured information from them can facilitate central search engine operations like improving the ranking of search results and advertisements. Not surprisingly, this area has attracted a lot of attention in the research community in the last few years. The problem is, however, made challenging by the fact that search queries tend to be extremely succinct; a condensation of user search needs to the bare-minimum set of keywords. In this paper we consider the problem of extracting, with no manual intervention, the hidden structure behind the observed search queries in a domain: the origins of the constituent keywords as well as the manner in which the individual keywords are assembled. We formalize important properties of the problem and then give a principled solution based on generative models that satisfies these properties. Using manually labeled data we show that the query templates extracted by our solution are superior to those discovered by strong baseline methods.
   The query templates extracted by our approach have potential uses in many search engine tasks; query answering, advertisement matching and targeting, to name a few. In this paper we study one such task, estimating Query-Advertisability, and empirically demonstrate that using extracted template information can improve performance over and above the current state-of-the-art.

Obtaining and leveraging user comments

Multi-objective ranking of comments on web BIBAFull-Text 419-428
  Onkar Dalal; Srinivasan H. Sengamedu; Subhajit Sanyal
With the explosion of information on any topic, the need for ranking has become critical. Ranking typically depends on several aspects: products, for example, have aspects like price, recency, rating, etc. Product ranking has to bring forward the "best" product, one that is recent and highly rated; hence ranking has to satisfy multiple objectives. In this paper, we explore multi-objective ranking of comments using Hodge decomposition. While Hodge decomposition produces a globally consistent ranking, a globally inconsistent component is also present. We propose an active learning strategy for reducing this component. Finally, we develop techniques for online Hodge decomposition. We experimentally validate the ideas presented in this paper.
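The globally consistent ranking the abstract mentions is the gradient (least-squares potential) component of the Hodge decomposition of the pairwise comparison data; the residual of this fit is the inconsistent component. A minimal numerical sketch of that component, not the authors' implementation (the function name and margin convention are assumptions):

```python
import numpy as np

def hodge_rank(n, comparisons):
    """Least-squares global scores (the gradient component of the Hodge
    decomposition) from pairwise comparisons.

    comparisons: list of (i, j, y) meaning item j beats item i by margin y,
    i.e. we seek scores s with s[j] - s[i] ~= y."""
    rows, rhs = [], []
    for i, j, y in comparisons:
        row = np.zeros(n)
        row[j], row[i] = 1.0, -1.0
        rows.append(row)
        rhs.append(y)
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return s - s.mean()          # fix the additive gauge freedom

# A consistent set of comparisons: scores come out as 0, 1, 2 up to a shift
scores = hodge_rank(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.0)])
print(scores.argsort())          # ranking from worst to best: [0 1 2]
```

The size of the residual of this least-squares fit measures how far the comparisons are from admitting any globally consistent ranking.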
Care to comment?: recommendations for commenting on news stories BIBAFull-Text 429-438
  Erez Shmueli; Amit Kagian; Yehuda Koren; Ronny Lempel
Many websites provide commenting facilities for users to express their opinions or sentiments with regard to content items such as videos, news stories, and blog posts. Previous studies have shown that user comments contain valuable information that can provide insight on Web documents and may be utilized for various tasks. This work presents a model that predicts, for a given user, suitable news stories for commenting. The model achieves encouraging results regarding the ability to connect users with stories they are likely to comment on. This provides grounds for personalized recommendations of stories to users who may want to take part in their discussion. We combine a content-based approach with a collaborative-filtering approach (utilizing users' co-commenting patterns) in a latent factor modeling framework. We experiment with several variations of the model's loss function in order to adjust it to the problem domain. We evaluate the results on two datasets and show that employing co-commenting patterns improves upon using content features alone, even with as few as two available comments per story. Finally, we try to incorporate available social network data into the model. Interestingly, the social data does not lead to substantial performance gains, suggesting that the value of social data for this task is quite negligible.
Leveraging user comments for aesthetic aware image search reranking BIBAFull-Text 439-448
  Jose San Pedro; Tom Yeh; Nuria Oliver
The increasing number of images available online has created a growing need for efficient ways to search for relevant content. Text-based query search is the most common approach to retrieve images from the Web. In this approach, the similarity between the input query and the metadata of images is used to find relevant information. However, as the amount of available images grows, the number of relevant images also increases, all of them sharing very similar metadata but differing in other visual characteristics. This paper studies the influence of visual aesthetic quality in search results as a complementary attribute to relevance. By considering aesthetics, a new ranking parameter is introduced aimed at improving the quality at the top ranks when large amounts of relevant results exist. Two strategies for aesthetic rating inference are proposed: one based on visual content, another based on the analysis of user comments to detect opinions about the quality of images. The results of a user study with 58 participants show that the comment-based aesthetic predictor outperforms the visual content-based strategy, and reveals that aesthetic-aware rankings are preferred by users searching for photographs on the Web.

Entity linking

LINDEN: linking named entities with knowledge base via semantic knowledge BIBAFull-Text 449-458
  Wei Shen; Jianyong Wang; Ping Luo; Min Wang
Integrating extracted facts with an existing knowledge base has raised an urgent need to address the problem of entity linking. Specifically, entity linking is the task of linking an entity mention in text with the corresponding real-world entity in the existing knowledge base. However, this task is challenging due to name ambiguity, textual inconsistency, and lack of world knowledge in the knowledge base. Several methods have been proposed to tackle this problem, but they are largely based on the co-occurrence statistics of terms between the text around the entity mention and the document associated with the entity. In this paper, we propose LINDEN, a novel framework to link named entities in text with a knowledge base unifying Wikipedia and WordNet, by leveraging the rich semantic knowledge embedded in Wikipedia and the taxonomy of the knowledge base. We extensively evaluate the performance of our proposed LINDEN over two public data sets and empirical results show that LINDEN significantly outperforms the state-of-the-art methods in terms of accuracy.
Cross-lingual knowledge linking across wiki knowledge bases BIBAFull-Text 459-468
  Zhichun Wang; Juanzi Li; Zhigang Wang; Jie Tang
Wikipedia has become one of the largest knowledge bases on the Web, attracting 513 million page views per day in January 2012. However, one critical issue for Wikipedia is that articles in different languages are very unbalanced. For example, the number of articles on Wikipedia in English has reached 3.8 million, while the number of Chinese articles is still less than half a million, and there are only 217 thousand cross-lingual links between articles of the two languages. On the other hand, there are more than 3.9 million Chinese Wiki articles on Baidu Baike and Hudong.com, two popular encyclopedias in Chinese. One important question is how to link the knowledge entries distributed in different knowledge bases. This will immensely enrich the information in the online knowledge bases and benefit many applications. In this paper, we study the problem of cross-lingual knowledge linking and present a linkage factor graph model. Features are defined according to some interesting observations. Experiments on the Wikipedia data set show that our approach can achieve a high precision of 85.8% with a recall of 88.1%. The approach found 202,141 new cross-lingual links between English Wikipedia and Baidu Baike.
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking BIBAFull-Text 469-478
  Gianluca Demartini; Djellel Eddine Difallah; Philippe Cudré-Mauroux
We tackle the problem of entity linking for large collections of online pages. Our system, ZenCrowd, identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the Linked Open Data cloud. We show how one can take advantage of human intelligence to improve the quality of the links by dynamically generating micro-tasks on an online crowdsourcing platform. We develop a probabilistic framework to make sensible decisions about candidate links and to identify unreliable human workers. We evaluate ZenCrowd in a real deployment and show how a combination of both probabilistic reasoning and crowdsourcing techniques can significantly improve the quality of the links, while limiting the amount of work performed by the crowd.

Search: evaluation and ranking

A flexible generative model for preference aggregation BIBAFull-Text 479-488
  Maksims N. Volkovs; Richard S. Zemel
Many areas of study, such as information retrieval, collaborative filtering, and social choice, face the preference aggregation problem, in which multiple preferences over objects must be combined into a consensus ranking. Preferences over items can be expressed in a variety of forms, which makes the aggregation problem difficult. In this work we formulate a flexible probabilistic model over pairwise comparisons that can accommodate all these forms. Inference in the model is very fast, making it applicable to problems with hundreds of thousands of preferences. Experiments on benchmark datasets demonstrate superior performance to existing methods.
Evaluating the effectiveness of search task trails BIBAFull-Text 489-498
  Zhen Liao; Yang Song; Li-wei He; Yalou Huang
In this paper, we introduce "task trail" as a new concept to understand user search behaviors. We define task to be an atomic user information need. Web search logs have been studied mainly at session or query level where users may submit several queries within one task and handle several tasks within one session. Although previous studies have addressed the problem of task identification, little is known about the advantage of using task over session and query for search applications. In this paper, we conduct extensive analyses and comparisons to evaluate the effectiveness of task trails in three search applications: determining user satisfaction, predicting user search interests, and query suggestion. Experiments are conducted on large scale datasets from a commercial search engine. Experimental results show that: (1) Sessions and queries are not as precise as tasks in determining user satisfaction. (2) Task trails provide higher web page utilities to users than other sources. (3) Tasks represent atomic user information needs, and therefore can preserve topic similarity between query pairs. (4) Task-based query suggestion can provide complementary results to other models. The findings in this paper verify the need to extract task trails from web search logs and suggest potential applications in search and recommendation systems.
Evaluation with informational and navigational intents BIBAFull-Text 499-508
  Tetsuya Sakai
Given an ambiguous or underspecified query, search result diversification aims at accommodating different user intents within a single "entry-point" result page. However, some intents are informational, for which many relevant pages may help, while others are navigational, for which only one web page is required. We propose new evaluation metrics for search result diversification that consider this distinction, as well as a simple method for comparing the intuitiveness of a given pair of metrics quantitatively. Our main experimental findings are: (a) In terms of discriminative power which reflects statistical reliability, the proposed metrics, DIN#-nDCG and P+Q#, are comparable to intent recall and D#-nDCG, and possibly superior to α-nDCG; (b) In terms of preference agreement with intent recall, P+Q# is superior to other diversity metrics and therefore may be the most intuitive as a metric that emphasises diversity; and (c) In terms of preference agreement with effective precision, DIN#-nDCG is superior to other diversity metrics and therefore may be the most intuitive as a metric that emphasises relevance. Moreover, DIN#-nDCG may be the most intuitive as a metric that considers both diversity and relevance. In addition, we demonstrate that the randomised Tukey's Honestly Significant Differences test that takes the entire set of available runs into account is substantially more conservative than the paired bootstrap test that only considers one run pair at a time, and therefore recommend the former approach for significance testing when a set of runs is available for evaluation.

Information diffusion in social networks

Information transfer in social media BIBAFull-Text 509-518
  Greg Ver Steeg; Aram Galstyan
Recent research has explored the increasingly important role of social media by examining the dynamics of individual and group behavior, characterizing patterns of information diffusion, and identifying influential individuals. In this paper we suggest a measure of causal relationships between nodes based on the information-theoretic notion of transfer entropy, or information transfer. This theoretically grounded measure is based on dynamic information, captures fine-grain notions of influence, and admits a natural, predictive interpretation. Networks inferred by transfer entropy can differ significantly from static friendship networks because most friendship links are not useful for predicting future dynamics. We demonstrate through analysis of synthetic and real-world data that transfer entropy reveals meaningful hidden network structures. In addition to altering our notion of who is influential, transfer entropy allows us to differentiate between weak influence over large groups and strong influence over small groups.
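Transfer entropy from Y to X quantifies how much the past of Y improves prediction of X's next state beyond what X's own past already provides. A minimal plug-in estimator for discrete series with history length 1, as an illustrative sketch rather than the authors' estimator:

```python
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Transfer entropy T_{Y->X} in bits for two equal-length discrete
    series, history length 1: extra predictability of x_{t+1} from y_t
    beyond x_t alone."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{t+1}, x_t)
    singles = Counter(x[:-1])                       # x_t
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_cond_xy = c / pairs_xy[(x0, y0)]          # p(x_{t+1} | x_t, y_t)
        p_cond_x = pairs_xx[(x1, x0)] / singles[x0] # p(x_{t+1} | x_t)
        te += p_joint * log2(p_cond_xy / p_cond_x)
    return te

# y perfectly predicts x one step ahead, so T_{Y->X} is close to 1 bit,
# while a constant y transfers nothing.
y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
x = [0] + y[:-1]                                    # x copies y with lag 1
print(round(transfer_entropy(x, y), 3))
print(transfer_entropy(x, [0] * len(x)))
```

Estimating the directed link strength this way for every node pair is what lets a transfer-entropy network diverge from the static friendship network.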
The role of social networks in information diffusion BIBAFull-Text 519-528
  Eytan Bakshy; Itamar Rosenn; Cameron Marlow; Lada Adamic
Online social networking technologies enable individuals to simultaneously share information with any number of peers. Quantifying the causal effect of these media on the dissemination of information requires not only identification of who influences whom, but also of whether individuals would still propagate information in the absence of social signals about that information. We examine the role of social networks in online information diffusion with a large-scale field experiment that randomizes exposure to signals about friends' information sharing among 253 million subjects in situ. Those who are exposed are significantly more likely to spread information, and do so sooner than those who are not exposed. We further examine the relative role of strong and weak ties in information propagation. We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information. This suggests that weak ties may play a more dominant role in the dissemination of information online than currently believed.
Recommendations to boost content spread in social networks BIBAFull-Text 529-538
  Vineet Chaoji; Sayan Ranu; Rajeev Rastogi; Rushi Bhatt
Content sharing in social networks is a powerful mechanism for discovering content on the Internet. The degree to which content is disseminated within the network depends on the connectivity relationships among network nodes. Existing schemes for recommending connections in social networks are based on the number of common neighbors, similarity of user profiles, etc. However, such similarity-based connections do not consider the amount of content discovered.
   In this paper, we propose novel algorithms for recommending connections that boost content propagation in a social network without compromising on the relevance of the recommendations. Unlike existing work on influence propagation, in our environment, we are looking for edges instead of nodes, with a bound on the number of incident edges per node. We show that the content spread function is not submodular, and develop approximation algorithms for computing a near-optimal set of edges. Through experiments on real-world social graphs such as Flickr and Twitter, we show that our approximation algorithms achieve content spreads that are as much as 90 times higher compared to existing heuristics for recommending connections.

Game theory and the web

Implementing optimal outcomes in social computing: a game-theoretic approach BIBAFull-Text 539-548
  Arpita Ghosh; Patrick Hummel
In many social computing applications such as online Q&A forums, the best contribution for each task receives some high reward, while all remaining contributions receive an identical, lower reward irrespective of their actual qualities. Suppose a mechanism designer (site owner) wishes to optimize an objective that is some function of the number and qualities of received contributions. When potential contributors are strategic agents, who decide whether to contribute or not to selfishly maximize their own utilities, is such a "best contribution" mechanism, Mb, adequate to implement an outcome that is optimal for the mechanism designer? We first show that in settings where a contribution's value is determined primarily by an agent's expertise, and agents only strategically choose whether to contribute or not, contests can implement optimal outcomes: for any reasonable objective, the rewards for the best and remaining contributions in Mb can always be chosen so that the outcome in the unique symmetric equilibrium of Mb maximizes the mechanism designer's utility. We also show how the mechanism designer can learn these optimal rewards when she does not know the parameters of the agents' utilities, as might be the case in practice. We next consider settings where a contribution's value depends on both the contributor's expertise as well as her effort, and agents endogenously choose how much effort to exert in addition to deciding whether to contribute. Here, we show that optimal outcomes can never be implemented by contests if the system can rank the qualities of contributions perfectly. However, if there is noise in the contributions' rankings, then the mechanism designer can again induce agents to follow strategies that maximize her utility. Thus imperfect rankings can actually help achieve implementability of optimal outcomes when effort is endogenous and influences quality.
Lattice games and the economics of aggregators BIBAFull-Text 549-558
  Patrick R. Jordan; Uri Nadav; Kunal Punera; Andrzej Skrzypacz; George Varghese
We model the strategic decisions of web sites in content markets, where sites may reduce user search cost by aggregating content. Example aggregations include political news, technology, and other niche-topic websites. We model this market scenario as an extensive form game of complete information, where sites choose a set of content to aggregate and users associate with sites that are nearest to their interests. Thus, our scenario is a location game in which sites choose to aggregate content at a certain point in user-preference space, and our choice of distance metric, Jaccard distance, induces a lattice structure on the game. We provide two variants of this scenario: one where users associate with the first site to enter amongst sites of equal distances, and a second where users choose uniformly between sites at equal distances. We show that subgame perfect Nash equilibria exist for both games. While it appears to be computationally hard to compute equilibria in both games, we show a polynomial-time satisficing strategy called Frontier Descent for the first game. A satisficing strategy is not a best response, but ensures that earlier sites will have positive profits, assuming all subsequent sites also have positive profits. By contrast, we show that the second game has no satisficing solution.
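The Jaccard distance underlying the lattice structure, and the rule associating users with their nearest sites, can be sketched as follows (the site names and content tags are hypothetical):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two content sets: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def nearest_sites(user, sites):
    """All sites (name -> aggregated content set) at minimum Jaccard
    distance from a user's interest set; ties are the interesting case."""
    dists = {name: jaccard_distance(user, content)
             for name, content in sites.items()}
    best = min(dists.values())
    return sorted(name for name, d in dists.items() if d == best)

sites = {"tech": {"ai", "chips"}, "politics": {"polls", "policy"},
         "agg": {"ai", "polls"}}
print(nearest_sites({"ai", "chips"}, sites))   # → ['tech']
print(nearest_sites({"ai"}, sites))            # → ['agg', 'tech'] (a tie)
```

The two game variants differ precisely in how such ties are broken: by order of entry in the first variant, uniformly at random in the second.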
Strategic formation of credit networks BIBAFull-Text 559-568
  Pranav Dandekar; Ashish Goel; Michael P. Wellman; Bryce Wiedenbeck
Credit networks are an abstraction for modeling trust between agents in a network. Agents who do not directly trust each other can transact through exchange of IOUs (obligations) along a chain of trust in the network. Credit networks are robust to intrusion, can enable transactions between strangers in exchange economies, and have the liquidity to support a high rate of transactions. We study the formation of such networks when agents strategically decide how much credit to extend each other. When each agent trusts a fixed set of other agents, and transacts directly only with those it trusts, the formation game is a potential game and all Nash equilibria are social optima. Moreover, the Nash equilibria of this game are equivalent in a very strong sense: the sequences of transactions that can be supported from each equilibrium credit network are identical. When we allow transactions over longer paths, the game may not admit a Nash equilibrium, and even when it does, the price of anarchy may be unbounded. Hence, we study two special cases. First, when agents have a shared belief about the trustworthiness of each agent, the networks formed in equilibrium have a star-like structure. Though the price of anarchy is unbounded, myopic best response quickly converges to a social optimum. Similar star-like structures are found in equilibria of heuristic strategies found via simulation. In addition, we simulate a second case where agents may have varying information about each others' trustworthiness based on their distance in a social network. Empirical game analysis of these scenarios suggests that star structures arise only when defaults are relatively rare, and otherwise, credit tends to be issued over short social distances conforming to the locality of information.
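A transaction along a chain of trust can be sketched as a capacity update on directed credit edges. This is an illustrative model, not the authors' code; the `pay` function and the capacity convention are assumptions:

```python
def pay(cap, path, amount):
    """Route a payment along a trust path in a credit network.
    cap[(u, v)] is how much value can currently flow u -> v. Each hop
    spends forward capacity and gains equal reverse capacity (the IOUs
    change direction), so total liquidity is conserved."""
    hops = list(zip(path, path[1:]))
    # Check every hop first so a failed payment changes nothing.
    if any(cap.get((u, v), 0) < amount for u, v in hops):
        return False
    for u, v in hops:
        cap[(u, v)] -= amount
        cap[(v, u)] = cap.get((v, u), 0) + amount
    return True

# a pays c through b; the reverse capacities created let c pay a back.
cap = {("a", "b"): 5, ("b", "c"): 3}
print(pay(cap, ["a", "b", "c"], 3))   # → True
print(pay(cap, ["c", "b", "a"], 3))   # → True (uses the new IOUs)
print(cap[("b", "c")])                # → 3 (capacity restored)
```

Strangers transact exactly this way: no direct edge is needed as long as some path of sufficient residual credit connects them.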

Leveraging user actions in search

Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior BIBAFull-Text 569-578
  Qi Guo; Eugene Agichtein
Result clickthrough statistics and dwell time on clicked results have been shown to be valuable for inferring search result relevance, but the interpretation of these signals can vary substantially for different tasks and users. This paper shows that post-click searcher behavior, such as cursor movement and scrolling, provides additional clues for better estimating document relevance. To this end, we identify patterns of examination and interaction behavior that correspond to viewing a relevant or non-relevant document, and design a new Post-Click Behavior (PCB) model to capture these patterns. To our knowledge, PCB is the first to successfully incorporate post-click searcher interactions such as cursor movements and scrolling on a landing page for estimating document relevance. We evaluate PCB on a dataset collected from a controlled user study that contains interactions gathered from hundreds of unique queries, result clicks, and page examinations. The experimental results show that PCB is significantly more effective than using page dwell time information alone, both for estimating the explicit judgments of each user, and for re-ranking the results using the estimated relevance.
Joint relevance and freshness learning from clickthroughs for news search BIBAFull-Text 579-588
  Hongning Wang; Anlei Dong; Lihong Li; Yi Chang; Evgeniy Gabrilovich
In contrast to traditional Web search, where topical relevance is often the main selection criterion, news search is characterized by the increased importance of freshness. However, the estimation of relevance and freshness, and especially the relative importance of these two aspects, are highly specific to the query and the time when the query was issued. In this work, we propose a unified framework for modeling the topical relevance and freshness, as well as their relative importance, based on click logs. We use click statistics and content analysis techniques to define a set of temporal features, which predict the right mix of freshness and relevance for a given query. Experimental results on both historical click data and editorial judgments demonstrate the effectiveness of the proposed approach.
Active objects: actions for entity-centric search BIBAFull-Text 589-598
  Thomas Lin; Patrick Pantel; Michael Gamon; Anitha Kannan; Ariel Fuxman
We introduce an entity-centric search experience, called Active Objects, in which entity-bearing queries are paired with actions that can be performed on the entities. For example, given a query for a specific flashlight, we aim to present actions such as reading reviews, watching demo videos, and finding the best price online. In an annotation study conducted over a random sample of user query sessions, we found that a large proportion of queries in query logs involve actions on entities, calling for an automatic approach to identifying relevant actions for entity-bearing queries. In this paper, we pose the problem of finding actions that can be performed on entities as the problem of probabilistic inference in a graphical model that captures how an entity-bearing query is generated. We design models of increasing complexity that capture latent factors such as entity type and intended actions that determine how a user writes a query in a search box, and the URL that they click on. Given a large collection of real-world queries and clicks from a commercial search engine, the models are learned efficiently through maximum likelihood estimation using an EM algorithm. Given a new query, probabilistic inference enables recommendation of a set of pertinent actions and hosts. We propose an evaluation methodology for measuring the relevance of our recommended actions, and show empirical evidence of the quality and the diversity of the discovered actions.

Web user behavioral analysis and modeling

Modeling and predicting behavioral dynamics on the web BIBAFull-Text 599-608
  Kira Radinsky; Krysta Svore; Susan Dumais; Jaime Teevan; Alex Bocharov; Eric Horvitz
User behavior on the Web changes over time. For example, the queries that people issue to search engines, and the underlying informational goals behind the queries vary over time. In this paper, we examine how to model and predict this temporal user behavior. We develop a temporal modeling framework adapted from physics and signal processing that can be used to predict time-varying user behavior using smoothing and trends. We also explore other dynamics of Web behaviors, such as the detection of periodicities and surprises. We develop a learning procedure that can be used to construct models of users' activities based on features of current and historical behaviors. The results of experiments indicate that by using our framework to predict user behavior, we can achieve significant improvements in prediction compared to baseline models that weight historical evidence the same for all queries. We also develop a novel learning algorithm that explicitly learns when to apply a given prediction model among a set of such models. Our improved temporal modeling of user behavior can be used to enhance query suggestions, crawling policies, and result ranking.
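One standard way to realize "smoothing and trends" from signal processing is Holt's double exponential smoothing, which tracks a level and a trend and extrapolates both. The sketch below is a generic illustration of that family, not the paper's exact model; the parameter values are arbitrary:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
    """Holt's double exponential smoothing: maintain a smoothed level
    and trend, then extrapolate `horizon` steps beyond the series."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)   # smooth level
        trend = beta * (level - prev_level) + (1 - beta) * trend  # smooth trend
    return level + horizon * trend

# Query volume rising steadily: the forecast continues the trend.
history = [100, 110, 120, 130, 140]
print(holt_forecast(history))   # → 150.0
```

Weighting recent history more heavily than distant history is exactly what the baseline models in the abstract, which weight all historical evidence equally, fail to do.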
Are web users really Markovian? BIBAFull-Text 609-618
  Flavio Chierichetti; Ravi Kumar; Prabhakar Raghavan; Tamas Sarlos
User modeling on the Web has rested on the fundamental assumption of Markovian behavior -- a user's next action depends only on her current state, and not the history leading up to the current state. This forms the underpinning of PageRank web ranking, as well as a number of techniques for targeting advertising to users. In this work we examine the validity of this assumption, using data from a number of Web settings. Our main result invokes statistical order estimation tests for Markov chains to establish that Web users are not, in fact, Markovian. We study the extent to which the Markovian assumption is invalid, and derive a number of avenues for further research.
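A statistical order estimation test of the kind the abstract invokes can be sketched as a likelihood-ratio comparison of first-order and second-order chain fits. This is an illustration of the idea, not the authors' exact test:

```python
from collections import Counter
from math import log

def order_lr_statistic(seq):
    """Likelihood-ratio statistic G = 2 * (LL_order2 - LL_order1).
    G near 0: a first-order (Markovian) model explains the data as well
    as a second-order one. Large G: the Markov assumption fails."""
    tri = Counter(zip(seq, seq[1:], seq[2:]))   # (a, b, c) windows
    pair, prefix, mid = Counter(), Counter(), Counter()
    for (a, b, c), n in tri.items():
        pair[(b, c)] += n     # marginal for order-1 transitions
        prefix[(a, b)] += n   # conditioning context for order-2
        mid[b] += n           # conditioning context for order-1
    g = 0.0
    for (a, b, c), n in tri.items():
        p2 = n / prefix[(a, b)]          # p(c | a, b)
        p1 = pair[(b, c)] / mid[b]       # p(c | b)
        g += 2.0 * n * log(p2 / p1)
    return g

# Order-1 process (alternating): G is exactly 0.
first = [0, 1] * 31
# Order-2 process (x_t = x_{t-1} XOR x_{t-2}): G is large.
second = [0, 1]
for _ in range(60):
    second.append(second[-1] ^ second[-2])
print(order_lr_statistic(first), order_lr_statistic(second))
```

In practice the statistic is compared against a chi-squared threshold whose degrees of freedom grow with the state space, which is why such tests need the large logs the paper analyzes.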
Human wayfinding in information networks BIBAFull-Text 619-628
  Robert West; Jure Leskovec
Navigating information spaces is an essential part of our everyday lives, and in order to design efficient and user-friendly information systems, it is important to understand how humans navigate and find the information they are looking for. We perform a large-scale study of human wayfinding, in which, given a network of links between the concepts of Wikipedia, people play a game of finding a short path from a given start to a given target concept by following hyperlinks. What distinguishes our setup from other studies of human Web-browsing behavior is that in our case people navigate a graph of connections between concepts, and that the exact goal of the navigation is known ahead of time. We study more than 30,000 goal-directed human search paths and identify strategies people use when navigating information spaces. We find that human wayfinding, while mostly very efficient, differs from shortest paths in characteristic ways. Most subjects navigate through high-degree hubs in the early phase, while their search is guided by content features thereafter. We also observe a trade-off between simplicity and efficiency: conceptually simple solutions are more common but tend to be less efficient than more complex ones. Finally, we consider the task of predicting the target a user is trying to reach. We design a model and an efficient learning algorithm. Such predictive models of human wayfinding can be applied in intelligent browsing interfaces.

Ontology representation and querying: RDF and SPARQL

Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard BIBAFull-Text 629-638
  Marcelo Arenas; Sebastián Conca; Jorge Pérez
SPARQL -- the standard query language for querying RDF -- provides only limited navigational functionalities, although these features are of fundamental importance for graph data formats such as RDF. This has led the W3C to include the property path feature in the upcoming version of the standard, SPARQL 1.1. We tested several implementations of SPARQL 1.1 handling property path queries, and we observed that their evaluation methods for this class of queries have a poor performance even in some very simple scenarios. To formally explain this fact, we conduct a theoretical study of the computational complexity of property paths evaluation. Our results imply that the poor performance of the tested implementations is not a problem of these particular systems, but of the specification itself. In fact, we show that any implementation that adheres to the SPARQL 1.1 specification (as of November 2011) is doomed to show the same behavior, the key issue being the need for counting solutions imposed by the current specification. We provide several intractability results that, together with our empirical results, provide strong evidence against the current semantics of SPARQL 1.1 property paths. Finally, we put our results in perspective, and propose a natural alternative semantics with tractable evaluation, that we think may lead to a wide adoption of the language by practitioners, developers and theoreticians.
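The counting requirement can be made concrete with a small sketch: under a semantics that counts solutions, evaluating a property path query like `:a (:p)* ?x` must account for every distinct path, and the number of paths can grow exponentially even in tiny graphs. The node numbering and diamond gadget below are hypothetical:

```python
from functools import lru_cache

def count_paths(graph, src, dst):
    """Number of distinct directed paths src -> dst in a DAG; this is
    what a counting-based semantics for (:p)* must effectively track."""
    @lru_cache(maxsize=None)
    def walk(v):
        if v == dst:
            return 1
        return sum(walk(w) for w in graph.get(v, ()))
    return walk(src)

def diamond_chain(n):
    """Chain of n 'diamond' gadgets: paths double with every diamond,
    although the answer set of :a (:p)* ?x stays linear in n."""
    g = {}
    for i in range(n):
        a, b, c, d = 3 * i, 3 * i + 1, 3 * i + 2, 3 * (i + 1)
        g[a] = (b, c)
        g[b] = (d,)
        g[c] = (d,)
    return g

for n in (1, 5, 10):
    print(n, count_paths(diamond_chain(n), 0, 3 * n))   # 2, 32, 1024
```

A set-based (bag-free) semantics only needs reachability, which is computable in polynomial time; that is the essence of the tractable alternative the authors propose.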
Template-based question answering over RDF data BIBAFull-Text 639-648
  Christina Unger; Lorenz Bühmann; Jens Lehmann; Axel-Cyrille Ngonga Ngomo; Daniel Gerber; Philipp Cimiano
As an increasing amount of RDF data is published as Linked Data, intuitive ways of accessing this data become more and more important. Question answering approaches have been proposed as a good compromise between intuitiveness and expressivity. Most question answering systems translate questions into triples which are matched against the RDF data to retrieve an answer, typically relying on some similarity metric. However, in many cases, triples are not a faithful representation of the semantic structure of the natural language question, with the result that more expressive queries cannot be answered. To circumvent this problem, we present a novel approach that relies on a parse of the question to produce a SPARQL template that directly mirrors the internal structure of the question. This template is then instantiated using statistical entity identification and predicate detection. We show that this approach is competitive and discuss cases of questions that can be answered with our approach but not with competing approaches.
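A toy version of the pipeline, pattern-matched templates instantiated with a linked entity, can be sketched as follows. The patterns, IRIs, and stubbed entity linker are all hypothetical; the real system derives templates from a syntactic parse rather than surface regexes:

```python
import re

TEMPLATES = [
    # (question pattern, SPARQL template with an entity slot)
    (re.compile(r"who wrote (.+)\?", re.I),
     "SELECT ?who WHERE {{ {entity} <http://example.org/author> ?who }}"),
    # An aggregate template: the kind of structure plain triples miss.
    (re.compile(r"how many (.+) are there\?", re.I),
     "SELECT (COUNT(?x) AS ?n) WHERE {{ ?x a {entity} }}"),
]

def to_sparql(question, link_entity):
    """Match a question against the templates, then fill the slot with
    a linked entity IRI (entity linking is stubbed out here)."""
    for pattern, template in TEMPLATES:
        m = pattern.match(question)
        if m:
            return template.format(entity=link_entity(m.group(1)))
    return None

link = lambda phrase: f"<http://example.org/{phrase.replace(' ', '_')}>"
print(to_sparql("Who wrote Dracula?", link))
```

The COUNT template illustrates the expressivity argument: a counting question has no faithful encoding as a single data triple, but it does have a natural SPARQL template.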
On directly mapping relational databases to RDF and OWL BIBAFull-Text 649-658
  Juan F. Sequeda; Marcelo Arenas; Daniel P. Miranker
Mapping relational databases to RDF is a fundamental problem for the development of the Semantic Web. We present a solution, inspired by draft methods defined by the W3C, in which relational databases are directly mapped to RDF and OWL. Given a relational database schema and its integrity constraints, this direct mapping produces an OWL ontology, which provides the basis for generating RDF instances. The semantics of this mapping is defined using Datalog. Two fundamental properties are information preservation and query preservation. We prove that our mapping satisfies both conditions, even for relational databases that contain null values. We also consider two desirable properties: monotonicity and semantics preservation. We prove that our mapping is monotone and also prove that no monotone mapping, including ours, is semantics preserving. We conclude that monotonicity is an obstacle to semantics preservation and thus present a non-monotone direct mapping that is semantics preserving.
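A minimal sketch of the direct-mapping idea follows, assuming a toy IRI scheme (table/primary-key for subjects, table#column for predicates); the base URI and naming scheme are hypothetical. Null values produce no triples, which is one of the subtleties the paper treats formally.

```python
def direct_map(table, rows, base="http://example.org/"):
    """Emit one RDF triple per (row, column) pair:
    subject   = IRI built from the table name and primary key,
    predicate = IRI built from the table name and column name,
    object    = the column value as a literal. NULLs emit nothing."""
    triples = []
    for pk, record in rows.items():
        subj = f"<{base}{table}/{pk}>"
        for col, val in record.items():
            if val is None:  # null values produce no triple
                continue
            triples.append((subj, f"<{base}{table}#{col}>", f'"{val}"'))
    return triples
```

A row `(1, 'Ada', NULL)` in an `emp(id, name, dept)` table thus yields a single triple for `name` and nothing for `dept`.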

Security 2

Practical end-to-end web content integrity BIBAFull-Text 659-668
  Kapil Singh; Helen J. Wang; Alexander Moshchuk; Collin Jackson; Wenke Lee
Widespread growth of open wireless hotspots has made it easy to carry out man-in-the-middle attacks and impersonate web sites. Although HTTPS can be used to prevent such attacks, its universal adoption is hindered by its performance cost and its inability to leverage caching at intermediate servers (such as CDN servers and caching proxies) while maintaining end-to-end security. To complement HTTPS, we revive an old idea from SHTTP, a protocol that offers end-to-end web integrity without confidentiality. We name the protocol HTTPi and give it an efficient design that is easy to deploy for today's web. In particular, we tackle several previously unidentified challenges, such as supporting progressive page loading on the client's browser, handling mixed content, and defining access control policies among HTTP, HTTPi, and HTTPS content from the same domain. Our prototyping and evaluation experience shows that HTTPi incurs negligible performance overhead over HTTP, can leverage existing web infrastructure such as CDNs or caching proxies without any modifications to them, and can make many of the mixed-content problems in existing HTTPS web sites easily go away. Based on this experience, we advocate that browser and web server vendors adopt HTTPi.
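The integrity-without-confidentiality idea can be sketched with a keyed digest over the response body: intermediaries may cache and serve the body in the clear, but any tampering is detectable end to end. This is only a conceptual illustration using a shared-secret HMAC; a deployable scheme like HTTPi would instead use signatures verifiable with the server's public key.

```python
import hashlib
import hmac

def sign_body(key: bytes, body: bytes) -> str:
    """Server side: compute a keyed digest over the response body."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify_body(key: bytes, body: bytes, signature: str) -> bool:
    """Client side: the body travels (and is cached) unencrypted,
    but modification by a man-in-the-middle invalidates the digest."""
    return hmac.compare_digest(sign_body(key, body), signature)
```

Note the asymmetry with HTTPS: confidentiality is given up, which is exactly what lets CDNs and caching proxies keep serving the content unmodified.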
Online modeling of proactive moderation system for auction fraud detection BIBAFull-Text 669-678
  Liang Zhang; Jie Yang; Belle Tseng
We consider the problem of building online machine-learned models for detecting auction fraud on e-commerce web sites. Since the emergence of the world wide web, online shopping and online auctions have gained more and more popularity. While people enjoy the benefits of online trading, criminals are also taking advantage of it to conduct fraudulent activities against honest parties and obtain illegal profit. Hence proactive fraud-detection moderation systems are commonly applied in practice to detect and prevent such illegal and fraudulent activities. Machine-learned models, especially those that are learned online, are able to catch frauds more efficiently and quickly than human-tuned rule-based systems. In this paper, we propose an online probit model framework that takes online feature selection, coefficient bounds from human knowledge, and multiple instance learning into account simultaneously. Through empirical experiments on real-world online auction fraud detection data, we show that this model can potentially detect more frauds and significantly reduce customer complaints compared to several baseline models and the human-tuned rule-based system.
Serf and turf: crowdturfing for fun and profit BIBAFull-Text 679-688
  Gang Wang; Christo Wilson; Xiaohan Zhao; Yibo Zhu; Manish Mohanlal; Haitao Zheng; Ben Y. Zhao
Popular Internet services in recent years have shown that remarkable things can be achieved by harnessing the power of the masses using crowd-sourcing systems. However, crowd-sourcing systems can also pose a real challenge to existing security mechanisms deployed to protect Internet services. Many of these security techniques rely on the assumption that malicious activity is generated by automated programs. Thus they would perform poorly or be easily bypassed when attacks are generated by real users working in a crowd-sourcing system. Through measurements, we have found surprising evidence showing that not only do malicious crowd-sourcing systems exist, but they are rapidly growing in both user base and total revenue. We describe in this paper a significant effort to study and understand these "crowdturfing" systems in today's Internet. We use detailed crawls to extract data about the size and operational structure of these crowdturfing systems. We analyze details of campaigns offered and performed on these sites, and evaluate their end-to-end effectiveness by running active, benign campaigns of our own. Finally, we study and compare the sources of workers on crowdturfing sites in different countries. Our results suggest that campaigns on these systems are highly effective at reaching users, and their continuing growth poses a concrete threat to online communities both in the US and elsewhere.

Social interactions and the web

Actions speak as loud as words: predicting relationships from social behavior data BIBAFull-Text 689-698
  Sibel Adali; Fred Sisenda; Malik Magdon-Ismail
In recent years, studies concentrating on analyzing user personality and finding credible content in social media have become quite popular. Most such work augments features from textual content with features representing the user's social ties and the tie strength. Social ties are crucial in understanding the networks that people are part of. However, textual content is extremely useful in understanding the topics discussed and the personality of the individual. We bring a new dimension to this type of analysis with methods to compute the type of ties individuals have and the strength of the ties in each dimension. We present a new genre of behavioral features that are able to capture the "function" of a specific relationship without the help of textual features. Our novel features are based on the statistical properties of communication patterns between individuals, such as reciprocity, assortativity, attention and latency. We introduce a new methodology for determining how such features can be compared to textual features, and show, using Twitter data, that our features can capture contextual information present in textual features very accurately. Conversely, we also demonstrate how textual features can be used to determine social attributes related to an individual.
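Two of the named communication-pattern features can be sketched with simple count-based definitions. The exact formulas here are hypothetical placeholders; the paper's statistical features are richer.

```python
def reciprocity(sent: int, received: int) -> float:
    """Balance of messages exchanged on a tie: 1.0 for perfectly
    balanced traffic, approaching 0.0 for a one-sided relationship.
    (A hypothetical definition for illustration.)"""
    if sent + received == 0:
        return 0.0
    return min(sent, received) / max(sent, received)

def attention(to_partner: int, total_sent: int) -> float:
    """Share of a user's outgoing messages directed at one partner,
    a rough proxy for how much attention that tie receives."""
    return to_partner / total_sent if total_sent else 0.0
```

Features of this kind depend only on who communicates with whom and how often, so they can characterize a relationship's "function" without reading any message text.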
Echoes of power: language effects and power differences in social interaction BIBAFull-Text 699-708
  Cristian Danescu-Niculescu-Mizil; Lillian Lee; Bo Pang; Jon Kleinberg
Understanding social interaction within groups is key to analyzing online communities. Most current work focuses on structural properties: who talks to whom, and how such interactions form larger network structures. The interactions themselves, however, generally take place in the form of natural language -- either spoken or written -- and one could reasonably suppose that signals manifested in language might also provide information about roles, status, and other aspects of the group's dynamics. To date, however, finding domain-independent language-based signals has been a challenge.
   Here, we show that in group discussions, power differentials between participants are subtly revealed by how much one individual immediately echoes the linguistic style of the person they are responding to. Starting from this observation, we propose an analysis framework based on linguistic coordination that can be used to shed light on power relationships and that works consistently across multiple types of power -- including a more "static" form of power based on status differences, and a more "situational" form of power in which one individual experiences a type of dependence on another. Using this framework, we study how conversational behavior can reveal power relationships in two very different settings: discussions among Wikipedians and arguments before the U.S. Supreme Court.
Bimodal invitation-navigation fair bets model for authority identification in a social network BIBAFull-Text 709-718
  Suratna Budalakoti; Ron Bekkerman
We consider the problem of identifying the most respected, authoritative members of a large-scale online social network (OSN) by constructing a global ranked list of its members. The problem is distinct from the problem of identifying influencers: we are interested in identifying members who are influential in the real world, even when not necessarily so on the OSN. We focus on two sources for information about user authority: (a) invitations to connect, which are usually sent to people whom the inviter respects, and (b) members' browsing behavior, as profiles of more important people are viewed more often than others'. We construct two directed graphs over the same set of nodes (representing member profiles): the invitation graph and the navigation graph respectively. We show that the standard PageRank algorithm, a baseline in web page ranking, is not effective in people ranking, and develop a social capital based model, called the fair bets model, as a viable solution. We then propose a novel approach, called bimodal fair bets, for combining information from two (or more) endorsement graphs drawn from the same OSN, by simultaneously using the authority scores of nodes in one graph to inform the other, and vice versa, in a mutually reinforcing fashion. We evaluate the ranking results on the LinkedIn social network using this model, where members who have Wikipedia profiles are assumed to be authoritative. Experimental results show that our approach outperforms the baseline approach by a large margin.
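The PageRank baseline the paper argues against can be sketched as a plain power iteration over an endorsement graph. This is a generic implementation of the baseline, not the fair bets model itself.

```python
def pagerank(graph, d=0.85, iters=50):
    """Plain PageRank by power iteration over a directed graph given as
    {node: [successors]}. Dangling mass is redistributed uniformly."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, out in graph.items():
            if not out:
                continue
            share = d * rank[u] / len(out)
            for v in out:
                new[v] += share
        # nodes with no outgoing edges spread their mass uniformly
        dangling = d * sum(rank[u] for u in nodes if not graph.get(u))
        for u in nodes:
            new[u] += dangling / n
        rank = new
    return rank
```

On an invitation or navigation graph, scores like these reward well-connected profiles; the paper's observation is that this is a poor proxy for real-world authority, motivating the fair bets alternative.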

Entity and taxonomy extraction

Targeted disambiguation of ad-hoc, homogeneous sets of named entities BIBAFull-Text 719-728
  Chi Wang; Kaushik Chakrabarti; Tao Cheng; Surajit Chaudhuri
In many entity extraction applications, the entities to be recognized are constrained to be from a list of "target entities". In many cases, these target entities are (i) ad-hoc, i.e., do not exist in a knowledge base and (ii) homogeneous (e.g., all the entities are IT companies). We study the following novel disambiguation problem in this unique setting: given the candidate mentions of all the target entities, determine which ones are true mentions of a target entity. Prior techniques only consider target entities present in a knowledge base and/or having a rich set of attributes. In this paper, we develop novel techniques that require no knowledge about the entities except their names. Our main insight is to leverage the homogeneity constraint and disambiguate the candidate mentions collectively across all documents. We propose a graph-based model, called MentionRank, for that purpose. Furthermore, if additional knowledge is available for some or all of the entities, our model can leverage it to further improve quality. Our experiments demonstrate the effectiveness of our model. To the best of our knowledge, this is the first work on targeted entity disambiguation for ad-hoc entities.
Collective context-aware topic models for entity disambiguation BIBAFull-Text 729-738
  Prithviraj Sen
A crucial step in adding structure to unstructured data is to identify references to entities and disambiguate them. Such disambiguated references can help enhance readability and draw similarities across different pieces of running text in an automated fashion. Previous research has tackled this problem by first forming a catalog of entities from a knowledge base, such as Wikipedia, and then using this catalog to disambiguate references in unseen text. However, most of the previously proposed models either do not use all text in the knowledge base, potentially missing out on discriminative features, or do not exploit word-entity proximity to learn high-quality catalogs. In this work, we propose topic models that keep track of the context of every word in the knowledge base, so that words appearing within the same context as an entity are more likely to be associated with that entity. Thus, our topic models utilize all text present in the knowledge base and help learn high-quality catalogs. Our models also learn groups of co-occurring entities, thus enabling collective disambiguation. Unlike most previous topic models, our models are non-parametric and do not require the user to specify the exact number of groups present in the knowledge base. In experiments performed on an extract of Wikipedia containing almost 60,000 references, our models outperform SVM-based baselines by as much as 18% in terms of disambiguation accuracy, translating to an increase of almost 11,000 correctly disambiguated references.
Document hierarchies from text and links BIBAFull-Text 739-748
  Qirong Ho; Jacob Eisenstein; Eric P. Xing
Hierarchical taxonomies provide a multi-level view of large document collections, allowing users to rapidly drill down to fine-grained distinctions in topics of interest. We show that automatically induced taxonomies can be made more robust by combining text with relational links. The underlying mechanism is a Bayesian generative model in which a latent hierarchical structure explains the observed data -- thus, finding hierarchical groups of documents with similar word distributions and dense network connections. As a nonparametric Bayesian model, our approach does not require pre-specification of the branching factor at each non-terminal, but finds the appropriate level of detail directly from the data. Unlike many prior latent space models of network structure, the complexity of our approach does not grow quadratically in the number of documents, enabling application to networks with more than ten thousand nodes. Experimental results on hypertext and citation network corpora demonstrate the advantages of our hierarchical, multimodal approach.

Leveraging user-generated content

Mining photo-sharing websites to study ecological phenomena BIBAFull-Text 749-758
  Haipeng Zhang; Mohammed Korayem; David J. Crandall; Gretchen LeBuhn
The popularity of social media websites like Flickr and Twitter has created enormous collections of user-generated content online. Latent in these content collections are observations of the world: each photo is a visual snapshot of what the world looked like at a particular point in time and space, for example, while each tweet is a textual expression of the state of a person and his or her environment. Aggregating these observations across millions of social sharing users could lead to new techniques for large-scale monitoring of the state of the world and how it is changing over time. In this paper we step towards that goal, showing that by analyzing the tags and image features of geo-tagged, time-stamped photos we can measure and quantify the occurrence of ecological phenomena including ground snow cover, snow fall and vegetation density. We compare several techniques for dealing with the large degree of noise in the dataset, and show how machine learning can be used to reduce errors caused by misleading tags and ambiguous visual content. We evaluate the accuracy of these techniques by comparing to ground truth data collected both by surface stations and by Earth-observing satellites. Besides the immediate application to ecology, our study gives insight into how to accurately crowd-source other types of information from large, noisy social sharing datasets.
Learning from the past: answering new questions with past answers BIBAFull-Text 759-768
  Anna Shtok; Gideon Dror; Yoelle Maarek; Idan Szpektor
Community-based Question Answering sites, such as Yahoo! Answers or Baidu Zhidao, allow users to get answers to complex, detailed and personal questions from other users. However, since answering a question depends on the ability and willingness of users to address the asker's needs, a significant fraction of the questions remain unanswered. We measured that in Yahoo! Answers, this fraction represents 15% of all incoming English questions. At the same time, we discovered that around 25% of questions in certain categories are recurrent, at least at the question-title level, over a period of one year.
   We attempt to reduce the rate of unanswered questions in Yahoo! Answers by reusing the large repository of past resolved questions, openly available on the site. More specifically, we estimate the probability whether certain new questions can be satisfactorily answered by a best answer from the past, using a statistical model specifically trained for this task. We leverage concepts and methods from query-performance prediction and natural language processing in order to extract a wide range of features for our model. The key challenge here is to achieve a level of quality similar to the one provided by the best human answerers.
   We evaluated our algorithm on offline data extracted from Yahoo! Answers, but more interestingly, also on online data by using three "live" answering robots that automatically provide past answers to new questions when a certain degree of confidence is reached. We report the success rate of these robots in three active Yahoo! Answers categories in terms of accuracy, coverage, and askers' satisfaction. To the best of our knowledge, this work presents a first attempt at automatically answering questions of a social nature by reusing past answers of high quality.
Discovering geographical topics in the twitter stream BIBAFull-Text 769-778
  Liangjie Hong; Amr Ahmed; Siva Gurumurthy; Alexander J. Smola; Kostas Tsioutsiouliklis
Micro-blogging services have become indispensable communication tools for online users, used for disseminating breaking news, sharing eyewitness accounts, expressing individual views, and organizing protest groups. Recently, Twitter, along with other online social networking services such as Foursquare, Gowalla, Facebook and Yelp, has started supporting location services in its messages, either explicitly, by letting users choose their places, or implicitly, by enabling geo-tagging, i.e., associating messages with latitudes and longitudes. This functionality allows researchers to address an exciting set of questions: 1) How is information created and shared across geographical locations? 2) How do spatial and linguistic characteristics of people vary across regions? 3) How can human mobility be modeled? Although many attempts have been made to tackle these problems, previous methods are either too complicated to implement or so oversimplified that they cannot yield reasonable performance. Discovering topics and identifying users' interests from these geo-tagged messages is a challenging task due to the sheer amount of data and the diversity of language variations used on these location-sharing services. In this paper we focus on Twitter and present an algorithm that models diversity in tweets based on topical diversity, geographical diversity, and an interest distribution of the user. Furthermore, we take the Markovian nature of a user's location into account. Our model exploits sparse factorial coding of the attributes, thus allowing us to deal with a large and diverse set of covariates efficiently. Our approach is vital for applications such as user profiling, content recommendation and topic tracking. We show high accuracy in location estimation based on our model. Moreover, the algorithm identifies interesting topics based on location and language.

Data and content management 1

Declarative platform for data sourcing games BIBAFull-Text 779-788
  Daniel Deutch; Ohad Greenshpan; Boris Kostenko; Tova Milo
Harnessing a crowd of users for the collection of mass data (data sourcing) has recently become a widespread practice. One effective technique is based on games as a tool that attracts the crowd to contribute useful facts. We focus here on the data management layer of such games, and observe that the development of this layer involves challenges such as dealing with probabilistic data combined with recursive manipulation of this data. These challenges are difficult to address using current declarative data management frameworks; we thus propose a novel such framework and demonstrate its usefulness in expressing different aspects of the data management of Trivia-like games. We have implemented a system prototype with our novel data management framework at its core, and we highlight key issues in the system design, as well as our experiments, which indicate the usefulness and scalability of the approach.
Information integration over time in unreliable and uncertain environments BIBAFull-Text 789-798
  Aditya Pal; Vibhor Rastogi; Ashwin Machanavajjhala; Philip Bohannon
Often an interesting true value such as a stock price, sports score, or current temperature is only available via the observations of noisy and potentially conflicting sources. Several techniques have been proposed to reconcile these conflicts by computing a weighted consensus based on source reliabilities, but these techniques focus on static values. When the real-world entity evolves over time, the noisy sources can delay, or even miss, reporting some of the real-world updates. This temporal aspect introduces two key challenges for consensus-based approaches: (i) due to delays, the mapping between a source's noisy observation and the real-world update it observes is unknown, and (ii) missed updates may translate to missing values for the consensus problem, even if the mapping is known. To overcome these challenges, we propose a formal approach that models the history of updates of the real-world entity as a hidden semi-Markovian process (HSMM). The noisy sources are modeled as observations of the hidden state, but the mapping between a hidden state (i.e. real-world update) and the observation (i.e. source value) is unknown. We propose algorithms based on Gibbs Sampling and EM to jointly infer both the history of real-world updates as well as the unknown mapping between them and the source values. We demonstrate using experiments on real-world datasets how our history-based techniques improve upon history-agnostic consensus-based approaches.
SAFE extensibility of data-driven web applications BIBAFull-Text 799-808
  Raphael M. Reischuk; Michael Backes; Johannes Gehrke
This paper presents a novel method for enabling fast development and easy customization of interactive data-intensive web applications. Our approach is based on a high-level hierarchical programming model that results in a very clean semantics for the application while at the same time creating well-defined interfaces for the customization of application components. A prototypical implementation of a conference management system shows the efficacy of our approach.

Web engineering 1

On the analysis of cascading style sheets BIBAFull-Text 809-818
  Pierre Genevès; Nabil Layaïda; Vincent Quint
Developing and maintaining cascading style sheets (CSS) is an important issue for web developers, as they suffer from a lack of rigorous methods. Most existing means rely on validators that check syntactic rules, and on runtime debuggers that check the behavior of a CSS style sheet on a particular document instance. However, the aim of most style sheets is to be applied to an entire set of documents, usually defined by some schema. To this end, a CSS style sheet is usually written w.r.t. a given schema. While usual debugging tools help reduce the number of bugs, they do not ultimately allow one to prove properties over the whole set of documents to which the style sheet is intended to be applied. We propose a novel approach to fill this gap. We introduce ideas borrowed from the fields of logic and compile-time verification for the analysis of CSS style sheets. We present an original tool based on recent advances in tree logics. The tool is capable of statically detecting a wide range of errors (such as empty CSS selectors and semantically equivalent selectors), as well as proving properties related to sets of documents (such as coverage of styling information), in the presence or absence of schema information. This new tool can be used in addition to existing runtime debuggers to ensure a higher level of quality of CSS style sheets.
Extracting client-side web application code BIBAFull-Text 819-828
  Josip Maras; Jan Carlson; Ivica Crnković
The web application domain is one of the fastest growing and most wide-spread application domains today. By utilizing fast, modern web browsers and advanced scripting techniques, web developers are developing highly interactive applications that can, in terms of user-experience and responsiveness, compete with standard desktop applications. A web application is composed of two equally important parts: the server-side and the client-side. The client-side acts as a user-interface to the application, and can be viewed as a collection of behaviors. Similar behaviors are often used in a large number of applications, and facilitating their reuse offers considerable benefits. However, due to client-side specifics, such as multi-language implementation and extreme dynamicity, identifying and extracting code responsible for a certain behavior is difficult. In this paper we present a semi-automatic method for extracting client-side web application code implementing a certain behavior. We show how by analyzing the execution of a usage scenario, code responsible for a certain behavior can be identified, how dependencies between different parts of the application can be tracked, and how in the end only the code responsible for a certain behavior can be extracted. Our evaluation shows that the method is capable of extracting stand-alone behaviors, while achieving considerable savings in terms of code size and application performance.
OPAL: automated form understanding for the deep web BIBAFull-Text 829-838
  Tim Furche; Georg Gottlob; Giovanni Grasso; Xiaonan Guo; Giorgio Orsi; Christian Schallhart
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving the usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists, and previous works disagree even on the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: for form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).

Collaboration in social networks

Online team formation in social networks BIBAFull-Text 839-848
  Aris Anagnostopoulos; Luca Becchetti; Carlos Castillo; Aristides Gionis; Stefano Leonardi
We study the problem of online team formation. We consider a setting in which people possess different skills and compatibility among potential team members is modeled by a social network. A sequence of tasks arrives in an online fashion, and each task requires a specific set of skills. The goal is to form a new team upon arrival of each task, so that (i) each team possesses all skills required by the task, (ii) each team has small communication overhead, and (iii) the workload of performing the tasks is balanced among people in the fairest possible way.
   We propose efficient algorithms that address all these requirements: our algorithms form teams that always satisfy the required skills, provide approximation guarantees with respect to team communication overhead, and they are online-competitive with respect to load balancing. Experiments performed on collaboration networks among film actors and scientists, confirm that our algorithms are successful at balancing these conflicting requirements.
   This is the first paper that simultaneously addresses all these aspects. Previous work has either focused on minimizing coordination for a single task or balancing the workload neglecting coordination costs.
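The skill-coverage requirement alone can be sketched as a greedy set-cover pass. The communication-overhead and load-balancing objectives that make the problem interesting (and that the paper's algorithms provide guarantees for) are deliberately ignored in this toy version.

```python
def greedy_team(task_skills, people):
    """Form a team covering every required skill, greedily picking
    whoever covers the most still-missing skills each round.
    `people` maps a person to their set of skills. Returns None if
    the task cannot be staffed. Communication cost and workload
    balance are NOT considered here."""
    needed = set(task_skills)
    team = []
    while needed:
        best = max(people, key=lambda p: len(needed & people[p]))
        gained = needed & people[best]
        if not gained:
            return None  # no one covers any remaining skill
        team.append(best)
        needed -= gained
    return team
```

In the online setting studied by the paper, a pass like this would run per arriving task, with the extra objectives steering which of several covering teams to prefer.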
Understanding task-driven information flow in collaborative networks BIBAFull-Text 849-858
  Gengxin Miao; Shu Tao; Winnie Cheng; Randy Moulic; Louise E. Moser; David Lo; Xifeng Yan
Collaborative networks are a special type of social network formed by members who collectively achieve specific goals, such as fixing software bugs and resolving customers' problems. In such networks, information flow among members is driven by the tasks assigned to the network, and by the expertise of its members to complete those tasks. In this work, we analyze real-life collaborative networks to understand their common characteristics and how information is routed in these networks. Our study shows that collaborative networks exhibit significantly different properties compared with other complex networks. Collaborative networks have truncated power-law node degree distributions and other organizational constraints. Furthermore, the number of steps along which information is routed follows a truncated power-law distribution. Based on these observations, we developed a network model that can generate synthetic collaborative networks subject to certain structure constraints. Moreover, we developed a routing model that emulates task-driven information routing conducted by human beings in a collaborative network. Together, these two models can be used to study the efficiency of information routing for different types of collaborative networks -- a problem that is important in practice yet difficult to solve without the method proposed in this paper.
New objective functions for social collaborative filtering BIBAFull-Text 859-868
  Joseph Noel; Scott Sanner; Khoi-Nguyen Tran; Peter Christen; Lexing Xie; Edwin V. Bonilla; Ehsan Abbasnejad; Nicolas Della Penna
This paper examines the problem of social collaborative filtering (CF) to recommend items of interest to users in a social network setting. Unlike standard CF algorithms using relatively simple user and item features, recommendation in social networks poses the more complex problem of learning user preferences from a rich and complex set of user profile and interaction information. Many existing social CF methods have extended traditional CF matrix factorization, but have overlooked important aspects germane to the social setting. We propose a unified framework for social CF matrix factorization by introducing novel objective functions for training. Our new objective functions have three key features that address main drawbacks of existing approaches: (a) we fully exploit feature-based user similarity, (b) we permit direct learning of user-to-user information diffusion, and (c) we leverage co-preference (dis)agreement between two users to learn restricted areas of common interest. We evaluate these new social CF objectives, comparing them to each other and to a variety of (social) CF baselines, and analyze user behavior on live user trials in a custom-developed Facebook App involving data collected over five months from over 100 App users and their 37,000+ friends.
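The flavor of a socially regularized matrix factorization objective can be sketched by adding a term that pulls friends' latent factors together. The objective below is a generic textbook-style illustration with hypothetical weights, not the paper's proposed objective functions.

```python
def social_mf_objective(R, U, V, friends, lam=0.1, beta=0.1):
    """Toy social CF objective:
    squared reconstruction error on observed ratings R[(user, item)],
    plus L2 weight decay on the latent factors U (users) and V (items),
    plus a social term penalizing distance between friends' factors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    err = sum((r - dot(U[u], V[i])) ** 2 for (u, i), r in R.items())
    decay = lam * (sum(dot(u, u) for u in U.values()) +
                   sum(dot(v, v) for v in V.values()))
    social = beta * sum(sum((a - b) ** 2 for a, b in zip(U[u], U[f]))
                        for u, fs in friends.items() for f in fs)
    return err + decay + social
```

The paper's contribution is precisely in replacing the naive social term with objectives capturing feature-based similarity, direct diffusion, and co-preference agreement; this sketch only shows where such terms plug into the factorization loss.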

Information extraction

Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions BIBAFull-Text 869-878
  Kavita Ganesan; ChengXiang Zhai; Evelyne Viegas
This paper presents a new unsupervised approach to generating ultra-concise summaries of opinions. We formulate the problem of generating such a micropinion summary as an optimization problem, where we seek a set of concise and non-redundant phrases that are readable and represent key opinions in text. We measure representativeness based on a modified mutual information function and model readability with an n-gram language model. We propose some heuristic algorithms to efficiently solve this optimization problem. Evaluation results show that our unsupervised approach outperforms other state-of-the-art summarization methods and that the generated summaries are informative and readable.
Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce BIBAFull-Text 879-888
  Ke Zhai; Jordan Boyd-Graber; Nima Asadi; Mohamad L. Alkhouja
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we introduce a novel and flexible large scale topic modeling package in MapReduce (Mr. LDA). As opposed to other techniques which use Gibbs sampling, our proposed framework uses variational inference, which easily fits into a distributed environment. More importantly, this variational implementation, unlike highly tuned and specialized implementations based on Gibbs sampling, is easily extensible. We demonstrate two extensions of the models possible with this scalable framework: informed priors to guide topic discovery and extracting topics from a multilingual corpus. We compare the scalability of Mr. LDA against Mahout, an existing large scale topic modeling package. Mr. LDA outperforms Mahout both in execution speed and held-out likelihood.

Web mining

Trains of thought: generating information maps BIBAFull-Text 899-908
  Dafna Shahaf; Carlos Guestrin; Eric Horvitz
When information is abundant, it becomes increasingly difficult to fit nuggets of knowledge into a single coherent picture. Complex stories spaghetti into branches, side stories, and intertwining narratives. In order to explore these stories, one needs a map to navigate unfamiliar territory. We propose a methodology for creating structured summaries of information, which we call metro maps. Our proposed algorithm generates a concise structured set of documents maximizing coverage of salient pieces of information. Most importantly, metro maps explicitly show the relations among retrieved pieces in a way that captures story development. We first formalize characteristics of good maps and formulate their construction as an optimization problem. Then we provide efficient methods with theoretical guarantees for generating maps. Finally, we integrate user interaction into our framework, allowing users to alter the maps to better reflect their interests. Pilot user studies with a real-world dataset demonstrate that the method is able to produce maps which help users acquire knowledge efficiently.
Learning causality for news events prediction BIBAFull-Text 909-918
  Kira Radinsky; Sagie Davidovich; Shaul Markovitch
The problem we tackle in this work is, given a present news event, to generate a plausible future event that can be caused by the given event. We present a new methodology for modeling and predicting such future news events using machine learning and data mining techniques. Our Pundit algorithm generalizes examples of causality pairs to infer a causality predictor. To obtain precise labeled causality examples, we mine 150 years of news articles, and apply semantic natural language modeling techniques to titles containing certain predefined causality patterns. For generalization, the model uses a vast amount of world knowledge ontologies mined from LinkedData, containing 200 datasets with approximately 20 billion relations. Empirical evaluation on real news articles shows that our Pundit algorithm reaches human-level performance.
Your two weeks of fame and your grandmother's BIBAFull-Text 919-928
  James Cook; Atish Das Sarma; Alex Fabrikant; Andrew Tomkins
Did celebrity last longer in 1929, 1992 or 2009? We investigate the phenomenon of fame by mining a collection of news articles that spans the twentieth century, and also perform a side study on a collection of blog posts from the last 10 years. By analyzing mentions of personal names, we measure each person's time in the spotlight, and watch the distribution change from a century ago to a year ago. We expected to find a trend of decreasing durations of fame as news cycles accelerated and attention spans became shorter. Instead, we find a remarkable consistency through most of the period we study. Through a century of rapid technological and societal change, through the appearance of Twitter, communication satellites and the Internet, we do not observe a significant change in typical duration of celebrity. We also study the most famous of the famous, and find different results depending on our method for measuring duration of fame. With a method that may be thought of as measuring a spike of attention around a single narrow news story, we see the same result as before: stories last as long now as they did in 1930. A second method, which may be thought of as measuring the duration of public interest in a person, indicates that famous people's presence in the news is becoming longer rather than shorter, an effect most likely driven by the wider distribution and higher volume of media in modern times. Similar studies have been done with much shorter timescales, specifically in the context of information spreading on Twitter and similar social networking sites. However, to the best of our knowledge, this is the first massive-scale study of this nature that spans over a century of archived data, thereby allowing us to track changes across decades.

Data and content management 2

A unified approach to learning task-specific bit vector representations for fast nearest neighbor search BIBAFull-Text 929-938
  Vinod Nair; Dhruv Mahajan; Sundararajan Sellamanickam
Fast nearest neighbor search is necessary for a variety of large scale web applications such as information retrieval, nearest neighbor classification and nearest neighbor regression. Recently a number of machine learning algorithms have been proposed for representing the data to be searched as (short) bit vectors and then using hashing to do rapid search. These algorithms have been limited in their applicability in that they are suited for only one type of task -- e.g. Spectral Hashing learns bit vector representations for retrieval, but not, say, classification. In this paper we present a unified approach to learning bit vector representations for many applications that use nearest neighbor search. The main contribution is a single learning algorithm that can be customized to learn a bit vector representation suited for the task at hand. This broadens the usefulness of bit vector representations to tasks beyond just conventional retrieval. We propose a learning-to-rank formulation to learn the bit vector representation of the data. The LambdaRank algorithm is used for learning a function that computes a task-specific bit vector from an input data vector. Our approach outperforms state-of-the-art nearest neighbor methods on a number of real world text and image classification and retrieval datasets. It is scalable and learns a 32-bit representation on 1.46 million training cases in two days.
Lightweight automatic face annotation in media pages BIBAFull-Text 939-948
  Dmitri Perelman; Edward Bortnikov; Ronny Lempel; Roman Sandler
Labeling human faces in images contained in Web media stories enables enriching the user experience offered by media sites. We propose a lightweight framework for automatic image annotation that exploits named entities mentioned in the article to significantly boost the accuracy of face recognition. While previous works in the area labor to train comprehensive offline visual models for a pre-defined universe of candidates, our approach models the people mentioned in a given story on the fly, using a standard Web image search engine as an image sampling mechanism. We overcome multiple sources of noise introduced by this ad-hoc process, to build a fast and robust end-to-end system from off-the-shelf error-prone text analysis and machine vision components. In experiments conducted on approximately 900 faces depicted in 500 stories from a major celebrity news website, we were able to correctly label 81.5% of the faces while mislabeling 14.8% of them.
Distributed graph pattern matching BIBAFull-Text 949-958
  Shuai Ma; Yang Cao; Jinpeng Huai; Tianyu Wo
Graph simulation has been adopted for pattern matching to reduce the complexity and capture the need of novel applications. With the rapid development of the Web and social networks, data is typically distributed over multiple machines. Hence a natural question is how to evaluate graph simulation on distributed data. To our knowledge, no such distributed algorithms are in place yet. This paper settles this question by providing evaluation algorithms and optimizations for graph simulation in a distributed setting. (1) We study the impacts of components and data locality on the evaluation of graph simulation. (2) We give an analysis of a large class of distributed algorithms, captured by a message-passing model, for graph simulation. We also identify three complexity measures: visit times, makespan and data shipment, for analyzing the distributed algorithms, and show that these measures are essentially in conflict with one another. (3) We propose distributed algorithms and optimization techniques that exploit the properties of graph simulation and the analyses of distributed algorithms. (4) We experimentally verify the effectiveness and efficiency of these algorithms, using both real-life and synthetic data.

Web engineering 2

Towards network-aware service composition in the cloud BIBAFull-Text 959-968
  Adrian Klein; Fuyuki Ishikawa; Shinichi Honiden
Service-Oriented Computing (SOC) enables the composition of loosely coupled services provided with varying Quality of Service (QoS) levels. Selecting a (near-)optimal set of services for a composition in terms of QoS is crucial when many functionally equivalent services are available. With the advent of Cloud Computing, both the number of such services and their distribution across the network are rising rapidly, increasing the impact of the network on the QoS of such compositions. Despite this, current approaches do not differentiate between the QoS of services themselves and the QoS of the network. Therefore, the computed latency differs substantially from the actual latency, resulting in suboptimal QoS for service compositions in the cloud. Thus, we propose a network-aware approach that handles the QoS of services and the QoS of the network independently. First, we build a network model in order to estimate the network latency between arbitrary services and potential users. Our selection algorithm then leverages this model to find compositions that will result in a low latency given an employed execution policy. In our evaluation, we show that our approach efficiently computes compositions with much lower latency than current approaches.
Towards robust service compositions in the context of functionally diverse services BIBAFull-Text 969-978
  Florian Wagner; Benjamin Kloepper; Fuyuki Ishikawa; Shinichi Honiden
Web service composition provides a means of customized and flexible integration of service functionalities. Quality-of-Service (QoS) optimization algorithms select services in order to adapt workflows to the non-functional requirements of the user. With an increasing number of services in a workflow, previous approaches fail to achieve sufficient reliability. Moreover, expensive ad-hoc replanning is required to deal with service failures. The major problem with such sequential application of planning and replanning is that it ignores the potential costs during the initial planning, which are consequently hidden from the decision maker. Our basic idea to overcome this substantial problem is to compute a QoS-optimized selection of service clusters that includes a sufficient number of backup services for each service employed. To support the human decision maker in the service selection task, our approach considers the possible repair costs directly in the initial composition. On the basis of a multi-objective approach and using a suitable service selection interface, the decision maker can select compositions in line with his/her personal risk preferences.
CloudGenius: decision support for web server cloud migration BIBAFull-Text 979-988
  Michael Menzel; Rajiv Ranjan
Cloud computing is the latest computing paradigm that delivers hardware and software resources as virtualized services in which users are free from the burden of worrying about the low-level system administration details. Migrating Web applications to Cloud services and integrating Cloud services into existing computing infrastructures is non-trivial. It leads to new challenges that often require innovation of paradigms and practices at all levels: technical, cultural, legal, regulatory, and social. The key problem in mapping Web applications to virtualized Cloud services is selecting the best and compatible mix of software images (e.g., Web server image) and infrastructure services to ensure that Quality of Service (QoS) targets of an application are achieved. A critical issue is that, when selecting Cloud services, engineers must consider heterogeneous sets of criteria and complex dependencies between infrastructure services and software images, which are impossible to resolve manually. To overcome these challenges, we present a framework (called CloudGenius) which automates the decision-making process based on a model and factors specifically for Web server migration to the Cloud. CloudGenius leverages a well-known multi-criteria decision-making technique, called Analytic Hierarchy Process, to automate the selection process based on a model, factors, and QoS parameters related to an application. An example application demonstrates the applicability of the theoretical CloudGenius approach. Moreover, we present an implementation of CloudGenius that has been validated through experiments.

Crowdsourcing

Max algorithms in crowdsourcing environments BIBAFull-Text 989-998
  Petros Venetis; Hector Garcia-Molina; Kerui Huang; Neoklis Polyzotis
Our work investigates the problem of retrieving the maximum item from a set in crowdsourcing environments. We first develop parameterized families of max algorithms that take as input a set of items and output an item from the set that is believed to be the maximum. Such max algorithms could, for instance, select the best Facebook profile that matches a given person or the best photo that describes a given restaurant. Then, we propose strategies that select appropriate max algorithm parameters. Our framework supports various human error and cost models and we consider many of them for our experiments. We evaluate under many metrics, both analytically and via simulations, the tradeoff between three quantities: (1) quality, (2) monetary cost, and (3) execution time. Also, we provide insights on the effectiveness of the strategies in selecting appropriate max algorithm parameters and guidelines for choosing max algorithms and strategies for each application.
Crowdsourcing with endogenous entry BIBAFull-Text 999-1008
  Arpita Ghosh; Preston McAfee
We investigate the design of mechanisms to incentivize high quality outcomes in crowdsourcing environments with strategic agents, when entry is an endogenous, strategic choice. Modeling endogenous entry in crowdsourcing markets is important because there is a nonzero cost to making a contribution of any quality which can be avoided by not participating, and indeed many sites based on crowdsourced content do not have adequate participation. We use a mechanism with monotone, rank-based rewards in a model where agents strategically make participation and quality choices to capture a wide variety of crowdsourcing environments, ranging from conventional crowdsourcing contests with monetary rewards such as TopCoder, to crowdsourced content as in online Q&A forums.
   We begin by explicitly constructing the unique mixed-strategy equilibrium for such monotone rank-order mechanisms, and use the participation probability and distribution of qualities from this construction to address the question of designing incentives for two kinds of rewards that arise in the context of crowdsourcing. We first show that for attention rewards that arise in the crowdsourced content setting, the entire equilibrium distribution and therefore every increasing statistic including the maximum and average quality (accounting for participation), improves when the rewards for every rank but the last are as high as possible. In particular, when the cost of producing the lowest possible quality content is low, the optimal mechanism displays all but the poorest contribution. We next investigate how to allocate rewards in settings where there is a fixed total reward that can be arbitrarily distributed amongst participants, as in crowdsourcing contests. Unlike models with exogenous entry, here the expected number of participants can be increased by subsidizing entry, which could potentially improve the expected value of the best contribution. However, we show that subsidizing entry does not improve the expected quality of the best contribution, although it may improve the expected quality of the average contribution. In fact, we show that free entry is dominated by taxing entry -- making all entrants pay a small fee, which is rebated to the winner along with whatever rewards were already assigned, can improve the quality of the best contribution over a winner-take-all contest with no taxes.
Answering search queries with CrowdSearcher BIBAFull-Text 1009-1018
  Alessandro Bozzon; Marco Brambilla; Stefano Ceri
Web users are increasingly relying on social interaction to complete and validate the results of their search activities. While search systems are superior machines for gathering world-wide information, the opinions collected from friends and expert/local communities can ultimately determine our decisions: human curiosity and creativity are often capable of going far beyond the capabilities of search systems in scouting "interesting" results or suggesting new, unexpected search directions. Such personalized interaction typically occurs apart from the search systems and processes, possibly instrumented and mediated by a social network; when such interaction is completed and users resort to the use of search systems, they do it through new queries, loosely related to the previous search or to the social interaction. In this paper we propose CrowdSearcher, a novel search paradigm that embodies crowds as first-class sources for the information seeking process. CrowdSearcher aims at filling the gap between generalized search systems, which operate upon world-wide information -- including facts and recommendations as crawled and indexed by computerized systems -- and social systems, capable of interacting with real people, in real time, to capture their opinions, suggestions, emotions. The technical contribution of this paper is the discussion of a model and architecture for integrating computerized search with human interaction, by showing how search systems can drive and encapsulate social systems. In particular we show how social platforms, such as Facebook, LinkedIn and Twitter, can be used for crowdsourcing search-related tasks; we demonstrate our approach with several prototypes and we report on our experiments with real user communities.

Social networks

Vertex collocation profiles: subgraph counting for link analysis and prediction BIBAFull-Text 1019-1028
  Ryan N. Lichtenwalter; Nitesh V. Chawla
We introduce the concept of a vertex collocation profile (VCP) for the purpose of topological link analysis and prediction. VCPs provide nearly complete information about the surrounding local structure of embedded vertex pairs. The VCP approach offers a new tool for domain experts to understand the underlying growth mechanisms in their networks and to analyze link formation mechanisms in the appropriate sociological, biological, physical, or other context. The same resolution that gives VCP its analytical power also enables it to perform well when used in supervised models to discriminate potential new links. We first develop the theory, mathematics, and algorithms underlying VCPs. Then we demonstrate VCP methods performing link prediction competitively with unsupervised and supervised methods across several different network families. We conclude with timing results that introduce the comparative performance of several existing algorithms and the practicability of VCP computations on large networks.
Framework and algorithms for network bucket testing BIBAFull-Text 1029-1036
  Liran Katzir; Edo Liberty; Oren Somekh
Bucket testing, also known as split testing, A/B testing, or 0/1 testing, is a widely used method for evaluating users' satisfaction with new features, products, or services. In order not to expose the whole user base to the new service, the mean user satisfaction rate is estimated by exposing the service only to a few uniformly chosen random users. In a recent work, Backstrom and Kleinberg defined the notion of network bucket testing for social services. In this context, users' interactions are only valid for measurement if some minimal number of their friends are also given the service. The goal is to estimate the mean user satisfaction rate while providing the service to the least number of users. This constraint makes uniform sampling, which is optimal for the traditional case, grossly inefficient. In this paper we introduce a simple general framework for designing and evaluating sampling techniques for network bucket testing. The framework is constructed in a way that sampling algorithms are only required to generate sets of users to which the service should be provided. Given an algorithm, the framework produces an unbiased user satisfaction rate estimator and a corresponding variance bound for any network and any user satisfaction function. Furthermore, we present several simple sampling algorithms that are evaluated using both synthetic and real social networks. Our experiments corroborate the theoretical results and demonstrate the effectiveness of the proposed framework and algorithms.
Winner takes all: competing viruses or ideas on fair-play networks BIBAFull-Text 1037-1046
  B. Aditya Prakash; Alex Beutel; Roni Rosenfeld; Christos Faloutsos
Given two competing products (or memes, viruses, etc.) spreading over a given network, can we predict what will happen at the end, that is, which product will 'win', in terms of highest market share? One may naively expect that the better product (stronger virus) will just have a larger footprint, proportional to the quality ratio of the products (or strength ratio of the viruses). However, we prove the surprising result that, under realistic conditions, for any graph topology, the stronger virus completely wipes out the weaker one, thus not merely 'winning' but 'taking it all'. In addition to the proofs, we also demonstrate our result with simulations over diverse, real graph topologies, including the social-contact graph of the city of Portland OR (about 31 million edges and 1 million nodes) and Internet AS router graphs. Finally, we also provide real data about competing products from Google-Insights, like Facebook-Myspace, and we show again that they agree with our analysis.

User interfaces and human factors

A dual-mode user interface for accessing 3D content on the world wide web BIBAFull-Text 1047-1056
  Jacek Jankowski; Stefan Decker
The Web evolved from a text-based system to the current rich and interactive medium that supports images, 2D graphics, audio and video. The major media type that is still missing is 3D graphics. Although various approaches have been proposed (most notably VRML/X3D), they have not been widely adopted. One reason for the limited acceptance is the lack of 3D interaction techniques that are optimal for the hypertext-based Web interface. We present a novel strategy for accessing integrated information spaces, where hypertext and 3D graphics data are simultaneously available and linked. We introduce a user interface that has two modes between which a user can switch at any time: the "don't-make-me-think" mode, driven by simple hypertext-based interactions, where a 3D scene is embedded in hypertext; and the more immersive "take-me-to-the-Wonderland" mode, which immerses the hypertextual annotations into the 3D scene. A user study is presented, which characterizes the user interface in terms of its efficiency and usability.
Exploiting single-user web applications for shared editing: a generic transformation approach BIBAFull-Text 1057-1066
  Matthias Heinrich; Franz Lehmann; Thomas Springer; Martin Gaedke
In the light of the Web 2.0 movement, web-based collaboration tools such as Google Docs have become mainstream and in the meantime serve millions of users. Apart from established collaborative web applications, numerous web editors lack multi-user support even though they are suitable for collaborative work. Enhancing these single-user editors with shared editing capabilities is a costly endeavor since the implementation of a collaboration infrastructure (accommodating conflict resolution, document synchronization, etc.) is required. In this paper, we present a generic transformation approach capable of converting single-user web editors into multi-user editors. Since our approach only requires the configuration of a generic collaboration infrastructure (GCI), the effort to inject shared editing support is significantly reduced in contrast to conventional implementation approaches neglecting reuse. We also report on experimental results of a user study showing that converted editors meet user requirements with respect to software and collaboration qualities. Moreover, we define the characteristics that editors must adhere to in order to leverage the GCI.
"It's simply integral to what I do": enquiries into how the web is weaved into everyday life BIBAFull-Text 1067-1076
  Siân E. Lindley; Sam Meek; Abigail Sellen; Richard Harper
This paper presents findings from a field study of 24 individuals who kept diaries of their web use, across device and location, for a period of four days. Our focus was on how the web was used for non-work purposes, with a view to understanding how this is intertwined with everyday life. While our initial aim was to update existing frameworks of 'web activities', such as those described by Sellen et al. [25] and Kellar et al. [14], our data lead us to suggest that the notion of 'web activity' is only partially useful for an analytic understanding of what it is that people do when they go online. Instead, our analysis leads us to present five modes of web use, which can be used to frame and enrich interpretations of 'activity'. These are respite, orienting, opportunistic use, purposeful use and lean-back internet. We then consider two properties of the web that enable it to be tailored to these different modes, persistence and temporality, and close by suggesting ways of drawing upon these qualities in order to inform design.

WWW 2012-04-16 Volume 2

Industry track presentations

Web-scale user modeling for targeting BIBAFull-Text 3-12
  Mohamed Aly; Andrew Hatch; Vanja Josifovski; Vijay K. Narayanan
We present the experiences from building a web-scale user modeling platform for optimizing display advertising targeting at Yahoo!. The platform described in this paper allows for per-campaign maximization of conversions representing purchase activities or transactions. Conversions directly translate to advertisers' revenue, and thus provide the most relevant metrics of return on advertising investment. We focus on two major challenges: how to efficiently process histories of billions of users on a daily basis, and how to build per-campaign conversion models given the extremely low conversion rates (compared to click rates in a traditional setting). We first present mechanisms for building web-scale user profiles in a daily incremental fashion. Second, we show how to reduce the latency through in-memory processing of billions of user records. Finally, we discuss a technique for scaling the number of handled campaigns/models by introducing an efficient labeling technique that allows for sharing negative training examples across multiple campaigns.
Outage detection via real-time social stream analysis: leveraging the power of online complaints BIBAFull-Text 13-22
  Eriq Augustine; Cailin Cushing; Alex Dekhtyar; Kevin McEntee; Kimberly Paterson; Matt Tognetti
Over the past couple of years, Netflix has significantly expanded its online streaming offerings, which now encompass multiple delivery platforms and thousands of titles available for instant view. This paper documents the design and development of an outage detection system for the online services provided by Netflix. Unlike other internal quality-control measures used at Netflix, this system uses only publicly available information: the tweets, or Twitter posts, that mention the word "Netflix," and has been developed and deployed externally, on servers independent of the Netflix infrastructure. This paper discusses the system and provides an assessment of the accuracy of its real-time detection and alert mechanisms.
Optimizing user exploring experience in emerging e-commerce products BIBAFull-Text 23-32
  Xiubo Geng; Xin Fan; Jiang Bian; Xin Li; Zhaohui Zheng
E-commerce has emerged as a popular channel for Web users to conduct transactions over the Internet. In e-commerce services, users usually prefer to discover information via category browsing rather than querying, since the hierarchical structure underlying category browsing provides a more effective and efficient way to find the properties they are interested in. However, in many emerging e-commerce services, well-defined hierarchical structures are not always available; moreover, in other e-commerce services, the pre-defined hierarchical structures are too coarse and insufficiently intuitive to distinguish properties according to users' interests. This leads to a poor user experience. In this paper, to address these problems, we propose a hierarchical clustering method that builds a query taxonomy automatically from users' exploration behavior, and further propose an intuitive, lightweight approach to construct a browsing list for each cluster to help users discover items of interest. The advantage of our approach is fourfold. First, we build a hierarchical taxonomy automatically, which saves tedious human effort. Second, we provide a fine-grained structure, which helps users reach items of interest efficiently. Third, our hierarchical structure is derived from users' interaction logs and is thus intuitive to users. Fourth, given the hierarchical structure, for each cluster we present both frequently clicked items and retrieved results of queries in the category, which offers more intuitive items to users. We evaluate our work by applying it to the exploration task of a real-world e-commerce service, i.e., an online store for smartphone apps. Experimental results show that our clustering algorithm is efficient and effective in assisting users to discover the properties they are interested in, and further comparisons illustrate that hierarchical topic browsing performs much better than the existing category browsing approach (i.e., the Android Market mobile apps category) in terms of information exploration.
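The query-clustering idea above can be illustrated with a minimal sketch: queries are grouped bottom-up by the overlap of the items users clicked for them. The click log, the Jaccard similarity, the greedy merge order, and the threshold are all illustrative assumptions, not the paper's actual algorithm.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of clicked items."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_queries(click_log, threshold=0.3):
    """Greedy agglomerative clustering: repeatedly merge the two clusters
    whose clicked-item sets are most similar, until no pair exceeds the
    threshold. Each cluster is (set_of_queries, set_of_items)."""
    clusters = [({q}, set(items)) for q, items in click_log.items()]
    while len(clusters) > 1:
        (i, j), best = max(
            ((pair, jaccard(clusters[pair[0]][1], clusters[pair[1]][1]))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda x: x[1])
        if best < threshold:
            break
        qi, ii = clusters[i]
        qj, ij = clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((qi | qj, ii | ij))
    return [sorted(qs) for qs, _ in clusters]

# Toy click log: query -> set of clicked apps (hypothetical data).
log = {
    "angry birds": {"app1", "app2"},
    "bird game":   {"app1", "app2", "app3"},
    "flashlight":  {"app9"},
}
print(cluster_queries(log))
```

In the full method, recording the order of merges would yield the hierarchy used for topic browsing, with a browsing list built per cluster.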
FoCUS: learning to crawl web forums BIBAFull-Text 33-42
  Jingtian Jiang; Nenghai Yu; Chin-Yew Lin
In this paper, we present FoCUS (Forum Crawler Under Supervision), a supervised web-scale forum crawler. The goal of FoCUS is to crawl only relevant forum content from the web with minimal overhead. Forum threads contain the information content that is the target of forum crawlers. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths, connected by specific URL types, to lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL type recognition problem and show how to learn accurate and effective regular expression patterns of implicit navigation paths from an automatically created training set, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages.
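The reduction of forum crawling to URL type recognition can be pictured with a toy classifier. The regex patterns below are hypothetical, hand-written examples of what learned patterns for common forum software might look like; FoCUS learns such patterns automatically from weak page type classifiers.

```python
import re

# Illustrative (hand-written, hypothetical) patterns for the URL types a
# forum crawler distinguishes: index pages, thread pages, and
# page-flipping URLs.
URL_TYPES = [
    ("page-flipping", re.compile(r"[?&]page=\d+|/page-\d+")),
    ("thread",        re.compile(r"/thread[s]?[/-]\d+|showthread\.php\?t=\d+")),
    ("index",         re.compile(r"/forum[s]?[/-]\d+|forumdisplay\.php\?f=\d+")),
]

def classify_url(url):
    """Return the first matching URL type, or 'other'."""
    for label, pattern in URL_TYPES:
        if pattern.search(url):
            return label
    return "other"

print(classify_url("http://example.com/forums/3/threads-42?page=2"))  # → page-flipping
```

A crawler using such a classifier would follow index and page-flipping URLs but archive content only from thread pages.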
Answering math queries with search engines BIBAFull-Text 43-52
  Shahab Kamali; Johnson Apacible; Yasaman Hosseinkashi
Conventional search engines such as Bing and Google provide the user with a short answer to some queries, as well as a ranked list of documents, in order to better meet her information needs. In this paper we study a class of such queries that we call math queries. Calculations (e.g. "12% of 24$", "square root of 120"), unit conversions (e.g. "convert 10 meter to feet"), and symbolic computations (e.g. "plot x^2+x+1") are examples of math queries. Among the queries that should be answered, math queries are special because of the infinite combinations of numbers and symbols, and the rather few keywords, that form them. Answering math queries must be done through real-time computation rather than keyword search or database lookups. The lack of a formal definition for the entire range of math queries makes it hard to automatically identify them all. We propose a novel approach for recognizing and classifying math queries using large-scale search logs, and investigate its accuracy through empirical experiments and statistical analysis. It allows us to discover classes of math queries even if we do not know their structures in advance. It also helps to identify queries that are not math queries even though they might look like them.
   We also evaluate the usefulness of math answers based on implicit feedback from users. Traditional approaches for evaluating the quality of search results mostly rely on click information and interpret a click on a link as a sign of satisfaction. Answers to math queries do not contain links; therefore such metrics are not applicable to them. In this paper we describe two evaluation metrics that can be applied to math queries, and present results on a large collection of math queries taken from Bing's search logs.
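A toy version of the recognition step might start from hand-written surface patterns like the ones below; the actual system instead induces classes of math queries from large-scale search logs, so the class list and regexes here are purely illustrative assumptions.

```python
import re

# Hypothetical surface patterns for three classes of math queries.
CLASSES = [
    ("unit conversion", re.compile(r"^convert\s+[\d.]+\s+\w+\s+to\s+\w+$", re.I)),
    ("calculation",     re.compile(r"^[\d\s+\-*/().%]+$")),
    ("symbolic",        re.compile(r"^plot\s+", re.I)),
]

def classify_math_query(query):
    """Return the first class whose pattern matches, or 'not math'."""
    for label, pattern in CLASSES:
        if pattern.match(query.strip()):
            return label
    return "not math"

print(classify_math_query("convert 10 meter to feet"))  # → unit conversion
print(classify_math_query("12 + 24 * 3"))               # → calculation
```

Such fixed patterns illustrate why a log-driven approach is needed: they cannot cover the open-ended space of math queries, nor reject look-alikes reliably.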
Hierarchical composable optimization of web pages BIBAFull-Text 53-62
  Ronny Lempel; Ronen Barenboim; Edward Bortnikov; Nadav Golbandi; Amit Kagian; Liran Katzir; Hayim Makabee; Scott Roy; Oren Somekh
The process of creating modern Web media experiences is challenged by the need to adapt the content and presentation choices to dynamic real-time fluctuations of user interest across multiple audiences. We introduce FAME -- a Framework for Agile Media Experiences -- which addresses this scalability problem. FAME allows media creators to define abstract page models that are subsequently transformed into real experiences through algorithmic experimentation. FAME's page models are hierarchically composed of simple building blocks, mirroring the structure of most Web pages. They are resolved into concrete page instances by pluggable algorithms which optimize the pages for specific business goals. Our framework allows retrieving dynamic content from multiple sources, defining the experimentation's degrees of freedom, and constraining the algorithmic choices. It offers an effective separation of concerns in the media creation process, enabling multiple stakeholders with profoundly different skills to apply their crafts and perform their duties independently, composing and reusing each other's work in modular ways.
Delta-reasoner: a semantic web reasoner for an intelligent mobile platform BIBAFull-Text 63-72
  Boris Motik; Ian Horrocks; Su Myeon Kim
To make mobile device applications more intelligent, one can combine the information obtained via device sensors with background knowledge in order to deduce the user's current context, and then use this context to adapt the application's behaviour to the user's needs. In this paper we describe Delta-Reasoner, a key component of the Intelligent Mobile Platform (IMP), which was designed to support context-aware applications running on mobile devices. Context-aware applications and the mobile platform impose unusual requirements on the reasoner, which we have met by incorporating advanced features such as incremental reasoning and continuous query evaluation into our reasoner. Although we have so far been able to conduct only a very preliminary performance evaluation, our results are very encouraging: our reasoner exhibits sub-second response time on ontologies whose size significantly exceeds the size of the ontologies used in the IMP.
Rewriting null e-commerce queries to recommend products BIBAFull-Text 73-82
  Gyanit Singh; Nish Parikh; Neel Sundaresan
In e-commerce applications, product descriptions are often concise. E-commerce search engines therefore often have to deal with queries that cannot easily be matched to product inventory, resulting in zero-recall or null-query situations. Null queries arise from differences in buyer and seller vocabulary or from the transient nature of products. In this paper, we describe a system that rewrites null e-commerce queries to find matching products as close to the original query as possible. The system uses query relaxation to rewrite null queries so that they match products. Using eBay as an example of a dynamic marketplace, we show how temporal feedback that respects the product category structure, drawn from a repository of expired products, improves the quality of the recommended results. The system is scalable and can be run in a high-volume setting. We show through our experiments that high-quality product recommendations are achievable for more than 25% of null queries.
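Query relaxation of the kind described can be sketched as follows. The `search` function and the choice to drop the last term first (a stand-in for a learned term-importance model) are illustrative assumptions; the paper's system additionally uses temporal feedback and category structure.

```python
def relax_query(terms, search, min_results=1):
    """Drop one term at a time until the query returns results.
    `search` maps a tuple of terms to a result list."""
    terms = list(terms)
    while terms:
        results = search(tuple(terms))
        if len(results) >= min_results:
            return terms, results
        terms.pop()  # hypothetical importance order: drop the last term
    return [], []

# Toy inventory: an item matches when it contains every query term.
items = ["red nike shoes", "blue nike shoes", "red hat"]
def search(terms):
    return [it for it in items if all(t in it for t in terms)]

print(relax_query(["red", "nike", "sandals"], search))
```

Here the null query "red nike sandals" is relaxed to "red nike", which matches inventory; a production system would rank candidate relaxations rather than apply one fixed drop order.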
Towards expressive exploratory search over entity-relationship data BIBAFull-Text 83-92
  Sivan Yogev; Haggai Roitman; David Carmel; Naama Zwerdling
In this paper we describe a novel approach to exploratory search over rich entity-relationship data that utilizes a unique combination of an expressive yet intuitive query language, faceted search, and graph navigation. We describe an extended faceted search solution that allows indexing, searching, and browsing of rich entity-relationship data. We report experimental results of an evaluation study, using a benchmark of several entity-relationship datasets, demonstrating that our exploratory approach is both effective and efficient compared to other existing approaches.
Data extraction from web pages based on structural-semantic entropy BIBAFull-Text 93-102
  Xiaoqing Zheng; Yiling Gu; Yinsheng Li
Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. A notion called structural-semantic entropy is used to locate the data of interest on web pages; it measures the density of occurrence of relevant information in the DOM tree representation of a web page. Our approach is less labor-intensive and is insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach, which has been successfully applied to detect false drug advertisements on the web owing to its capacity to associate the attributes of records with their respective values.
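The entropy idea can be approximated with a small sketch: score a DOM node by how evenly occurrences of domain keywords spread across its children, so that repetitive data regions (e.g. product listings) score high. The tuple-based tree, the keyword set, and the exact formula are illustrative assumptions, not the paper's definition.

```python
import math

def semantic_hits(node, keywords):
    """Count occurrences of domain keywords in the text under a node.
    A node is (text, [children])."""
    text, children = node
    return sum(text.lower().count(k) for k in keywords) + \
           sum(semantic_hits(c, keywords) for c in children)

def structural_semantic_entropy(node, keywords):
    """Shannon entropy of how keyword hits distribute over a node's
    children: a rough stand-in for the paper's measure, where regions
    with relevant labels recurring evenly across siblings score high."""
    _, children = node
    hits = [semantic_hits(c, keywords) for c in children]
    total = sum(hits)
    if total == 0:
        return 0.0
    probs = [h / total for h in hits if h > 0]
    return -sum(p * math.log2(p) for p in probs)

keywords = {"price", "name", "brand"}
listing = ("", [("name: foo price: $1", []),
                ("name: bar price: $2", []),
                ("name: baz price: $3", [])])
nav = ("", [("home", []), ("about price list", []), ("contact", [])])
print(structural_semantic_entropy(listing, keywords))  # high: even spread
print(structural_semantic_entropy(nav, keywords))      # low: one child only
```

Locating the subtree that maximizes such a score is one plausible way to home in on the data region of a page.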
Clustering and load balancing optimization for redundant content removal BIBAFull-Text 103-112
  Shanzhong Zhu; Alexandra Potapova; Maha Alabduljalil; Xin Liu; Tao Yang
Removing redundant content is an important data processing operation in search engines and other web applications. An offline approach can be important for reducing the engine's cost, but it is challenging to scale such an approach to a large data set that is updated continuously. This paper discusses our experience in developing a scalable approach with parallel clustering that detects and removes near duplicates incrementally when processing billions of web pages. It presents a multidimensional mapping to balance the load among multiple machines. It further describes several approximation techniques to efficiently manage distributed duplicate groups with transitive relationships. The experimental results evaluate the efficiency and accuracy of the incremental clustering, assess the effectiveness of the multidimensional mapping, and demonstrate the impact on online cost reduction and search quality.
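A much-simplified, single-machine sketch of incremental near-duplicate grouping: each incoming page is compared against group representatives and joins the first group whose shingle overlap exceeds a threshold. The shingling scheme, threshold, and representative choice are assumptions for illustration; the paper's approach is distributed and handles transitive relationships approximately.

```python
def shingles(text, k=3):
    """Word k-shingles of a page's text."""
    words = text.split()
    return {" ".join(words[i:i+k]) for i in range(max(1, len(words) - k + 1))}

class DuplicateGroups:
    """Incrementally assign each new page to an existing duplicate group
    when its shingle overlap with the group's representative exceeds a
    threshold; a simplified stand-in for the paper's distributed scheme."""
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.groups = []  # list of (representative_shingles, [page_ids])

    def add(self, page_id, text):
        s = shingles(text)
        for rep, members in self.groups:
            sim = len(s & rep) / len(s | rep)
            if sim >= self.threshold:
                members.append(page_id)
                return members[0]  # group keyed by its first member
        self.groups.append((s, [page_id]))
        return page_id

groups = DuplicateGroups()
print(groups.add("p1", "the quick brown fox jumps over the lazy dog"))  # → p1
```

At web scale this linear scan is replaced by fingerprinting and the multidimensional mapping described above, so that only a small candidate set is compared per page.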

PhD Symposium

From linked data to linked entities: a migration path BIBAFull-Text 115-120
  Giovanni Bartolomeo; Stefano Salsano
Entities have received special attention in recent years; however, their identification is still troublesome. Existing approaches exploit ad hoc services or centralized architectures. In this paper we present a novel approach to recognizing naturally emerging entity identifiers, built on top of Linked Data concepts and protocols.
Cyberbullying detection: a step toward a safer internet yard BIBAFull-Text 121-126
  Maral Dadvar; Franciska de Jong
With the invention of social networks, friendships, relationships, and social communication have all moved to a new level, with new definitions. One may have hundreds of friends without ever seeing their faces. Meanwhile, alongside this transition, there is increasing evidence that online social applications have been used by children and adolescents for bullying. State-of-the-art studies in cyberbullying detection have mainly focused on the content of the conversations while largely ignoring the users involved. We propose that incorporating the users' information, their characteristics, and their post-harassing behaviour (for instance, posting a new status in another social network as a reaction to a bullying experience) will improve the accuracy of cyberbullying detection. Cross-system analyses of the users' behaviour -- monitoring their reactions in different online environments -- can facilitate this process and provide information that could lead to more accurate detection of cyberbullying.
Intelligent crawling of web applications for web archiving BIBAFull-Text 127-132
  Muhammad Faheem
The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently accessed (which leads to suboptimal crawling strategies) and whatever structured content is contained in Web pages (which results in page-level archives whose content is hard to exploit). We focus in this PhD work on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible by Web browsers. Examples include Web forums, social networks, geolocation services, etc. We claim that the best strategy to crawl these applications is to make the Web crawler aware of the kind of application currently processed, allowing it to refine the list of URLs to process, and to annotate the archive with information about the structure of crawled content. We add adaptive characteristics to an archival Web crawler: being able to identify when a Web page belongs to a given Web application and applying the appropriate crawling and content extraction methodology.
Binary RDF for scalable publishing, exchanging and consumption in the web of data BIBAFull-Text 133-138
  Javier D. Fernández
The Web of Data is increasingly producing large RDF datasets from diverse fields of knowledge, pushing the Web towards a data-to-data cloud. However, traditional RDF representations were inspired by a document-centric view, which results in verbose and redundant data that is costly to exchange and post-process. This article discusses an ongoing doctoral thesis addressing efficient formats for the publication, exchange, and consumption of RDF on a large scale. First, a binary serialization format for RDF, called HDT, is proposed. Then, we focus on compressed rich-functional structures, which form part of the efficient HDT representation as well as of most applications operating on huge RDF datasets.
User-generated metadata in audio-visual collections BIBAFull-Text 139-144
  Riste Gligorov
In recent years, crowdsourcing has gained attention as an alternative method for collecting video annotations. An example is the internet video labeling game Waisda?, launched by the Netherlands Institute for Sound and Vision. The goal of this PhD research is to investigate the value of the user tags collected with this video labeling game. To this end, we address the following four issues. First, we perform a comparative analysis between user-generated tags and professional annotations in terms of what aspects of videos they describe. Second, we measure how well user tags are suited for fragment retrieval and compare this with fragment search based on other sources such as transcripts and professional annotations. Third, as previous research suggested that user tags predominantly refer to objects and rarely describe scenes, we will study whether user tags can be successfully exploited to generate scene-level descriptions. Finally, we investigate how tag quality can be characterized and potential methods to improve it.
Scalable search platform: improving pipelined query processing for distributed full-text retrieval BIBAFull-Text 145-150
  Simon Jonassen
In theory, term-wise partitioned indexes may provide higher throughput than document-wise partitioned ones. In practice, term-wise partitioning scales poorly with increasing collection size and intra-query parallelism, which leads to long query latency and poor performance at low query loads. In our work, we have developed several techniques to deal with these problems. Our current results show a significant improvement over the state-of-the-art approach on a small distributed IR system, and our next objective is to evaluate the scalability of the improved approach on a large system. In this paper, we describe the relation between our work and the problem of scalability, summarize the results, limitations, and challenges of our current work, and outline directions for further research.
Building reputation and trust using federated search and opinion mining BIBAFull-Text 151-154
  Somayeh Khatiban
The term online reputation refers to trust relationships amongst agents in dynamic open systems. These can appear as ratings, recommendations, referrals, and feedback. Several reputation models and rating aggregation algorithms have been proposed. However, finding a trusted entity on the web is still an issue, as all reputation systems work individually. The aim of this project is to introduce a global reputation system that aggregates people's opinions from different resources (e.g. e-commerce and review websites) with the help of federated search techniques. A sentiment analysis approach is subsequently used to extract high-quality opinions and inform how to increase trust in the search results.
A semantic policy sharing and adaptation infrastructure for pervasive communities BIBAFull-Text 155-160
  Vikash Kumar
Rule-based information processing has traditionally been vital in many aspects of business, process manufacturing, and information science. The need for rules is magnified further when the limitations of ontology development in OWL are taken into account. In conjunction, the potent combination of ontologies and rule-based applications could be the future of information processing and knowledge representation on the web. However, semantic rules tend to depend on multitudes of parameters and context data, making them less flexible for use in applications where users could benefit from each other by socially sharing intelligence in the form of policies. This work aims to address this issue, which arises in rule-based semantic applications, in the use cases of smart home communities and a privacy-aware m-commerce setting for mobile users. In this paper, we propose a semantic policy sharing and adaptation infrastructure that enables a semantic rule created in one set of environmental, physical, and contextual settings to be adapted for use in a situation where those settings, parameters, or context variables change. The focus is mainly on behavioural policies in the smart home use case and on privacy-enforcing and data-filtering policies in the m-commerce scenario. Finally, we look into the possibility of making this solution application-independent so that the benefits of such a policy adaptation infrastructure can be exploited in other application settings as well.
A generic graph-based multidimensional recommendation framework and its implementations BIBAFull-Text 161-166
  Sangkeun Lee
As the volume of information on the Web grows explosively, recommender systems have become essential tools for helping users find what they need or prefer. Most existing systems are two-dimensional in that they exploit only the User and Item dimensions and perform a typical form of recommendation: 'recommending Item to User'. Yet, in many applications, the capability to deal with multidimensional information and to adapt to various forms of recommendation requests is very important. In this paper, we take a graph-based approach to accomplishing such requirements in recommender systems and present a generic graph-based multidimensional recommendation framework. Based on the framework, we propose two homogeneous graph-based and one heterogeneous graph-based multidimensional recommendation methods. We expect our approach to be useful for increasing recommendation performance and for giving recommender systems the flexibility to incorporate various user intentions into their recommendation process. We present the research results we have reached so far and discuss remaining challenges and future work.
Semi-automatic semantic moderation of web annotations BIBAFull-Text 167-172
  Elaheh Momeni
Many social media portals feature annotation functionality in order to integrate end users' knowledge with existing digital curation processes. This facilitates extending existing metadata about digital resources. However, due to the varying levels of annotators' expertise, the quality of annotations can range from excellent to vague. The automatic evaluation and moderation of annotations (be they troll, vague, or helpful) has not been sufficiently analyzed. Available approaches mostly attempt to solve the problem with distributed moderation systems, which are influenced by factors affecting accuracy (such as imbalanced voting). In contrast, we hypothesize that analyzing and exploiting both the content and context dimensions of annotations may assist the automatic moderation process. In this research, we focus on leveraging the context and content features of social web annotations for semi-automatic semantic moderation. This paper describes the vision of our research, proposes an approach for semi-automatic semantic moderation, introduces an ongoing effort through which we collect data that can serve as a basis for evaluating our assumption, and reports on lessons learned so far.
Modeling the flow and change of information on the web BIBAFull-Text 173-178
  Nataliia Pobiedina
The proposed PhD work approaches the problem of information flow and change on the Web. To model the temporal dynamics of both the Web's structure and its content, the author proposes to apply the framework of stochastic graph transformation systems. This framework is currently widely used in software engineering and model checking. A quantitative and qualitative evaluation of the framework will be performed in a case study of the short-term temporal behavior of economics news on selected English news websites and blogs over a selected time period.
Context-aware image semantic extraction in the social web BIBAFull-Text 179-184
  Massimiliano Ruocco
Media sharing applications such as Panoramio and Flickr contain a huge number of pictures that need to be organized to facilitate browsing and retrieval. Such pictures are often surrounded by a set of metadata or image tags, constituting the image context. With the advent of the Web 2.0 paradigm, especially over the past five years, the concept of image context has further evolved, allowing users to tag their own and other people's pictures. Focusing on tagging, we distinguish between static and dynamic features. The set of static features includes textual and visual features, as well as the contextual information. Further, we may identify other features belonging to the social context that result from usage within the media sharing applications. Due to their dynamic nature, we call these the dynamic set of features. In this work, we assume that every uploaded media item contains both static and dynamic features. In addition, a user may be linked with other users with whom he or she shares common interests. This has resulted in a new series of challenges within the research field of semantic understanding. One of the main goals of this work is to address these challenges.
Augmenting the web with accountability BIBAFull-Text 185-190
  Oshani Wasana Seneviratne
Given the ubiquity of data on the web and the lack of usage restriction enforcement mechanisms, stories of misuse of personal, creative, and other kinds of data are on the rise. There should be both sociological and technological mechanisms facilitating accountability on the web that would prevent such data misuse. Sociological mechanisms appeal to the data consumer's self-interest in adhering to the data provider's desires. This involves a system of rewards, such as recognition and financial incentives, and deterrents, such as legal prohibitions for violations and social pressure. But there is no well-defined technological mechanism for the discovery of accountability, or the lack of it, on the web. As part of my PhD thesis I propose a solution to this problem by designing a web protocol called HTTPA (Accountable HTTP). This protocol will enable data consumers and data producers to agree to specific usage restrictions, preserve the provenance of data transferred from a web server to a client and back to another web server, and, more importantly, provide a mechanism to derive an 'audit trail' for data reuse with the help of a trusted intermediary called a 'Provenance Tracker Network'.
AMBER: turning annotations into knowledge BIBAFull-Text 191-196
  Cheng Wang
Web extraction is the task of turning unstructured HTML into knowledge. Computers are able to generate annotations of unstructured HTML, but it is more important to turn those annotations into structured knowledge. Unfortunately, current systems that extract knowledge from result pages lack accuracy.
   In this proposal, we present AMBER, a fully automated system that turns annotations into structured knowledge from any result page of a given domain. AMBER observes basic domain attributes on a page and leverages repeated occurrences of similar attributes to group related attributes into records. This contrasts with previous approaches, which analyze only the repeated structure of the HTML, since no domain knowledge is available to them. Our multi-domain experimental evaluation on hundreds of sites demonstrates that AMBER achieves accuracy (>98%) comparable to that of a skilled human annotator.
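The record-grouping idea can be sketched by segmenting a page-order stream of attribute annotations whenever a chosen pivot attribute recurs. The pivot heuristic and the `(attribute, value)` stream are illustrative assumptions; AMBER selects and groups attributes automatically using domain knowledge.

```python
def group_into_records(annotations, pivot):
    """Segment a page-order stream of (attribute, value) annotations into
    records, starting a new record each time the pivot attribute recurs:
    a simplified sketch of exploiting repeated attribute occurrences."""
    records, current = [], {}
    for attr, value in annotations:
        if attr == pivot and current:
            records.append(current)
            current = {}
        current[attr] = value
    if current:
        records.append(current)
    return records

# Hypothetical annotation stream from a result page.
stream = [("title", "Hotel A"), ("price", "$90"),
          ("title", "Hotel B"), ("price", "$120"), ("rating", "4.5")]
print(group_into_records(stream, "title"))
```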
Chinese news event 5W1H semantic elements extraction for event ontology population BIBAFull-Text 197-202
  Wei Wang
To relieve "News Information Overload", we propose in this paper a novel approach to extracting 5W1H (who, what, whom, when, where, how) event semantic elements for the construction of a Chinese news event knowledge base. The approach comprises a key event identification step, an event semantic element extraction step, and an event ontology population step. We first use a machine learning method to identify the key events in Chinese news stories. Then we extract the 5W1H event elements by combining SRL and NER techniques with a rule-based method. Finally, we populate the extracted facts about news events into NOEM, an event ontology designed specifically for modeling the semantic elements and relations of events. Our experiments on real online news data sets show the reasonableness and feasibility of our approach.
Semi-structured semantic overlay for information retrieval in self-organizing networks BIBAFull-Text 203-208
  Yulian Yang
As scalability and flexibility have become critical concerns in information management systems, self-organizing networks attract attention from both the research and industrial communities. This work proposes a semi-structured semantic overlay for information retrieval in large-scale self-organizing networks. Retaining autonomy over their own resources, the nodes are organized into a semantic overlay hosting topically discriminative communities. For information retrieval within a community, an unstructured routing approach is employed for the sake of flexibility, while for joining new nodes and routing queries to a distant community, a structured mechanism is designed to save traffic and time. Unlike the semantic overlays in the literature, our proposal makes three contributions: 1. we design topic-based indexing to form and maintain the semantic overlay, guaranteeing both scalability and efficiency; 2. we introduce an unstructured routing approach within the community, allowing flexible node joining and leaving; 3. we take advantage of the interaction among nodes to capture overlay changes and adapt the topic-based indexing correspondingly.

European track presentations

The ERC webdam on foundations of web data management BIBAFull-Text 211-214
  Serge Abiteboul; Pierre Senellart; Victor Vianu
The Webdam ERC grant is a five-year project that started in December 2008. The goal is to develop a formal model for Web data management that would open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring notably data exchange, sharing, integration, querying, and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks, and support services with desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.
WAI-ACT: web accessibility now BIBAFull-Text 215-218
  Shadi Abou-Zahra
The W3C web accessibility standards have now existed for over a decade, yet the implementation of accessible websites, software, and web technologies lags behind this development. This is largely due to a lack of knowledge and expertise among developers and to the fragmentation of web accessibility approaches. It is an opportune time to develop authoritative practical guidance and harmonized approaches, and to research potential challenges and opportunities of future technologies in a collaborative setting. The EC-funded WAI-ACT project addresses these needs through an open cooperation framework that builds on and extends the existing mechanisms of the W3C Web Accessibility Initiative (WAI). This paper presents the WAI-ACT project and how it will drive accessibility implementation in advanced web technologies.
GLOCAL: event-based retrieval of networked media BIBAFull-Text 219-222
  Pierre Andrews; Francesco De Natale; Sven Buschbeck; Anthony Jameson; Kerstin Bischoff; Claudiu S. Firan; Claudia Niederée; Vasileios Mezaris; Spiros Nikolopoulos; Vanessa Murdock; Adam Rae
The idea of the European project GLOCAL is to use events as the central concept for the search, organization, and combination of multimedia content from various sources. For this purpose, methods for event detection and event matching, as well as media analysis, are being developed. The events considered range from private through local to global events.
Combining social web and BPM for improving enterprise performances: the BPM4People approach to social BPM BIBAFull-Text 223-226
  Marco Brambilla; Piero Fraternali; Carmen Karina Vaca Ruiz
Social BPM fuses business process management practices with social networking applications, with the aim of enhancing enterprise performance by means of the controlled participation of external stakeholders in process design and enactment. This project-centered demonstration paper proposes a model-driven approach to the participatory and social enactment of business processes. The approach consists of a specific notation for describing Social BPM behaviors (defined as a BPMN 2.0 extension), a methodology, and a technical framework that allows enterprises to implement social processes as Web applications integrated with public or private social networks. The presented work is performed within the BPM4People SME Capacities project.
Entity oriented search and exploration for cultural heritage collections: the EU cultura project BIBAFull-Text 227-230
  David Carmel; Naama Zwerdling; Sivan Yogev
In this paper we describe an entity oriented search and exploration system that we are developing for the EU Cultura project.
The patents retrieval prototype in the MOLTO project BIBAFull-Text 231-234
  Milen Chechev; Meritxell Gonzàlez; Lluís Màrquez; Cristina España-Bonet
This paper describes the patent retrieval prototype developed within the MOLTO project. The prototype aims to provide a multilingual natural language interface for querying the content of patent documents. The developed system focuses on the biomedical and pharmaceutical domain and includes the translation of patent claims and abstracts into English, French, and German. To obtain the best retrieval results for patent information and text content, patent documents are preprocessed and semantically annotated. The annotations are then stored and indexed in an OWLIM semantic repository, which contains a patent-specific ontology along with others from different domains. The prototype, accessible online at http://molto-patents.ontotext.com, presents a multilingual natural language interface to query the retrieval system. In MOLTO, the multilingualism of the queries is addressed by means of the GF Tool, which provides an easy way to build and maintain controlled language grammars for interlingual translation in limited domains. The abstract representation obtained from GF is used to retrieve both the matched RDF instances and the list of patents semantically related to the user's search criteria. The online interface allows users to browse the retrieved patents and highlights in the text the semantic annotations that explain why a particular patent matched the user's criteria.
End-user-oriented telco mashups: the OMELETTE approach BIBAFull-Text 235-238
  Olexiy Chudnovskyy; Tobias Nestler; Martin Gaedke; Florian Daniel; José Ignacio Fernández-Villamor; Vadim Chepegin; José Angel Fornas; Scott Wilson; Christoph Kögler; Heng Chang
With the success of Web 2.0 we are witnessing a growing number of services and APIs exposed by Telecom, IT and content providers. Targeting the Web community and, in particular, Web application developers, service providers expose capabilities of their infrastructures and applications in order to open new markets and to reach new customer groups. However, due to the complexity of the underlying technologies, the last step, i.e., the consumption and integration of the offered services, is a non-trivial and time-consuming task that is still a prerogative of expert developers. Although many approaches to lower the entry barriers for end users exist, little success has been achieved so far. In this paper, we introduce the OMELETTE project and show how it addresses end-user-oriented telco mashup development. We present the goals of the project, describe its contributions, summarize current results, and describe current and future work.
Multilingual online generation from semantic web ontologies BIBAFull-Text 239-242
  Dana Dannélls; Mariana Damova; Ramona Enache; Milen Chechev
In this paper we report on our ongoing work in the EU project Multilingual Online Translation (MOLTO), supported by the European Union Seventh Framework Programme under grant agreement FP7-ICT-247914. More specifically, we present work on work package 8 (WP8): Case Study: Cultural Heritage. The objective of the work is to build an ontology-based multilingual application for museum information on the Web. Our approach relies on the innovative idea of a Reason-able View of the Web of linked data applied to the domain of cultural heritage. We have been developing a Web application that uses Semantic Web ontologies for generating coherent multilingual natural language descriptions about museum objects. We have been experimenting with museum data to test our approach and find that it performs well for the examined languages.
Making use of social media data in public health BIBAFull-Text 243-246
  Kerstin Denecke; Peter Dolog; Pavel Smrz
Disease surveillance systems exist to offer an easily accessible "epidemiological snapshot" of up-to-date summary statistics for numerous infectious diseases. However, these indicator-based systems represent only part of the solution. Experience shows that they fail when confronted with newly emerging agents, such as the agent that caused the lung disease SARS in 2002. Further, due to slow reporting mechanisms, the time until health threats become visible to public health officials can be long. The M-Eco project provides an event-based approach to the early detection of emerging health threats. The developed technologies exploit content from social media and multimedia data as input and analyze it with sophisticated event-detection techniques to identify potential threats. Alerts for public health threats are provided to the user in a personalized way.
SocialSensor: sensing user generated input for improved media discovery and experience BIBAFull-Text 247-250
  Sotiris Diplaris; Symeon Papadopoulos; Ioannis Kompatsiaris; Ayse Goker; Andrew Macfarlane; Jochen Spangenberg; Hakim Hacid; Linas Maknavicius; Matthias Klusch
SocialSensor will develop a new framework for enabling real-time multimedia indexing and search in the Social Web. The project moves beyond conventional text-based indexing and retrieval models by mining and aggregating user inputs and content over multiple social networking sites. Social Indexing will incorporate information about the structure and activity of the users' social network directly into the multimedia analysis and search process. Furthermore, it will enhance the multimedia consumption experience by developing novel user-centric media visualization and browsing paradigms. For example, SocialSensor will analyse the dynamic and massive user contributions in order to extract unbiased trending topics and events, and will use social connections for improved recommendations. To achieve its objectives, SocialSensor introduces the concept of Dynamic Social COntainers (DySCOs), a new layer of online multimedia content organisation with particular emphasis on the real-time, social and contextual nature of content and information consumption. Through the proposed DySCOs-centered media search, SocialSensor will integrate social content mining, search and intelligent presentation in a personalized, context- and network-aware way, based on the aggregation and indexing of both UGC and multimedia Web content.
The multilingual web: report on multilingualweb initiative BIBAFull-Text 251-254
  David Filip; Dave Lewis; Felix Sasaki
We report on the MultilingualWeb initiative, a collaboration between the W3C Internationalization Activity and the European Commission, realized as a series of EC-funded projects. We review the outcomes of "MultilingualWeb", which conducted four workshops analyzing gaps in Web standardization that currently hinder multilinguality. This gap analysis led to the formation of "MultilingualWeb-LT", a project and W3C Working Group with cross-industry representation that will address priority issues via the standardization of interoperability metadata.
Mobile web applications: bringing mobile apps and web together BIBAFull-Text 255-258
  Marie-Claire Forgue; Dominique Hazaël-Massieux
The popularity of mobile applications is very high and still growing rapidly. These applications allow their users to stay connected with a large number of service providers in a seamless fashion, both for leisure and productivity. But service providers suffer from the high fragmentation of mobile development platforms, which forces them to develop, maintain and deploy their applications in a large number of versions and formats. The Mobile Web Applications (MobiWebApp [1]) EU project aims to build on Europe's strength in mobile technologies to enable European research and industry to strengthen its position in Web technologies and to be active and visible on the mobile applications market.
The CUBRIK project: human-enhanced time-aware multimedia search BIBAFull-Text 259-262
  Piero Fraternali; Marco Tagliasacchi; Davide Martinenghi; Alessandro Bozzon; Ilio Catallo; Eleonora Ciceri; Francesco Nucci; Vincenzo Croce; Ismail Sengor Altingovde; Wolf Siberski; Fausto Giunchiglia; Wolfgang Nejdl; Martha Larson; Ebroul Izquierdo; Petros Daras; Otto Chrons; Ralph Traphoener; Bjoern Decker; John Lomas; Patrick Aichroth; Jasminko Novak; Ghislain Sillaume; F. Sanchez Figueroa; Carolina Salas-Parra
The CUBRIK project is an Integrated Project of the 7th Framework Programme that aims to contribute to the multimedia search domain by opening the architecture of multimedia search engines to the integration of open-source and third-party content annotation and query processing components, and by exploiting the contribution of humans and communities in all phases of multimedia search, from content processing to query processing and relevance feedback processing. The CUBRIK presentation will showcase the architectural concept and scientific background of the project and demonstrate an initial scenario of a human-enhanced content and query processing pipeline.
The webinos project BIBAFull-Text 263-266
  Christian Fuhrhop; John Lyle; Shamal Faily
This poster paper describes the webinos project and presents the architecture and security features developed in webinos. It highlights the main objectives and concepts of the project and describes the architecture derived to achieve the objectives.
DIADEM: domain-centric, intelligent, automated data extraction methodology BIBAFull-Text 267-270
  Tim Furche; Georg Gottlob; Giovanni Grasso; Omer Gunes; Xiaoanan Guo; Andrey Kravchenko; Giorgio Orsi; Christian Schallhart; Andrew Sellers; Cheng Wang
Search engines are the sinews of the web. These sinews have become strained, however: Where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers -- never to be sure a better offer isn't just around the corner. What search engines are missing is understanding of the objects and their attributes published on websites.
   Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near-perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy, provided we supply it with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we show with a first prototype that, in contrast to the alchemists, DIADEM has found a viable formula.
Social media meta-API: leveraging the content of social networks BIBAFull-Text 271-274
  George Papadakis; Konstantinos Tserpes; Emmanuel Sardis; Magdalini Kardara; Athanasios Papaoikonomou; Fotis Aisopos
Social Network (SN) environments are the ideal future service marketplaces. It is well known and documented that SN users are increasing at a tremendous pace. Taking advantage of these social dynamics, as well as of the vast volumes of amateur content generated every second, is a major step towards creating a potentially huge market of services. In this paper, we describe the external web services that the SocIoS project is researching and developing for the social media community. Aiming to support end users of SNs by automating their transactions and improving the production and performance of their workflows over SN inputs and content, this work presents the main architecture, functionality, and benefits of each external service. Finally, it introduces the end user to a new era of SNs with business applicability and better social transactions over SN content.
ARCOMEM: from collect-all ARchives to COmmunity MEMories BIBAFull-Text 275-278
  Thomas Risse; Wim Peters
The ARCOMEM project is about memory institutions like archives, museums and libraries in the age of the Social Web. Social media are becoming more and more pervasive in all areas of life. ARCOMEM's aim is to help to transform archives into collective memories that are more tightly integrated with their community of users and to exploit Web 2.0 and the wisdom of crowds to make Web archiving a more selective and meaning-based process. ARCOMEM (FP7-IST-270239) is an Integrating Project in the FP7 program of the European Commission, which involves twelve partners from academia, industry and public sector. The project will run from January 1, 2011 to December 31, 2013.
Plan4All GeoPortal: web of spatial data BIBAFull-Text 279-282
  Evangelos Sakkopoulos; Tomas Mildorf; Karel Charvat; Inga Berzina; Kai-Uwe Krause
The Plan4All project contributes to the harmonization of spatial data and related metadata in order to make them available on the Web across a linked data platform. A prototype of a European spatial data search portal is already available at http://www.plan4all.eu. The key aim is to provide a methodology and present best practices for the standardization of spatial data according to the INSPIRE principles, and to provide results that serve as reference material for linking data and data specification from the spatial planning point of view. The results include the methodology and implementation of multilingual search for data and common portrayal rules for content providers. These are critical services for sharing and understanding spatial data across Europe. The Plan4All paradigm shows that a clear, applicable methodology for the harmonization of spatial data on all different topics of interest can be achieved efficiently. Plan4All shows that it is possible to build pan-European Web access, to link spatial data, and to utilize multilingual metadata, providing a roadmap for linked spatial data across, and hopefully beyond, Europe. The proposed demonstration, based on Plan4All experience, aims to show experience, best practices and methods for achieving data harmonization and the provision of linked spatial data on the Web.
Multimedia search over integrated social and sensor networks BIBAFull-Text 283-286
  John Soldatos; Moez Draief; Craig Macdonald; Iadh Ounis
This paper presents work in progress within the FP7 EU-funded project SMART to develop a multimedia search engine over content and information stemming from the physical world, as derived through visual, acoustic and other sensors. Among the unique features of the search engine is its ability to respond to social queries, through integrating social networks with sensor networks. Motivated by this innovation, the paper presents and discusses the state-of-the-art in participatory sensing and other technologies blending social and sensor networks.
Tracking entities in web archives: the LAWA project BIBAFull-Text 287-290
  Marc Spaniol; Gerhard Weikum
Web-preservation organizations like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, political scientists, media and market analysts, or experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and on software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity level by detecting named entities, resolving ambiguous names, extracting temporal facts, and visualizing entities over extended time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.
I-SEARCH: a multimodal search engine based on rich unified content description (RUCoD) BIBAFull-Text 291-294
  Thomas Steiner; Lorenzo Sutton; Sabine Spiller; Marilena Lazzaro; Francesco Nucci; Vincenzo Croce; Alberto Massari; Antonio Camurri; Anne Verroust-Blondet; Laurent Joyeux; Jonas Etzold; Paul Grimm; Athanasios Mademlis; Sotiris Malassiotis; Petros Daras; Apostolos Axenopoulos; Dimitrios Tzovaras
In this paper, we report on work around the I-SEARCH EU (FP7 ICT STREP) project, whose objective is the development of a multimodal search engine. We present the project's objectives and detail the achieved results, among which is the Rich Unified Content Description (RUCoD) format.
Enabling users to create their own web-based machine translation engine BIBAFull-Text 295-298
  Andrejs Vasiljevs; Raivis Skadins; Indra Samite
This paper presents European Union co-funded projects that advance the development and use of machine translation (MT) benefiting from the possibilities provided by the Web. Current mass-market and online MT systems are of a general nature and perform poorly for smaller languages and domain-specific texts. The ICT-PSP Programme project LetsMT! develops a user-driven machine translation "factory in the cloud" enabling web users to get customized MT that better fits their needs. Harnessing the huge potential of the web together with open statistical machine translation (SMT) technologies, LetsMT! has created an innovative online collaborative platform for data sharing and building MT. Users can upload their parallel corpora to an online repository and generate user-tailored SMT systems based on user-selected data. The FP7 Programme project ACCURAT researches new methods for accumulating more data from the Web to improve the quality of data-driven machine translation systems. ACCURAT has created techniques and tools to use comparable corpora such as news feeds and multinational web pages. Although the majority of these texts are not direct translations, they share many common paragraphs, sentences, phrases, terms and named entities in different languages which are useful for machine translation.
Semantic evaluation at large scale (SEALS) BIBAFull-Text 299-302
  Stuart N. Wrigley; Raúl García-Castro; Lyndon Nixon
This paper describes the main goals and outcomes of the EU-funded Framework 7 project entitled Semantic Evaluation at Large Scale (SEALS). The growth and success of the Semantic Web is built upon a wide range of Semantic technologies from ontology engineering tools through to semantic web service discovery and semantic search. The evaluation of such technologies -- and, indeed, assessments of their mutual compatibility -- is critical for their sustained improvement and adoption. The SEALS project is creating an open and sustainable platform on which all aspects of an evaluation can be hosted and executed and has been designed to accommodate most technology types. It is envisaged that the platform will become the de facto repository of test datasets and will allow anyone to organise, execute and store the results of technology evaluations free of charge and without corporate bias. The demonstration will show how individual tools can be prepared for evaluation, uploaded to the platform, evaluated according to some criteria and the subsequent results viewed. In addition, the demonstration will show the flexibility and power of the SEALS Platform for evaluation organisers by highlighting some of the key technologies used.
Twitcident: fighting fire with information from social web streams BIBAFull-Text 305-308
  Fabian Abel; Claudia Hauff; Geert-Jan Houben; Richard Stronkman; Ke Tao
In this paper, we present Twitcident, a framework and Web-based system for filtering, searching and analyzing information about real-world incidents or crises. Twitcident connects to emergency broadcasting services and automatically starts tracking and filtering information from Social Web streams (Twitter) when a new incident occurs. It enriches the semantics of streamed Twitter messages to profile incidents and to continuously improve and adapt the information filtering to the current temporal context. Faceted search and analytical tools allow users to retrieve particular information fragments and overview and analyze the current situation as reported on the Social Web. Demo: http://wis.ewi.tudelft.nl/twitcident/
SWiPE: searching wikipedia by example BIBAFull-Text 309-312
  Maurizio Atzori; Carlo Zaniolo
A novel method is demonstrated that allows semantic and well-structured knowledge bases (such as DBpedia) to be easily queried directly from Wikipedia's pages. Using SWiPE, naive users with no knowledge of RDF triples and SPARQL can easily query DBpedia with powerful questions such as: "Who are the U.S. presidents who took office when they were 55 years old or younger, during the last 60 years?" or "Find the towns in California with fewer than 10 thousand people". This is accomplished by a novel Search by Example (SBE) approach where a user can enter the query conditions directly on the Infobox of a Wikipedia page. In fact, SWiPE activates various fields of Wikipedia to allow users to enter query conditions, and then uses these conditions to generate equivalent SPARQL queries and execute them on DBpedia. Finally, SWiPE returns the query results in a form that is conducive to query refinements and further explorations. SWiPE's SBE approach makes semi-structured documents queryable in an intuitive and user-friendly way and, through Wikipedia, delivers the benefits of querying and exploring large knowledge bases to all Web users.
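The Search-by-Example translation described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch, not SWiPE's actual implementation: the property names, the condition representation, and the generated query shape are all assumptions for the example.

```python
# Hypothetical sketch of Search-by-Example query generation in the style of
# SWiPE: infobox field conditions are translated into a SPARQL query over
# DBpedia. Property names and condition syntax are illustrative assumptions.

def conditions_to_sparql(infobox_conditions):
    """Turn {property: (operator, value)} pairs into a SPARQL SELECT query."""
    triples, filters = [], []
    for i, (prop, (op, value)) in enumerate(sorted(infobox_conditions.items())):
        var = f"?v{i}"
        triples.append(f"?entity dbp:{prop} {var} .")
        if op == "=":                      # exact match on a literal
            filters.append(f'{var} = "{value}"')
        else:                              # numeric comparison, e.g. <= or >=
            filters.append(f"{var} {op} {value}")
    body = "\n  ".join(triples)
    cond = " && ".join(filters)
    return (
        "PREFIX dbp: <http://dbpedia.org/property/>\n"
        "SELECT ?entity WHERE {\n"
        f"  {body}\n"
        f"  FILTER({cond})\n"
        "}"
    )

# Conditions a user might enter on a town's infobox fields.
query = conditions_to_sparql({
    "populationTotal": ("<=", 10000),      # "fewer than 10 thousand people"
    "state": ("=", "California"),
})
print(query)
```

The point of the sketch is the division of labor: the user only fills infobox fields, and the system assembles the triple patterns and FILTER clause behind the scenes.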
ProFoUnd: program-analysis-based form understanding BIBAFull-Text 313-316
  Michael Benedikt; Tim Furche; Andreas Savvides; Pierre Senellart
An important feature of web search interfaces is the restrictions enforced on input values -- those reflecting either the semantics of the data or requirements specific to the interface. Both integrity constraints and "access restrictions" can be of great use to web exploration tools. We demonstrate here a novel technique for discovering such constraints that requires no form submissions whatsoever. It works by statically analyzing the JavaScript client-side code used to enforce the constraints, when such code is available. We combine custom recognizers for JavaScript functions relevant to constraint checking with a generic program analysis layer. Integrated with a web browser, our system shows the constraints detected on accessed web forms, and allows a user to see the corresponding JavaScript code fragment.
A social network for video annotation and discovery based on semantic profiling BIBAFull-Text 317-320
  Marco Bertini; Alberto Del Bimbo; Andrea Ferracani; Daniele Pezzatini
This paper presents a system for the social annotation and discovery of videos based on social networks and social knowledge. The system, developed as a web application, allows users to comment and annotate, manually and automatically, video frames and scenes enriching their content with tags, references to Facebook users and pages and Wikipedia resources. These annotations are used to semantically model the interests and the folksonomy of each user and resource in the network, and to suggest to users new resources, Facebook friends and videos whose content is related to their interests. A screencast showing an example of these functionalities is publicly available at: http://vimeo.com/miccunifi/facetube
GovWILD: integrating open government data for transparency BIBAFull-Text 321-324
  Christoph Böhm; Markus Freitag; Arvid Heise; Claudia Lehmann; Andrina Mascher; Felix Naumann; Vuk Ercegovac; Mauricio Hernandez; Peter Haase; Michael Schmidt
Many government organizations publish a variety of data on the web to enable transparency, foster applications, and to satisfy legal obligations. Data content, format, structure, and quality vary widely, even in cases where the data is published using the wide-spread linked data principles. Yet within this data and their integration lies much value: We demonstrate GovWILD, a web-based prototype that integrates and cleanses Open Government Data at a large scale. Apart from the web-based interface that presents a use case of the created dataset at govwild.org, we provide all integrated data as a download. This data can be used to answer questions about politicians, companies, and government funding.
FreeQ: an interactive query interface for freebase BIBAFull-Text 325-328
  Elena Demidova; Xuan Zhou; Wolfgang Nejdl
Freebase is a large-scale open-world database where users collaboratively create and structure content over an open platform. Keyword queries over Freebase are notoriously ambiguous due to the size and the complexity of the dataset. To this end, novel techniques are required to enable naive users to express their informational needs and retrieve the desired data. FreeQ offers users an interactive interface for incremental query construction over a large-scale dataset, so that the users can find desired information quickly and accurately.
Querying socio-spatial networks on the world-wide web BIBAFull-Text 329-332
  Yerach Doytsher; Ben Galon; Yaron Kanza
Location-aware devices, such as navigation systems, allow users to record their location history. This data can be analyzed to generate life patterns -- patterns that associate people with places they frequently visit. Accordingly, a socio-spatial network (SSN) is a graph that consists of (1) a social network, (2) a spatial network, and (3) life patterns that connect the users of the social network to locations, i.e., to geographical entities in the spatial network. In this paper we present a system that stores an SSN in a graph-based database management system and provides a novel query language, SSNQL, for querying the integrated data. The system includes a Web-based graphical user interface for presenting the social network, presenting the spatial network, and posing SSNQL queries over the integrated data. The user interface also depicts the structure of queries for the purposes of debugging and optimization. Our demonstration presents the management of the integrated data as an SSN and illustrates the query evaluation process in SSNQL.
Scalable, flexible and generic instant overview search BIBAFull-Text 333-336
  Pavlos Fafalios; Ioannis Kitsos; Yannis Tzitzikas
In recent years there has been increasing interest in providing top search results while the user types a query letter by letter. In this paper we present and demonstrate a family of instant search applications which, apart from instantly showing the top search results, can show various other kinds of precomputed aggregated information. This paradigm is more helpful for the end user than classic search-as-you-type, since it can combine autocompletion, search-as-you-type, results clustering, faceted search, entity mining, etc. Furthermore, apart from being helpful for the end user, it is also beneficial for the server side. However, the instant provision of such services for a large number of queries, big amounts of precomputed information, and a large number of concurrent users is challenging. We demonstrate how this can be achieved using very modest hardware. Our approach relies on (a) a partitioned trie-based index that exploits the available main memory and disk, and (b) dedicated caching techniques. We report performance results for a server running on a modest personal computer (with 3 GB of main memory) that provides instant services for millions of distinct queries and terabytes of precomputed information. Furthermore, these services are tolerant of user typos and word order.
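The core idea of a trie with precomputed answers per prefix can be sketched as follows. This is a minimal illustration, assuming a simple alphabetical ranking and in-memory data; the paper's actual system additionally partitions the trie between memory and disk and adds caching, which is omitted here.

```python
# Minimal sketch of precomputed instant search: each trie node caches the
# top results for its prefix, so a keystroke costs one node lookup instead
# of a full search. Data and ranking below are illustrative assumptions.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_results = []          # precomputed answer for this prefix

def build_trie(query_results, k=2):
    root = TrieNode()
    for query, results in query_results.items():
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # keep only the k best results at every prefix node
            node.top_results = sorted(set(node.top_results + results))[:k]
    return root

def instant_lookup(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []                  # no indexed query starts this way
        node = node.children[ch]
    return node.top_results

trie = build_trie({
    "lyon": ["Lyon travel guide", "WWW 2012 in Lyon"],
    "lyrics": ["Song lyrics search"],
})
print(instant_lookup(trie, "ly"))
```

Each keystroke narrows the prefix by one node, so the per-keystroke cost is constant regardless of how many queries share the prefix.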
WISER: a web-based interactive route search system for smartphones BIBAFull-Text 337-340
  Roi Friedman; Itsik Hefez; Yaron Kanza; Roy Levin; Eliyahu Safra; Yehoshua Sagiv
Many smartphones nowadays use GPS to detect the location of the user, and can use the Internet to interact with remote location-based services. These two capabilities support online navigation that incorporates search. In this demo we present WISER -- a system for Web-based Interactive Search en Route. In the system, users perform a route search by providing (1) a target location, and (2) search terms that specify types of geographic entities to be visited.
   The task is to find a route that minimizes the travel distance from the initial location of the user to the target, via entities of the specified types. However, planning a route under conditions of uncertainty requires the system to take into account the possibility that some visited entities will not satisfy the search requirements, so that the route may need to go via several entities of the same type. In an interactive search, the user provides feedback regarding her satisfaction with entities she visits during the travel, and the system changes the route, in real time, accordingly. The goal is to use the interaction for computing a route that is more effective than a route that is computed in a non-interactive fashion.
Automatically learning gazetteers from the deep web BIBAFull-Text 341-344
  Tim Furche; Giovanni Grasso; Giorgio Orsi; Christian Schallhart; Cheng Wang
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records, rather than relying on the noisy structure of the DOM. With this approach AMBER is able to identify records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration.
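The bootstrapping idea (seed gazetteer, aligned record slots, iterative learning) can be sketched in a toy form. This is an assumed simplification, not AMBER's implementation: pages are reduced to the aligned location slot of their records, and the 0.5 confidence threshold is invented for the example.

```python
# Toy sketch of AMBER-style gazetteer bootstrapping: on result pages, records
# share an aligned "location" slot. If enough values in that slot already
# match the gazetteer, the remaining values are learned as new locations.

def bootstrap_gazetteer(pages, seed, rounds=4, threshold=0.5):
    gazetteer = set(seed)
    for _ in range(rounds):
        learned = set()
        for slot_values in pages:          # one aligned record slot per page
            known = [v for v in slot_values if v in gazetteer]
            if len(known) / len(slot_values) >= threshold:
                learned |= set(slot_values) - gazetteer
        if not learned:
            break                          # fixpoint reached
        gazetteer |= learned
    return gazetteer

pages = [
    ["London", "Oxford", "Luton"],         # 2 of 3 known -> learn Luton
    ["Luton", "Leeds"],                    # learnable once Luton is known
]
result = bootstrap_gazetteer(pages, seed={"London", "Oxford"})
print(sorted(result))
```

Note how the second page only becomes learnable in a later iteration, once the first iteration has grown the gazetteer -- this chaining is what makes a small seed sufficient, and also why a highly accurate extractor is a prerequisite: errors would compound across iterations.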
FindiLike: preference driven entity search BIBAFull-Text 345-348
  Kavita Ganesan; ChengXiang Zhai
Traditional web search engines enable users to find documents based on topics. However, in finding entities such as restaurants, hotels and products, traditional search engines fail to suffice, as users are often interested in finding entities based on structured attributes, such as price and brand, and on unstructured information, such as the opinions of other web users. In this paper, we showcase a preference-driven search system that enables users to find entities of interest based on a set of structured preferences as well as unstructured opinion preferences. We demonstrate our system in the context of hotel search.
Partisan scale BIBAFull-Text 349-352
  Sedat Gokalp; Hasan Davulcu
The US Senate is the venue of political debates where federal bills are formed and voted on. Senators express their support for or opposition to bills with their votes, and this information makes it possible to extract the polarity of the senators. We use signed bipartite graphs for modeling debates, and we propose an algorithm for partitioning both the senators and the bills comprising a debate into binary opposing camps. Simultaneously, our algorithm scales both the senators and the bills on a univariate scale. Using this scale, a researcher can identify moderate and partisan senators within each camp, as well as polarizing vs. unifying bills. We applied our algorithm to all terms of the US Senate to date for longitudinal analysis, and developed a web-based interactive user interface, www.PartisanScale.com, to visualize the analysis.
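The simultaneous scaling of senators and bills can be sketched as an alternating refinement over the signed vote matrix. This is a simplified illustration under assumed normalization, not the paper's actual algorithm; the votes are fabricated.

```python
# Simplified sketch of scaling a signed bipartite vote graph: senator scores
# and bill scores are refined alternately (a power-iteration on the signed
# vote matrix), so senators who vote alike end up on the same side of the
# scale, and bills are scaled by who supports them.

def scale_votes(votes, iterations=20):
    """votes[s][b] in {+1, -1, 0}; returns (senator_scores, bill_scores)."""
    n_sen = len(votes)
    n_bil = len(votes[0])
    sen = [1.0] * n_sen                    # arbitrary non-zero start
    bil = [0.0] * n_bil
    for _ in range(iterations):
        # bill score: average stance of its supporters and opponents
        bil = [sum(votes[s][b] * sen[s] for s in range(n_sen)) / n_sen
               for b in range(n_bil)]
        # senator score: agreement with the current bill scores
        sen = [sum(votes[s][b] * bil[b] for b in range(n_bil)) / n_bil
               for s in range(n_sen)]
        norm = max(abs(x) for x in sen) or 1.0
        sen = [x / norm for x in sen]      # keep scores in [-1, 1]
    return sen, bil

votes = [                                  # rows: senators, cols: bills
    [+1, +1, -1],                          # camp A
    [+1, +1, -1],                          # camp A
    [-1, -1, +1],                          # camp B
]
sen, _ = scale_votes(votes)
print(sen)
```

On this toy input the two like-voting senators converge to one end of the scale and the third to the other; on real roll-call data the intermediate values are what reveal moderates versus partisans.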
OPAL: a passe-partout for web forms BIBAFull-Text 353-356
  Xiaonan Guo; Jochen Kranzdorf; Tim Furche; Giovanni Grasso; Giorgio Orsi; Christian Schallhart
Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching. Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels to fields by analyzing structural properties in the HTML encoding and visual features of the page rendering. OPAL interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two properties allows OPAL to deal effectively with many forms outside the grasp of existing form-filling techniques. In the UK real estate domain, OPAL achieves >99% accuracy in form understanding.
Round-trip semantics with Sztakipedia and DBpedia spotlight BIBAFull-Text 357-360
  Mihály Héder; Pablo N. Mendes
We describe a tool kit to support a knowledge-enhancement cycle on the Web. In the first step, structured data extracted from Wikipedia is used to construct automatic content enhancement engines. Those engines can then be used to interconnect knowledge in structured and unstructured information sources on the Web, including Wikipedia itself. Sztakipedia-toolbar is a MediaWiki user script which brings DBpedia Spotlight and other kinds of machine intelligence into the wiki editor interface to provide enhancement suggestions to the user. The suggestions offered by the tool focus on complementing knowledge and increasing the availability of structured data on Wikipedia. This will, in turn, increase the information available to the content enhancement engines themselves, completing a virtuous cycle of knowledge enhancement.
ResEval mash: a mashup tool for advanced research evaluation BIBAFull-Text 361-364
  Muhammad Imran; Felix Kling; Stefano Soi; Florian Daniel; Fabio Casati; Maurizio Marchese
In this demonstration, we present ResEval Mash, a mashup platform for research evaluation, i.e., for the assessment of the productivity or quality of researchers, teams, institutions, journals, and the like -- a topic most of us are acquainted with. The platform is specifically tailored to the needs of sourcing data about scientific publications and researchers from the Web, aggregating them, computing metrics (including complex and ad-hoc ones), and visualizing them. ResEval Mash is a hosted mashup platform with a client-side editor and runtime engine, both running inside a common web browser. It supports the processing of large amounts of data, a feature achieved via the sensible distribution of the respective computation steps over client and server. Our preliminary user study shows that ResEval Mash indeed has the power to enable domain experts to develop their own mashups (research evaluation metrics), whereas other mashup platforms rather target skilled developers. The reason for this success is ResEval Mash's domain-specificity.
T@gz: intuitive and effortless categorization and sharing of email conversations BIBAFull-Text 365-368
  Parag Mulendra Joshi; Claudio Bartolini; Sven Graupner
In this paper, we describe T@gz, a system designed for effortless and instantaneous sharing of enterprise knowledge through routine email communications and powerful harvesting of such enterprise information using text analytics techniques. T@gz is a system that enables dynamic, non-intrusive and effortless sharing of information within an enterprise and automatically harvests knowledge from such daily interactions. It also allows enterprise knowledge workers to easily subscribe to new information. It enables self organization of information in conversations while it carefully avoids requiring users to substantially change their usual work-flow of exchanging emails.
   Incorporating this system in an enterprise improves productivity by discovering connections between employees with converging interests and expertise (similar to social networks, naturally leading to the formation of interest groups), by avoiding the problem of information lost in mountains of emails, and by creating expert profiles that map areas of expertise or interest to the organizational map. Harvested information includes a folksonomy appropriate to the organization, tagged and organized conversations, and an expertise map.
Visual oXPath: robust wrapping by example BIBAFull-Text 369-372
  Jochen Kranzdorf; Andrew Sellers; Giovanni Grasso; Christian Schallhart; Tim Furche
Good examples are hard to find, particularly in wrapper induction: Picking even one wrong example can spell disaster by yielding overgeneralized or overspecialized wrappers. Such wrappers extract data with low precision or recall, unless adjusted by human experts at significant cost.
   Visual OXPath is an open-source, visual wrapper induction system that requires minimal examples and eases wrapper refinement: Often it derives the intended wrapper from a single example through sophisticated heuristics that determine the best set of similar examples. To ease wrapper refinement, it offers a list of wrappers ranked by example similarity and robustness. Visual OXPath offers extensive visual feedback for this refinement, which can be performed without any knowledge of the underlying wrapper language. Where further refinement by a human expert is needed, Visual OXPath profits from being based on OXPath, a declarative wrapper language that extends XPath with a thin layer of features necessary for extraction and page navigation.
Adding wings to red bull media: search and display semantically enhanced video fragments BIBAFull-Text 373-376
  Thomas Kurz; Sebastian Schaffert; Georg Güntner; Manuel Fernandez
The Linked Data movement, which aims to publish and interconnect machine-readable data, originated in the last decade. Although the set of (open) data sources is growing rapidly, the integration of multimedia into this Web of Data is still at a very early stage. This paper describes how arbitrary video content and metadata can be processed to identify meaningful linking partners for video fragments -- and thus create a web of linked media. The video test set for our demonstrator is part of the Red Bull Content Pool and confined to the Cliff Diving domain. The candidate set of possible link targets is a combination of a Red Bull thesaurus, information about divers from www.redbull.com, and concepts from DBpedia. The demo includes both a semantic search on videos and video fragments and a player for videos with semantic enhancements.
Kjing: (mix the knowledge) BIBAFull-Text 377-380
  Daniel Lacroix; Yves-Armel Martin
Kjing is a web app that allows users to rapidly set up a multi-screen, multi-device environment and to interact with and distribute content in real time. It can be used for museographic, educational, or conferencing purposes.
Interactive hypervideo visualization for browsing behavior analysis BIBAFull-Text 381-384
  Luis A. Leiva; Roberto Vivó
Processing web interaction data is known to be cumbersome and time-consuming. State-of-the-art web tracking systems usually allow replaying user interactions in the form of mouse tracks, a video-like visualization scheme, to engage practitioners in the analysis process. However, traditional online video inspection has not explored the full capabilities of hypermedia and interactive techniques. In this paper, we introduce a web-based tracking tool that generates interactive visualizations from users' activity. The system unobtrusively collects browser events derived from normal usage, offering a unified framework to inspect interaction data in several ways. We compare our approach to related work in the research community as well as in commercial systems, and describe how ours fits in a real-world scenario. This research shows that there is a wide range of applications where the proposed tool can assist the WWW community.
Simplifying friendlist management BIBAFull-Text 385-388
  Yabing Liu; Bimal Viswanath; Mainack Mondal; Krishna P. Gummadi; Alan Mislove
Online social networks like Facebook allow users to connect, communicate, and share content. The popularity of these services has led to an information overload for their users; the task of simply keeping track of different interactions has become daunting. To reduce this burden, sites like Facebook allow the user to group friends into specific lists, known as friendlists, aggregating the interactions and content from all friends in each friendlist. While this approach greatly reduces the burden on the user, it still forces the user to create and populate the friendlists themselves and, worse, makes the user responsible for maintaining the membership of their friendlists over time. We show that friendlists often have a strong correspondence to the structure of the social network, implying that friendlists may be automatically inferred by leveraging the social network structure. We present a demonstration of Friendlist Manager, a Facebook application that proposes friendlists to the user based on the structure of their local social network, allows the user to tweak the proposed friendlists, and then automatically creates the friendlists for the user.
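A minimal sketch of the structural intuition (not the actual Friendlist Manager algorithm): if we look only at the subgraph of friendships among a user's own friends, each connected component is a natural candidate friendlist.

```python
def propose_friendlists(friends, edges):
    """friends: set of the user's friends; edges: (a, b) tuple pairs of
    friendships among them. Each connected component of this subgraph
    becomes one candidate friendlist."""
    adj = {f: set() for f in friends}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, lists = set(), []
    for f in friends:
        if f in seen:
            continue
        # Depth-first traversal to collect f's connected component.
        stack, comp = [f], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        lists.append(comp)
    return lists
```

Real systems would additionally merge overlapping communities and let the user tweak the output, as the demo described above does.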
The RaiNewsbook: browsing worldwide multimodal news stories by facts, entities and dates BIBAFull-Text 389-392
  Maurizio Montagnuolo; Alberto Messina
This paper presents a novel framework for multimodal news data aggregation, retrieval and browsing. News aggregations are contextualised within automatically extracted information such as entities (i.e. persons, places and organisations), temporal span, categorical topics, social network popularity and audience scores. Further resources coming from professional repositories, and related to the aggregation topics, can be accessed as well. The system is accessible through a Web interface supporting interactive navigation and exploration of large-scale collections of news stories at the topic and context levels. Users can select news topics and sub-topics interactively, building their personal paths towards worldwide events, main characters, dates and contents.
BabelNetXplorer: a platform for multilingual lexical knowledge base access and exploration BIBAFull-Text 393-396
  Roberto Navigli; Simone Paolo Ponzetto
Knowledge on word meanings and their relations across languages is vital for enabling semantic information technologies: in fact, the increasingly multilingual nature of the Web now calls for the development of methods that are both robust and widely applicable for processing textual information in a multitude of languages. In our research, we approach this ambitious task by means of BabelNet, a wide-coverage multilingual lexical knowledge base. In this paper we present an Application Programming Interface and a Graphical User Interface which, respectively, allow programmatic access and visual exploration of BabelNet. Our contribution is to provide the research community with easy-to-use tools for performing multilingual lexical semantic analysis, thereby fostering further research in this direction.
H2RDF: adaptive query processing on RDF data in the cloud BIBAFull-Text 397-400
  Nikolaos Papailiou; Ioannis Konstantinou; Dimitrios Tsoumakos; Nectarios Koziris
In this work we present H2RDF, a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed data store. Our system features two unique characteristics that enable efficient processing of both simple and multi-join SPARQL queries on a virtually unlimited number of triples: join algorithms that execute joins according to query selectivity to reduce processing, and an adaptive choice between centralized and distributed (MapReduce-based) join execution for fast query responses. Our system efficiently answers both simple joins and complex multivariate queries and easily scales to 3 billion triples using a small cluster of 9 worker nodes. H2RDF outperforms state-of-the-art distributed solutions in multi-join and nonselective queries while achieving comparable performance to centralized solutions in selective queries. In this demonstration we showcase the system's functionality through an interactive GUI. Users will be able to execute predefined or custom-made SPARQL queries on datasets of different sizes, using different join algorithms. Moreover, they can repeat all queries utilizing a different number of cluster resources. Using real-time cluster monitoring and detailed statistics, participants will be able to understand the advantages of different execution schemes with respect to the input data, as well as the scalability properties of H2RDF over both the data size and the available worker resources.
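The adaptive-execution idea can be sketched as a simple cost-based dispatch (the threshold and structure below are illustrative assumptions, not H2RDF's actual planner): estimate each join's output cardinality, run the most selective joins first, and route cheap joins to a centralized executor while shipping expensive ones to MapReduce.

```python
# Toy selectivity-based planner. The threshold is an arbitrary
# illustrative value, not a figure from the H2RDF paper.
def plan(joins, threshold=100_000):
    """joins: list of (name, estimated_output_cardinality) pairs.
    Returns (name, engine) pairs, most selective join first."""
    ordered = sorted(joins, key=lambda j: j[1])
    return [(name, "centralized" if card < threshold else "mapreduce")
            for name, card in ordered]
```

For example, a highly selective triple-pattern join would be executed centrally to avoid MapReduce job start-up latency, while a nonselective join would be distributed.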
Paraimpu: a platform for a social web of things BIBAFull-Text 401-404
  Antonio Pintus; Davide Carboni; Andrea Piras
The Web of Things is a scenario in which potentially billions of connected smart objects communicate using Web protocols, HTTP in primis. Envisioning and designing a Web of Things raises several research issues, from protocol adoption and communication models to architectural styles and social aspects. In this demo we present the prototype of a scalable architecture for a large-scale social Web of Things for smart objects and services, named Paraimpu. It is a Web-based platform that allows users to add, use, share and inter-connect real HTTP-enabled smart objects and "virtual" things like services on the Web and social networks. Paraimpu defines and uses a few strong abstractions in order to allow mash-ups of heterogeneous things, introducing powerful rules for data adaptation. Adding and inter-connecting objects is supported through user-friendly models and features.
Automated semantic tagging of speech audio BIBAFull-Text 405-408
  Yves Raimond; Chris Lowis; Roderick Hodgson; Jonathan Tweed
The BBC is currently tagging programmes manually, using DBpedia as a source of tag identifiers, and a list of suggested tags extracted from the programme synopsis. These tags are then used to help navigation and topic-based search of programmes on the BBC website. However, given the very large number of programmes available in the archive, most of them having very little metadata attached to them, we need a way to automatically assign tags to programmes. We describe a framework to do so, using speech recognition, text processing and concept tagging techniques. We describe how this framework was successfully applied to a very large BBC radio archive. We demonstrate an application using automatically extracted tags to aid discovery of archive content.
Baya: assisted mashup development as a service BIBAFull-Text 409-412
  Soudip Roy Chowdhury; Carlos Rodríguez; Florian Daniel; Fabio Casati
In this demonstration, we describe Baya, an extension of Yahoo! Pipes that guides and speeds up development by interactively recommending composition knowledge harvested from a repository of existing pipes. Composition knowledge is delivered in the form of reusable mashup patterns, which are retrieved and ranked on the fly while the developer models his own pipe (the mashup) and that are automatically woven into his pipe model upon selection. Baya mines candidate patterns from pipe models available online and thereby leverages the knowledge of the crowd, i.e., of other developers. Baya is an extension for the Firefox browser that seamlessly integrates with Pipes. It enhances Pipes with a powerful new feature for both expert developers and beginners, speeding up the former and enabling the latter. The discovery of composition knowledge is provided as a service and can easily be extended toward other modeling environments.
S2S architecture and faceted browsing applications BIBAFull-Text 413-416
  Eric Rozell; Peter Fox; Jin Zheng; Jim Hendler
This demo paper discusses a search interface framework designed as part of the Semantic eScience Framework project at the Tetherless World Constellation. The search interface framework, S2S, was designed to facilitate the construction of interactive user interfaces for data catalogs. We use Semantic Web technologies, including an OWL ontology for describing the semantics of data services, as well as the semantics of user interface components. We have applied S2S in three different scenarios: (1) the development of a faceted browse interface integrated with an interactive mapping and visualization tool for biological and chemical oceanographic data, (2) the development of a faceted browser for more than 700,000 open government datasets in over 100 catalogs worldwide, and (3) the development of a user interface for a virtual observatory in the field of solar-terrestrial physics. Throughout this paper, we discuss the architecture of the S2S framework, focusing on its extensibility and reusability, and also review the application scenarios.
Turning a Web 2.0 social network into a Web 3.0, distributed, and secured social web application BIBAFull-Text 417-420
  Henry Story; Romain Blin; Julien Subercaze; Christophe Gravier; Pierre Maret
This demonstration presents the process of transforming a Web 2.0 centralized social network into a Web 3.0, distributed, and secured Social application, and what was learnt in this process. The initial Web 2.0 Social Network application was written by a group of students over a period of 4 months in the spring of 2011. It had all the bells and whistles of the well known Social Networks: walls to post on, circles of friends, etc. The students were very enthusiastic in building their social network, but the chances of it growing into a large community were close to non-existent unless a way could be found to tie it into a bigger social network. This is where linked data protected by the Web Access Control Ontology and WebID authentication could come to the rescue. The paper describes this transformation process, and we will demonstrate the full software version at the conference.
Adding fake facts to ontologies BIBAFull-Text 421-424
  Fabian M. Suchanek; David Gross-Amblard
In this paper, we study how artificial facts can be added to an RDFS ontology. Artificial facts are an easy way of proving the ownership of an ontology: If another ontology contains the artificial fact, it has probably been taken from the original ontology. We show how the ownership of an ontology can be established with provably tight probability bounds, even if only parts of the ontology are being re-used. We explain how artificial facts can be generated in an inconspicuous and minimally disruptive way. Our demo allows users to generate artificial facts and to guess which facts were generated.
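The flavor of such probability bounds can be sketched as follows (an illustration under a simple independence assumption, not the paper's exact analysis): if each artificial fact would appear in an independently built ontology with some small probability p, the chance that an unrelated ontology contains m or more of the k planted facts is a binomial tail, which shrinks rapidly with m.

```python
from math import comb

def false_accusation_prob(k, m, p):
    """P[X >= m] for X ~ Binomial(k, p): the chance that an ontology
    built *independently* of ours happens to contain m or more of our
    k artificial facts, assuming each appears with probability p."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(m, k + 1))
```

With k = 10 planted facts, each occurring by chance with probability 0.01, finding even 5 of them in a suspect ontology is overwhelming evidence of copying.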
CASIS: a system for concept-aware social image search BIBAFull-Text 425-428
  Ba Quan Truong; Aixin Sun; Sourav S. Bhowmick
Tag-based social image search enables users to formulate queries using keywords. However, as queries are usually very short and users have very different interpretations of a particular tag when annotating and searching images, the results returned for a tag query usually contain images related to multiple concepts. We demonstrate Casis, a system for concept-aware social image search. Casis detects tag concepts based on the collective knowledge embedded in social tagging from the initial results to a query. A tag concept is a set of tags highly associated with each other that collectively conveys a semantic meaning. Images returned for a query are then organized by tag concept. Casis provides intuitive and interactive browsing of search results through a tag concept graph, which visualizes the tags defining each tag concept and their relationships within and across concepts. Supporting multiple retrieval methods and multiple concept detection algorithms, Casis offers superior social image search experiences by choosing the most suitable retrieval methods and concept-aware image organizations.
In the mood for affective search with web stereotypes BIBAFull-Text 429-432
  Tony Veale; Yanfen Hao
Models of sentiment analysis in text require an understanding of what kinds of sentiment-bearing language are generally used to describe specific topics. Thus, fine-grained sentiment analysis requires both a topic lexicon and a sentiment lexicon, and an affective mapping between the two. For instance, when one speaks disparagingly about a city (like London, say), what aspects of a city does one generally focus on, and what words are used to disparage those aspects? As when we talk about the weather, our language obeys certain familiar patterns -- what we might call clichés and stereotypes -- when we talk about familiar topics. In this paper we describe the construction of an affective stereotype lexicon, that is, a lexicon of stereotypes and their most salient affective qualities. We show, via a demonstration system called MOODfinger, how this lexicon can be used to underpin the processes of affective query expansion and summarization in a system for retrieving and organizing news content from the Web. Though we adopt a simple bipolar +/- view of sentiment, we show how this stereotype lexicon allows users to coin their own nuanced moods on demand.
Personalized newscasts and social networks: a prototype built over a flexible integration model BIBAFull-Text 433-436
  Luca Vignaroli; Roberto Del Pero; Fulvio Negro
The way we watch television is changing with the introduction of attractive Web activities that move users away from TV to other media. The integration of the cultures of TV and Web is still an open issue. How can we make TV more open? How can we enable a possible collaboration of these two different worlds? TV-Web convergence is much more than placing a Web browser into a TV set or putting TV content into a Web media player. The NoTube project, funded by the European Community, demonstrates how to introduce an open and general set of tools that is adaptable to a number of possible scenarios and allows a designer to implement the targeted final service with ease. A prototype based on the NoTube model, in which a smartphone is used as a secondary screen, is presented. The video demonstration [11] is available at http://youtu.be/dMM7MH9CZY8.
An early warning system for unrecognized drug side effects discovery BIBAFull-Text 437-440
  Hao Wu; Hui Fang; Steven J. Stanhope
Drugs can treat human diseases through chemical interactions between the ingredients and intended targets in the human body. However, the ingredients could unexpectedly interact with off-targets, which may cause adverse drug side effects. Notifying patients and physicians of potential drug effects is an important step in improving healthcare quality and delivery. With the increasing popularity of Web 2.0 applications, more and more patients are discussing drug side effects in online sources. In this paper, we describe our efforts in building UDWarning, a novel early warning system for unrecognized drug side effect discovery based on text information gathered from the Internet. The system can automatically build a knowledge base for drug side effects by integrating information related to drug side effects from different sources. It can also monitor online information about drugs and discover possible unrecognized drug side effects. Our demonstration will show that the system has the potential to expedite the discovery process of unrecognized drug side effects and to improve the quality of healthcare.
Titan: a system for effective web service discovery BIBAFull-Text 441-444
  Jian Wu; Liang Chen; Yanan Xie; Zibin Zheng
With the increasing number of web services and the diversity of user demands, effective web service discovery is becoming a big challenge. Clustering web services would greatly boost the ability of a web service search engine to retrieve relevant ones. In this paper, we propose Titan, a web service search engine that contains 15,969 web services crawled from the Internet. In Titan, two main technologies, i.e., web service clustering and tag recommendation, are employed to improve the effectiveness of web service discovery. Specifically, both WSDL (Web Service Description Language) documents and tags of web services are utilized for clustering, while tag recommendation is adopted to handle some inherent problems of tagging data, e.g., uneven tag distribution and noisy tags.
Deep answers for naturally asked questions on the web of data BIBAFull-Text 445-449
  Mohamed Yahya; Klaus Berberich; Shady Elbassuoni; Maya Ramanath; Volker Tresp; Gerhard Weikum
We present DEANNA, a framework for natural language question answering over structured knowledge bases. Given a natural language question, DEANNA translates it into a structured SPARQL query that can be evaluated over knowledge bases such as Yago, DBpedia, Freebase, or other Linked Data sources. DEANNA analyzes questions and maps verbal phrases to relations and noun phrases to either individual entities or semantic classes. Importantly, it judiciously generates variables for target entities or classes to express joins between multiple triple patterns. We leverage the semantic type system for entities and use constraints in jointly mapping the constituents of the question to relations, classes, and entities. We demonstrate the capabilities and interface of DEANNA, which allows advanced users to influence the translation process and to see how the different components interact to produce the final result.

Poster presentations

Associating structured records to text documents BIBAFull-Text 451-452
  Rakesh Agrawal; Ariel Fuxman; Anitha Kannan; John Shafer; Partha Pratim Talukdar
Postulate two independently created data sources. The first contains text documents, each discussing one or a small number of objects. The second is a collection of structured records, each containing information about the characteristics of some objects. We present techniques for associating structured records to corresponding text documents and empirical results supporting the proposed techniques.
Textual and contextual patterns for sentiment analysis over microblogs BIBAFull-Text 453-454
  Fotis Aisopos; George Papadakis; Konstantinos Tserpes; Theodora Varvarigou
Microblog content poses serious challenges to the applicability of sentiment analysis, due to its inherent characteristics. We introduce a novel method relying on content-based and context-based features, guaranteeing high effectiveness and robustness in the settings we are considering. The evaluation of our methods over a large Twitter data set indicates significant improvements over the traditional techniques.
PAC'nPost: a framework for a micro-blogging social network in an unstructured P2P network BIBAFull-Text 455-456
  H. Asthana; Ingemar J. Cox
We describe a framework for a micro-blogging social network implemented in an unstructured peer-to-peer network. A micro-blogging social network must provide capabilities for users to (i) publish, (ii) follow and (iii) search. Our retrieval mechanism is based on a probably approximately correct (PAC) search architecture in which a query is sent to a fixed number of nodes in the network. In PAC, the probability of attaining a particular accuracy is a function of the number of nodes queried (fixed) and the replication rate of documents (micro-blog). Publishing a micro-blog then becomes a matter of replicating the micro-blog to the required number of random nodes without any central coordination. To solve this, we use techniques from the field of rumour spreading (gossip protocols) to propagate new documents. Our document spreading algorithm is designed such that a document has a very high probability of being copied to only the required number of nodes. Results from simulations performed on networks of 10,000, 100,000 and 500,000 nodes verify our mathematical models. The framework is also applicable for indexing dynamic web pages in a distributed search engine or for a system which indexes newly created BitTorrents in a decentralized environment.
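The replication arithmetic behind PAC-style retrieval can be sketched under a simplified model (independent, uniform node choices; not the paper's full analysis): with r replicas among n nodes, querying q random nodes finds a document with probability roughly 1 - (1 - r/n)^q, which can be inverted to obtain the replication count a publisher must reach.

```python
from math import ceil

def retrieval_prob(replicas, nodes, queried):
    """Chance that at least one of `queried` uniformly chosen nodes
    holds one of `replicas` copies (with-replacement approximation)."""
    return 1 - (1 - replicas / nodes) ** queried

def replicas_needed(nodes, queried, target):
    """Smallest replication count giving at least `target` retrieval
    probability under the same model."""
    return ceil(nodes * (1 - (1 - target) ** (1 / queried)))
```

For instance, in a 10,000-node network where queries touch 100 nodes, roughly 300 replicas suffice for a 95% chance of retrieval, which is the kind of target the gossip-based spreading described above must hit.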
The impact of visual appearance on user response in online display advertising BIBAFull-Text 457-458
  Javad Azimi; Ruofei Zhang; Yang Zhou; Vidhya Navalpakkam; Jianchang Mao; Xiaoli Fern
Display advertising has been a significant source of revenue for publishers and ad networks in the online advertising ecosystem. One of the main goals in display advertising is to maximize user response rates for advertising campaigns, such as click-through rates (CTR) or conversion rates. Although the visual appearance of ads (creatives) matters for the propensity of user response, there is no published work so far addressing this topic via a systematic data-driven approach. In this paper we quantitatively study the relationship between the visual appearance and performance of creatives using large-scale data from the world's largest display ad exchange system, RightMedia. We designed a set of 43 visual features, some of which are novel and some inspired by related work. We extracted these features from real creatives served on RightMedia. Then, we present recommendations of the visual features that have the most important impact on CTR to professional designers so they can optimize their creative design. We believe that the findings presented in this paper will be very useful for the online advertising industry in designing high-performance creatives. We have also designed and conducted an experiment to evaluate the effectiveness of visual features by themselves for CTR prediction.
Impact of ad impressions on dynamic commercial actions: value attribution in marketing campaigns BIBAFull-Text 459-460
  Joel Barajas; Ram Akella; Marius Holtan; Jaimie Kwon; Aaron Flores; Victor Andrei
We develop a descriptive method to estimate the impact of ad impressions on commercial actions dynamically without tracking cookies. We analyze 2,885 campaigns for 1,251 products from the Advertising.com ad network. We compare our method with A/B testing for 2 campaigns, and with a public synthetic dataset.
Audience dynamics of online catch up TV BIBAFull-Text 461-462
  Thomas Beauvisage; Jean-Samuel Beuscart
This paper studies the demand for TV contents on online catch up platforms, in order to assess how catch up TV offers transform TV consumption. We build upon empirical data on French TV consumption in June 2011: a daily monitoring of online audience on web catch up platforms, and live audience ratings of traditional broadcast TV. We provide three main results: 1) online consumption is more concentrated than off-line audience, contradicting the hypothesis of a long tail effect of catch up TV; 2) the temporality of replay TV consumption on the web is very close to the live broadcasting of the programs, thus softening rather than breaking the synchrony of traditional TV; 3) detailed data on online consumption of news reveals two patterns of consumption ("alternative TV ritual" vs. "à la carte").
Group recommendations via multi-armed bandits BIBAFull-Text 463-464
  José Bento; Stratis Ioannidis; S. Muthukrishnan; Jinyun Yan
We study recommendations for persistent groups that repeatedly engage in a joint activity. We approach this as a multi-armed bandit problem. We design a recommendation policy and show it has logarithmic regret. Our analysis also shows that regret depends linearly on d, the size of the underlying persistent group. We evaluate our policy on movie recommendations over the MovieLens and MoviePilot datasets.
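For readers unfamiliar with the setting, here is a sketch of the classic single-user UCB1 policy (a standard baseline that also achieves logarithmic regret, not the authors' group policy): each candidate recommendation is an arm, and the policy balances exploiting the best-looking arm against exploring under-sampled ones.

```python
import math
import random

def ucb1(reward_fn, n_arms, horizon):
    """Standard UCB1: pull each arm once, then repeatedly pick the arm
    maximizing empirical mean + sqrt(2 ln t / pulls). Returns pull counts."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fn(arm)  # observed reward, e.g. the group's rating
        counts[arm] += 1
        sums[arm] += r
    return counts
```

Over time the policy concentrates its pulls on the best arm, pulling suboptimal arms only O(log t) times, which is the source of the logarithmic regret mentioned above.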
A revenue sharing mechanism for federated search and advertising BIBAFull-Text 465-466
  Marco Brambilla; Sofia Ceppi; Nicola Gatti; Enrico H. Gerding
Federated search engines combine search results from two or more (general-purpose or domain-specific) content providers. They enable complex searches (e.g., complete vacation planning) or more reliable results by allowing users to receive high quality results from a variety of sources. We propose a new revenue sharing mechanism for federated search engines, considering the different actors involved in search results generation (i.e., content providers, advertising providers, hybrid content+advertising providers, and content integrators). We extend existing sponsored search auctions by supporting heterogeneous participants and redistribution of monetary values to the different actors, while maintaining flexibility in the payment scheme.
Efficient multi-view maintenance in the social semantic web BIBAFull-Text 467-468
  Matthias Broecheler; Andrea Pugliese; V. S. Subrahmanian
The Social Semantic Web (SSW) refers to the mix of RDF data in web content and social network data associated with those who posted that content. Applications to monitor the SSW are becoming increasingly popular. For instance, marketers want to look for semantic patterns relating to the content of tweets and Facebook posts about their products. Such applications allow multiple users to specify patterns of interest and monitor them in real time as new data gets added to the web or to a social network. In this paper, we develop the concept of SSW view servers, on which all of these types of applications can be simultaneously monitored. The patterns of interest are views. We show that a given set of views can be compiled in multiple possible ways to take advantage of common substructures, and define the concept of an optimal merge. We develop a very fast MultiView algorithm that scalably and efficiently maintains multiple subgraph views. We show that our algorithm is correct, study its complexity, and experimentally demonstrate that our algorithm can scalably handle updates to hundreds of views on real-world SSW databases with up to 540M edges.
BlueFinder: estimate where a beach photo was taken BIBAFull-Text 469-470
  Liangliang Cao; John R. Smith; Zhen Wen; Zhijun Yin; Xin Jin; Jiawei Han
This paper describes a system to estimate geographical locations for beach photos. We develop an iterative method that not only trains visual classifiers but also discovers geographical clusters for beach regions. The results show that it is possible to recognize different beaches using visual information with reasonable accuracy, and our system works 27 times better than random guessing for the geographical localization task.
News comments generation via mining microblogs BIBAFull-Text 471-472
  Xuezhi Cao; Kailong Chen; Rui Long; Guoqing Zheng; Yong Yu
Microblogging websites such as Twitter and Chinese Sina Weibo contain large amounts of microblogs posted by users. Many of these microblogs are highly sensitive to important real-world events and are correlated with news events. Thus, microblogs from these websites can be collected as comments on the news, revealing the opinions and attitudes of a large number of users towards a news event. In this paper, we present a framework to automatically collect relevant microblogs from microblogging websites to generate comments for popular news on news websites.
MobiMash: end user development for mobile mashups BIBAFull-Text 473-474
  Cinzia Cappiello; Maristella Matera; Matteo Picozzi; Alessandro Caio; Mariano Tomas Guevara
The adoption of adequate tools, oriented towards the End User Development (EUD), can promote mobile mashups as "democratic" tools, able to accommodate the long tail of users' specific needs. We introduce MobiMash, a novel approach and a platform for the construction of mobile mashups, characterized by a lightweight composition paradigm, mainly guided by the notion of visual templates. The composition paradigm generates an application schema that is based on a domain specific language addressing dimensions for data integration and service orchestration, and that guides at run-time the dynamic instantiation of the final mobile app.
Privacy management for online social networks BIBAFull-Text 475-476
  Gorrell P. Cheek; Mohamed Shehab
We introduce a privacy management approach that leverages users' memory and opinion of their friends to set policies for other similar friends. We refer to this new approach as Same-As Privacy Management. To demonstrate the effectiveness of our privacy management improvements, we implemented a prototype Facebook application and conducted an extensive user study. We demonstrated considerable reductions in policy authoring time using Same-As Privacy Management over traditional group based privacy management approaches. Finally, we presented user perceptions, which were very encouraging.
Fast and cost-efficient bid estimation for contextual ads BIBAFull-Text 477-478
  Ye Chen; Pavel Berkhin; Jie Li; Sharon Wan; Tak W. Yan
We study the problem of estimating the value of a contextual ad impression, based upon which an ad network bids on an exchange. The ad impression opportunity materializes into revenue only if the ad network wins the impression and a user clicks on the ads, both rare events, especially in an open exchange for contextual ads. Given the low revenue expectation and the elusive nature of predicting weak-signal click-through rates, the computational cost incurred by bid estimation must be cautiously justified. We developed and deployed a novel impression valuation model, which is expected to reduce the computational cost by 95% and hence more than double the profit. Our approach is highly economized through a fast implementation of kNN regression that primarily leverages low-dimensional sell-side data (user and publisher). We also address the cold-start problem, or the exploration vs. exploitation requirement, by Bayesian smoothing using a beta prior, and adapt to temporal dynamics using an autoregressive model.
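Bayesian smoothing of click-through rates with a beta prior, as mentioned in the abstract above, is a standard technique; a minimal sketch follows. This is not the authors' deployed model, and the prior parameters are hypothetical.

```python
# Beta-prior (Bayesian) smoothing of click-through rate estimates:
# a generic sketch of the technique, not the paper's implementation.
# The prior parameters alpha and beta are illustrative values.

def smoothed_ctr(clicks, impressions, alpha=1.0, beta=1000.0):
    """Posterior-mean CTR under a Beta(alpha, beta) prior.

    With few impressions the estimate stays near the prior mean
    alpha / (alpha + beta); with many impressions it approaches
    the empirical rate clicks / impressions.
    """
    return (clicks + alpha) / (impressions + alpha + beta)

# A cold-start ad (no data) falls back to the prior mean.
cold = smoothed_ctr(0, 0)        # = 1 / 1001, the prior mean
warm = smoothed_ctr(50, 10000)   # near 0.005, pulled slightly toward the prior
```

The posterior mean interpolates between prior and data, which is exactly what makes it a natural answer to the cold-start / exploration-exploitation tension the abstract raises.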
Fast query evaluation for ad retrieval BIBAFull-Text 479-480
  Ye Chen; Mitali Gupta; Tak W. Yan
We describe a fast query evaluation method for ad document retrieval in online advertising, based upon the classic WAND algorithm. The key idea is to localize per-topic term upper bounds into homogeneous ad groups. Our approach is not only theoretically motivated by a topical mixture model, but also empirically justified by the characteristics of the ad domain, that is, short and semantically focused documents with a natural hierarchy. We report experimental results using artificial and real-world query-ad retrieval data, and show that the tighter-bound WAND outperforms the traditional approach with a 35.4% reduction in the number of full evaluations.
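The WAND pruning idea the abstract builds on can be sketched as follows: a candidate document receives a full (expensive) evaluation only if the sum of per-term score upper bounds can clear the current threshold. The data and bound values below are invented for illustration; the paper's contribution is tightening these bounds per ad group.

```python
# Sketch of WAND-style upper-bound pruning (illustrative only).
# Tighter per-group bounds let more documents be skipped without
# a full scoring pass.

def wand_candidates(doc_terms, upper_bound, threshold):
    """Return docs whose term upper-bound sum clears the threshold.

    doc_terms: dict doc_id -> set of query terms the doc contains
    upper_bound: dict term -> upper bound on that term's score contribution
    """
    survivors = []
    for doc, terms in doc_terms.items():
        if sum(upper_bound[t] for t in terms) >= threshold:
            survivors.append(doc)  # only these need full evaluation
    return survivors

docs = {"ad1": {"car", "cheap"}, "ad2": {"cheap"}, "ad3": {"car"}}
ub = {"car": 2.0, "cheap": 0.5}
pruned = wand_candidates(docs, ub, threshold=2.0)  # ad2 is skipped
```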
CONSENTO: a consensus search engine for answering subjective queries BIBAFull-Text 481-482
  Jaehoon Choi; Donghyeon Kim; Seongsoon Kim; Junkyu Lee; Sangrak Lim; Sunwon Lee; Jaewoo Kang
Search engines have become an important decision making tool today. Decision making queries are often subjective, such as 'best sedan for family use' or 'best action movies in 2010,' to name a few. Unfortunately, such queries cannot be answered properly by conventional search systems. In order to address this problem, we introduce Consento, a consensus search engine designed to answer subjective queries. Consento performs subdocument-level indexing to more precisely capture semantics from user opinions. We also introduce a new ranking method, ConsensusRank, which counts an online comment referring to an entity as a weighted vote for that entity. We validated the framework with an empirical study using movie review data.
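The weighted-vote intuition behind a consensus-style ranking can be sketched in a few lines. This is an illustration of the general idea only; the vote weights and the aggregation here are hypothetical, not the paper's ConsensusRank formula.

```python
# Toy consensus ranking: each comment mentioning an entity casts a vote
# weighted by, e.g., its relevance or sentiment score (weights invented).
from collections import defaultdict

def consensus_rank(comment_votes):
    """comment_votes: iterable of (entity, weight) pairs -> ranked entities."""
    score = defaultdict(float)
    for entity, weight in comment_votes:
        score[entity] += weight           # accumulate weighted votes
    return sorted(score, key=score.get, reverse=True)

votes = [("Inception", 0.9), ("Avatar", 0.6), ("Inception", 0.8)]
ranking = consensus_rank(votes)           # Inception outranks Avatar
```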
Good abandonments in factoid queries BIBAFull-Text 483-484
  Aleksandr Chuklin; Pavel Serdyukov
It is often considered that a high abandonment rate corresponds to poor IR system performance. However, several studies have suggested that there are so-called good abandonments, i.e., situations when the search engine result page (SERP) contains enough detail to satisfy the user's information need without any necessity to click on search results. In those papers only editorial metrics of the SERP were used, and one cannot be sure that situations marked as good abandonments by assessors actually imply user satisfaction. In the present work we provide real-world evidence for good abandonments by calculating the correlation between editorial and click metrics.
Potential good abandonment prediction BIBAFull-Text 485-486
  Aleksandr Chuklin; Pavel Serdyukov
Abandonment rate is one of the most broadly used online user satisfaction metrics. In this paper we discuss the notion of potential good abandonment, i.e., queries that may potentially result in user satisfaction without the need to click on search results (if the search engine result page contains enough detail to satisfy the user's information need). We show that we can train a classifier which distinguishes between potential good and bad abandonments with rather good results compared to our baseline. As a case study, we show how to apply these ideas to IR evaluation and introduce a new metric for A/B testing -- Bad Abandonment Rate.
Ubiquitous access control for SPARQL endpoints: lessons learned and future challenges BIBAFull-Text 487-488
  Luca Costabello; Serena Villata; Nicolas Delaforge; Fabien Gandon
We present and evaluate a context-aware access control framework for SPARQL endpoints queried from mobile devices.
Mining for insights in the search engine query stream BIBAFull-Text 489-490
  Ovidiu Dan; Pavel Dmitriev; Ryen W. White
Search engines record a large amount of metadata each time a user issues a query. While efficiently mining this data can be challenging, the results can be useful in multiple ways, including monitoring search engine performance, improving search relevance, prioritizing research, and optimizing day-to-day operations. In this poster, we describe an approach for mining query log data for actionable insights -- specific query segments (sets of queries) that require attention, and actions that need to be taken to improve the segments. Starting with a set of important metrics, we identify query segments that are "interesting" with respect to these metrics using a distributed frequent itemset mining algorithm.
Developing domain-specific mashup tools for end users BIBAFull-Text 491-492
  Florian Daniel; Muhammad Imran; Felix Kling; Stefano Soi; Fabio Casati; Maurizio Marchese
The recent emergence of mashup tools has refueled research on end user development, i.e., on enabling end users without programming skills to compose their own applications. Yet, similar to what happened with analogous promises in web service composition and business process management, research has mostly focused on technology and, as a consequence, has failed its objective. Plain technology (e.g., SOAP/WSDL web services) and simple modeling languages (e.g., Yahoo! Pipes) don't convey enough meaning to non-programmers. We propose a domain-specific approach to mashups that "speaks the language of the user", i.e., that is aware of the terminology, concepts, rules, and conventions (the domain) the user is comfortable with. We show what developing a domain-specific mashup tool means, which roles the mashup meta-model and the domain model play, and how these can be merged into a domain-specific mashup meta-model. We apply the approach by implementing a mashup tool for the research evaluation domain. Our user study confirms that domain-specific mashup tools indeed lower the entry barrier to mashup development.
Discovery and reuse of composition knowledge for assisted mashup development BIBAFull-Text 493-494
  Florian Daniel; Carlos Rodriguez; Soudip Roy Chowdhury; Hamid R. Motahari Nezhad; Fabio Casati
Despite the emergence of mashup tools like Yahoo! Pipes or JackBe Presto Wires, developing mashups is still non-trivial and requires intimate knowledge of the functionality of web APIs and services, their interfaces, parameter settings, data mappings, and so on. We aim to assist the mashup process and to turn it into an interactive co-creation process, in which one part of the solution comes from the developer and the other part from reusable composition knowledge that has proven successful in the past. We harvest composition knowledge from a repository of existing mashup models by mining a set of reusable composition patterns, which we then use to interactively provide composition recommendations to developers while they model their own mashup. Upon acceptance of a recommendation, the purposeful design of the respective pattern types allows us to automatically weave the chosen pattern into a partial mashup model, in practice performing a set of modeling actions on behalf of the developer. The experimental evaluation of our prototype implementation demonstrates that it is indeed possible to harvest meaningful, reusable knowledge from existing mashups, and that even complex recommendations can be efficiently queried and woven inside the client browser.
Towards personalized learning to rank for epidemic intelligence based on social media streams BIBAFull-Text 495-496
  Ernesto Diaz-Aviles; Avaré Stewart; Edward Velasco; Kerstin Denecke; Wolfgang Nejdl
In the presence of sudden outbreaks, how can social media streams be used to strengthen surveillance capabilities? In May 2011, Germany reported one of the largest described outbreaks of Enterohemorrhagic Escherichia coli (EHEC). By the end of June, 47 persons had died. After the detection of the outbreak, authorities investigating the cause and the impact on the population were interested in the analysis of micro-blog data related to the event. Since thousands of tweets related to this outbreak were produced every day, this task was overwhelming for the experts participating in the investigation. In this work, we propose a Personalized Tweet Ranking algorithm for Epidemic Intelligence (PTR4EI) that provides users a personalized, short list of tweets based on the user's context. PTR4EI is based on a learning-to-rank framework and exploits, as features, complementary context information extracted from social hash-tagging behavior in Twitter. Our experimental evaluation on a dataset collected in real-time during the EHEC outbreak shows the superior ranking performance of PTR4EI. We believe our work can serve as a building block for an open early warning system based on Twitter, helping to realize the vision of Epidemic Intelligence for the Crowd, by the Crowd.
D2RQ/update: updating relational data via virtual RDF BIBAFull-Text 497-498
  Vadim Eisenberg; Yaron Kanza
D2RQ is a popular RDB-to-RDF mapping platform that supports mapping relational databases to RDF and posing SPARQL queries to these relational databases. However, D2RQ merely provides a read-only RDF view on relational databases. Thus, we introduce D2RQ/Update -- an extension of D2RQ to enable executing SPARQL/Update statements on the mapped data, and to facilitate the creation of a read-write Semantic Web.
HeterRank: addressing information heterogeneity for personalized recommendation in social tagging systems BIBAFull-Text 499-500
  Wei Feng; Jianyong Wang
A social tagging system provides users an effective way to collaboratively annotate and organize items with their own tags. A social tagging system contains heterogeneous information such as users' tagging behaviors, social networks, tag semantics and item profiles. All of this heterogeneous information helps alleviate the cold-start problem due to data sparsity. In this paper, we model a social tagging system as a multi-type graph and propose a graph-based ranking algorithm called HeterRank for tag recommendation. Experimental results on three publicly available datasets, i.e., CiteULike, Last.fm and Delicious, prove the effectiveness of HeterRank for tag recommendation with heterogeneous information.
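Graph-based ranking over a multi-type graph typically belongs to the random-walk-with-restart family; a minimal sketch is below. This is not HeterRank itself, and the toy graph (one user, one tag, one item) is invented for illustration.

```python
# Random walk with restart over a small mixed-type graph (users, tags,
# items all as plain nodes). Illustrative sketch only, not HeterRank.

def rank_with_restart(adj, seed, alpha=0.85, iters=50):
    """adj: node -> list of neighbour nodes; returns stationary scores."""
    nodes = list(adj)
    score = {n: 0.0 for n in nodes}
    score[seed] = 1.0
    for _ in range(iters):
        # restart mass (1 - alpha) goes back to the seed each step
        nxt = {n: (1 - alpha) * (n == seed) for n in nodes}
        for n in nodes:
            out = adj[n]
            for m in out:
                nxt[m] += alpha * score[n] / len(out)  # spread walk mass
        score = nxt
    return score

adj = {"user": ["tag1", "item"], "tag1": ["user", "item"],
       "item": ["tag1", "user"]}
scores = rank_with_restart(adj, seed="user")  # seed node ranks highest
```

Nodes close to the seed in the heterogeneous graph accumulate the most probability mass, which is what makes such scores usable as personalized tag recommendations.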
Domain adaptive answer extraction for discussion boards BIBAFull-Text 501-502
  Ankur Gandhe; Dinesh Raghu; Rose Catherine
Answer extraction from discussion boards is an extensively studied problem. Most of the existing work is focused on supervised methods for extracting answers using similarity features and forum-specific features. Although this works well for the domain or forum data that it has been trained on, it is difficult to use the same models for a domain where the vocabulary is different and some forum specific features may not be available. In this poster, we report initial results of a domain adaptive answer extractor that performs the extraction in two steps: a) an answer recognizer identifies the sentences in a post which are likely to be answers, and b) a domain relevance module determines the domain significance of the identified answer. We use domain independent methodology that can be easily adapted to any given domain with minimum effort.
Towards multiple identity detection in social networks BIBAFull-Text 503-504
  Kahina Gani; Hakim Hacid; Ryan Skraba
In this paper we discuss work that provides some insights into the resolution of the hard problem of multiple identity detection. Based on the hypothesis that each person is unique and identifiable by his or her writing style or social behavior, we propose a framework relying on machine learning models and a deep analysis of social interactions to achieve such detection.
How shall we catch people's concerns in micro-blogging? BIBAFull-Text 505-506
  Heng Gao; Qiudan Li; Hongyun Bao; Shuangyong Song
In micro-blogging, people talk about their daily lives and exchange ideas freely; thus, by mining people's interests in micro-blogging, we can easily perceive the pulse of society. In this paper, we capture what people care about in their daily lives by discovering meaningful communities based on a probabilistic factor model (PFM). The proposed solution identifies people's interests from their friendships and content information. Therefore, it reveals the behaviors of people in micro-blogging naturally. Experimental results verify the effectiveness of the proposed model and vividly show people's social lives.
Link prediction via latent factor BlockModel BIBAFull-Text 507-508
  Sheng Gao; Ludovic Denoyer; Patrick Gallinari
In this paper we address the problem of link prediction in networked data, which appears in many applications such as social network analysis or recommender systems. Previous studies either consider latent feature based models while disregarding the local structure of the network, or focus exclusively on capturing the local structure of objects with latent blockmodels without coupling it with the latent characteristics of objects. To combine the benefits of previous work, we propose a novel model that can incorporate the effects of latent features of objects and local structure in the network simultaneously. To achieve this, we model the relation graph as a function of both latent feature factors and latent cluster memberships of objects to collectively discover globally predictive intrinsic properties of objects and capture latent block structure in the network to improve prediction performance. Extensive experiments on several real world datasets suggest that our proposed model outperforms the other state of the art approaches for link prediction.
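The combination of latent feature factors and block memberships in a link score can be illustrated with a toy scoring function. All dimensions, weights, and the affinity matrix below are made up; this shows the shape of such a model, not the paper's exact formulation.

```python
# Toy link-prediction score combining a latent-feature dot product with
# a block (cluster) affinity term, squashed to a probability. Values
# are invented for illustration.
import math

def link_score(u_feat, v_feat, u_block, v_block, block_affinity):
    """Logistic link probability from latent features + block affinity."""
    latent = sum(a * b for a, b in zip(u_feat, v_feat))
    block = block_affinity[u_block][v_block]
    return 1.0 / (1.0 + math.exp(-(latent + block)))

# Within-block links assumed more likely than cross-block links.
affinity = [[1.5, -0.5],
            [-0.5, 1.0]]
same = link_score([0.2, 0.1], [0.3, 0.4], 0, 0, affinity)
cross = link_score([0.2, 0.1], [0.3, 0.4], 0, 1, affinity)
```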
Using toolbar data to understand Yahoo! Answers usage BIBAFull-Text 509-510
  Giovanni Gardelli; Ingmar Weber
We use Yahoo! Toolbar data to gain insights into why people use Q&A sites. We look at questions asked on Yahoo! Answers and analyze both the pre-question behavior of users as well as their general online behavior. Our results indicate that there is a one-dimensional spectrum of users ranging from "social users" to "informational users" and that web search and Q&A sites complement each other, rather than compete. Concerning the pre-question behavior, users who first issue a question-related query are more likely to issue informational questions, rather than conversational ones, and such questions are less likely to attract an answer. Finally, we only find weak evidence for topical congruence between a user's questions and his web queries.
SnoopyTagging: recommending contextualized tags to increase the quality and quantity of meta-information BIBAFull-Text 511-512
  Wolfgang Gassler; Eva Zangerle; Martin Bürgler; Günther Specht
Current mass-collaboration platforms use tags to annotate and categorize resources enabling effective search capabilities. However, as tags are freely chosen keywords, the resulting tag vocabulary is very heterogeneous. Another shortcoming of simple tags is that they do not allow for a specification of context to create meaningful metadata. In this paper we present the SnoopyTagging approach which supports the user in the process of creating contextualized tags while at the same time decreasing the heterogeneity of the tag vocabulary by facilitating intelligent self-learning recommendation algorithms.
Comparative evaluation of JavaScript frameworks BIBAFull-Text 513-514
  Andreas Gizas; Sotiris Christodoulou; Theodore Papatheodorou
For web programmers, it is important to choose a JavaScript framework that not only serves their current web project needs, but also provides code of high quality and good performance. The scope of this work is to provide a thorough quality and performance evaluation of the most popular JavaScript frameworks, taking into account well-established software quality factors and performance tests. The major outcome is that we highlight the pros and cons of JavaScript frameworks in various areas of interest and identify which parts of their code are problematic and probably need to be improved in future versions.
Getting more RDF support from relational databases BIBAFull-Text 515-516
  François Goasdoué; Ioana Manolescu; Alexandra Roatis
We introduce the database fragment of RDF, which extends the popular Description Logic fragment, in particular with support for incomplete information. We then provide novel sound and complete saturation- and reformulation-based techniques for answering the Basic Graph Pattern queries of SPARQL in this fragment. Notably, we extend the state of the art on pushing RDF query processing within robust / efficient relational database management systems. Finally, we experimentally compare our query answering techniques using well-established datasets.
S2ORM: exploiting syntactic and semantic information for opinion retrieval BIBAFull-Text 517-518
  Liqiang Guo; Xiaojun Wan
Opinion retrieval is the task of finding documents that express an opinion about a given query. A key challenge in opinion retrieval is to capture the query-related opinion score of a document. Existing methods rely mainly on the proximity information between the opinion terms and the query terms to address the key challenge. In this study, we propose to incorporate the syntactic and semantic information of terms into a probabilistic language model in order to capture the query-related opinion score more accurately.
All our messages are belong to us: usable confidentiality in social networks BIBAFull-Text 519-520
  Marian Harbach; Sascha Fahl; Thomas Muders; Matthew Smith
Current online social networking (OSN) sites pose severe risks to their users' privacy. Facebook in particular is capturing more and more of a user's past activities, sometimes starting from the day of birth. Instead of transiently passing on information between friends, a user's data is stored persistently and is therefore subject to the risk of undesired disclosure. Traditionally, a regular user of a social network has little awareness of her privacy needs on the Web or is not ready to invest considerable effort in securing her online activities. Furthermore, the centralised nature of proprietary social networking platforms simply does not cater for end-to-end privacy protection mechanisms. In this paper, we present a non-disruptive and lightweight integration of a confidentiality mechanism into OSNs. Additionally, the direct integration of visual security indicators into the OSN UI raises users' awareness of (un)protected content and thus of their own privacy. We present a fully-working prototype for Facebook and an initial usability study, showing that, on average, untrained users can be ready to use the service in three minutes.
Populating personal linked data caches using context models BIBAFull-Text 521-522
  Olaf Hartig; Tom Heath
The emergence of a Web of Data enables new forms of application that require expressive query access, for which mature, Web-scale information retrieval techniques may not be suited. Rather than attempting to deliver expressive query capabilities at Web-scale, this paper proposes the use of smaller, pre-populated data caches whose contents are personalized to the needs of an individual user. We present an approach to a priori population of such caches with Linked Data harvested from the Web, seeded by a simple context model for each user, which is progressively enriched by executing a series of enrichment rules over Linked Data from the Web. Such caches can act as personal data stores supporting a range of different applications. A comprehensive user evaluation demonstrates that our approach can accurately predict the relevance of attributes added to the context model and the execution probability of queries based on these attributes, thereby optimizing the cache population process.
Probabilistic critical path identification for cost-effective monitoring of service-based web applications BIBAFull-Text 523-524
  Qiang He; Jun Han; Yun Yang; Jean-Guy Schneider; Hai Jin; Steve Versteeg
The critical path of a composite Web application operating in volatile environments, i.e., the execution path in the service composition with the maximum execution time, should be prioritised in cost-effective monitoring as it determines the response time of the Web application. In volatile operating environments, the critical path of a Web application is probabilistic. As such, it is important to estimate the criticalities of the execution paths, i.e., the probabilities that they are critical, to decide which parts of the system to monitor. We propose a novel approach to the identification of Probabilistic Critical Path for Service-based Web Applications (PCP-SWA), which calculates the criticalities of different execution paths in the context of service composition. We evaluate PCP-SWA experimentally using an example Web application. Compared to random monitoring, PCP-SWA based monitoring is 55.67% more cost-effective on average.
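The notion of a probabilistic critical path can be illustrated with a small Monte Carlo sketch: sample each service's response time, see which execution path is longest in each sample, and report per-path frequencies. This conveys the concept only; it is not the PCP-SWA method, and the paths and time distributions are invented.

```python
# Monte Carlo estimate of path criticalities (illustrative sketch).
# Each sample draws service response times and records which path
# is critical (longest) under those times.
import random

def path_criticalities(paths, time_dist, samples=2000, seed=7):
    """paths: name -> list of services; time_dist: service -> (lo, hi)."""
    rng = random.Random(seed)
    wins = {p: 0 for p in paths}
    for _ in range(samples):
        t = {s: rng.uniform(*time_dist[s]) for s in time_dist}
        longest = max(paths, key=lambda p: sum(t[s] for s in paths[p]))
        wins[longest] += 1
    return {p: wins[p] / samples for p in paths}

paths = {"A": ["s1", "s2"], "B": ["s3"]}
dist = {"s1": (1, 2), "s2": (1, 2), "s3": (2, 5)}
crit = path_criticalities(paths, dist)  # path B is critical more often
```

Monitoring effort can then be allocated to the paths with the highest criticality, which is the cost-effectiveness argument the abstract makes.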
A statistical approach to URL-based web page clustering BIBAFull-Text 525-526
  Inma Hernández; Carlos R. Rivero; David Ruiz; Rafael Corchuelo
Most web page classifiers use features from the page content, which means that a page has to be downloaded to be classified. We propose a technique to cluster web pages by means of their URLs exclusively. In contrast to other proposals, we analyze features that lie outside the page; hence, we do not need to download a page to classify it. Also, our technique is unsupervised, requiring little intervention from the user. Furthermore, we do not need to extensively crawl a site to build a classifier for that site, but only a small subset of its pages. We have performed an experiment over 21 highly visited websites to evaluate the performance of our classifier, obtaining good precision and recall results.
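Clustering pages by URL alone often comes down to abstracting URLs into shared patterns; a simplified sketch is below. The tokenizer and wildcarding rule here are deliberate simplifications, not the paper's statistical method, and the example URLs are made up.

```python
# Group pages by a URL pattern (no page download needed): keep literal
# path tokens, wildcard numeric ones. Illustrative simplification only.
from urllib.parse import urlparse

def url_pattern(url):
    """Abstract a URL path into a pattern, e.g. /post/123 -> post/*."""
    tokens = [t for t in urlparse(url).path.split("/") if t]
    return "/".join("*" if t.isdigit() else t for t in tokens)

urls = ["http://ex.com/post/123", "http://ex.com/post/456",
        "http://ex.com/user/ann"]
groups = {}
for u in urls:
    groups.setdefault(url_pattern(u), []).append(u)
# The two /post/<id> pages fall into one cluster, the user page into another.
```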
Frequent temporal social behavior search in information networks BIBAFull-Text 527-528
  Hsun-Ping Hsieh; Cheng-Te Li; Shou-De Lin
In current social networking services (SNSs) such as Facebook, there are diverse kinds of interactions between entity types. One common activity of SNS users is to track and observe the representative social and temporal behaviors of other individuals. This inspires us to propose the new problem of Temporal Social Behavior Search (TSBS) over social interactions in an information network: given a structural query with associated temporal labels, how can we find the subgraph instances satisfying the query structure and temporal requirements? In TSBS, a query can be (a) a topological structure, (b) partially-assigned individuals on nodes, and/or (c) temporal sequential labels on edges. The TSBS method consists of two parts: offline mining and online matching. The former mines temporal subgraph patterns for retrieving representative structures that match the query. Then, based on the given query, we perform online structural matching on the mined patterns and return the top-k resulting subgraphs. Experiments on academic datasets demonstrate the effectiveness of TSBS.
TripRec: recommending trip routes from large scale check-in data BIBAFull-Text 529-530
  Hsun-Ping Hsieh; Cheng-Te Li; Shou-De Lin
With location-based services, such as Foursquare and Gowalla, users can easily perform check-in actions anywhere and anytime. Such check-in data not only enables personal geospatial journeys but also serves as a fine-grained source for trip planning. In this work, we aim to collectively recommend trip routes by leveraging large-scale check-in data and mining the moving behaviors of users. A novel recommendation system, TripRec, is proposed that allows users to specify starting/ending and must-go locations. It further provides the flexibility to satisfy a certain time constraint (i.e., the expected duration of the trip). By considering a sequence of check-in points as a route, we mine the frequent sequences with a ranking mechanism to achieve this goal. TripRec targets travelers who are unfamiliar with the target area/city and have time constraints on the trip.
Social status and role analysis of Palin's email network BIBAFull-Text 531-532
  Xia Hu; Huan Liu
Email usage is pervasive among people from different backgrounds, and email corpora can be an important data source for studying intricate social structures. Social status and role analysis on a personal email network can help reveal hidden information. The availability of Sarah Palin's email corpus presents a great opportunity to study social statuses and social roles in an email network. However, the email corpus does not readily lend itself to social network analysis due to problems such as noisy email data, scale in size, and temporal constraints. In this paper, we report an initial investigation of social status and role analysis on Sarah Palin's email corpus. In particular, we conduct a preliminary study of Palin's social statuses and roles. To the best of our knowledge, this work is the first exploration of Sarah Palin's email corpus, recently released by the state of Alaska.
The affects of task difficulty on medical searches BIBAFull-Text 533-534
  Anushia Inthiran; Saadat M. Alhashmi; Pervaiz K. Ahmed
In this paper, we analyze medical searching behavior performed by a typical medical searcher. We broadly classify a typical medical searcher as: non-medical professionals or medical professionals. We use behavioral signals to study how task difficulty affects medical searching behavior. Using simulated scenarios, we gathered data from an exploratory survey of 180 search sessions performed by 60 participants. Our research study provides a deep understanding of how task difficulty affects medical search behavior. Non-medical professionals and medical professionals demonstrate similar search behavior when searching on an easy task. Longer queries, more time and more incomplete search sessions are observed for an easy task. However, they demonstrate different results evaluation behavior based on task difficulty.
Leveraging interlingual classification to improve web search BIBAFull-Text 535-536
  Jagadeesh Jagarlamudi; Paul N. Bennett; Krysta M. Svore
In this paper we address the problem of improving the accuracy of web search in a smaller, data-limited search market (search language) using behavioral data from a larger, data-rich market (assist language). Specifically, we use interlingual classification to infer the search language query's intent using the assist language click-through data. We use these improved estimates of query intent, along with the query intent based on the search language data, to compute features that encode the similarity between a search result (URL) and the query. These features are subsequently fed into the ranking model to improve the relevance ranking of the documents. Our experimental results on German and French show the effectiveness of using assist language behavioral data, especially when the search language queries have little click-through data.
Modeling click-through based word-pairs for web search BIBAFull-Text 537-538
  Jagadeesh Jagarlamudi; Jianfeng Gao
Statistical translation models and latent semantic analysis (LSA) are two effective approaches to exploit click-through data for web search ranking. This paper presents two document ranking models that combine both approaches by explicitly modeling word-pairs. The first model, called PairModel, is a monolingual ranking model based on word pairs that are derived from click-through data. It maps queries and documents into a concept space spanned by these word pairs. The second model, called Bilingual Paired Topic Model (BPTM), uses bilingual word pairs and jointly models a bilingual query-document collection. This model maps queries and documents in multiple languages into a lower dimensional semantic subspace. Experimental results on web search task show that they significantly outperform the state-of-the-art baseline models, and the best result is obtained by interpolating PairModel and BPTM.
Google image swirl: a large-scale content-based image visualization system BIBAFull-Text 539-540
  Yushi Jing; Henry Rowley; Jingbin Wang; David Tsai; Chuck Rosenberg; Michele Covell
Web image retrieval systems, such as Google or Bing image search, present search results as a relevance-ordered list. Although alternative browsing models (e.g. results as clusters or hierarchies) have been proposed in the past, it remains to be seen whether such models can be applied to large-scale image search. This work presents Google Image Swirl, a large-scale, publicly available, hierarchical image browsing system that automatically groups search results based on visual and semantic similarity. This paper describes the methods used to build such a system and shares findings from two years' worth of user feedback and usage statistics.
Identifying sentiments over N-gram BIBAFull-Text 541-542
  Noriaki Kawamae
Our proposal, Identifying Sentiment over N-grams (ISN), detects subjective information by focusing on word order and phrases, as well as on the interdependency between a specific rating and the corresponding sentiment in a text.
StormRider: harnessing "storm" for social networks BIBAFull-Text 543-544
  Vaibhav V. Khadilkar; Murat Kantarcioglu; Bhavani Thuraisingham
The focus of online social media providers today has shifted from "content generation" towards finding effective methodologies for "content storage, retrieval and analysis" in the presence of evolving networks. Towards this end, in this paper we present StormRider, a framework that uses existing cloud computing and semantic web technologies to provide application programmers with automated support for these tasks, thereby allowing a richer assortment of use cases to be implemented on the underlying evolving social networks.
Learning from positive and unlabeled amazon reviews: towards identifying trustworthy reviewers BIBAFull-Text 545-546
  Marios Kokkodis
Online marketplaces have been growing in importance over the last few years. In such environments, reviews constitute the main reputation mechanism for the available products. Hence, presenting high-quality reviews is crucial for achieving a high level of customer satisfaction. In this direction, we introduce a new dimension of review quality, the reviewer's "trustfulness". We assume that voluntary information provided by Amazon reviewers, regarding whether they are the actual buyers of the product, signals the reliability of a review. Based on this information, we characterize a reviewer as trustworthy (positive instance) or of unknown "trustfulness" (unlabeled instance). Then, we build models that exploit reviewers' profile information and online behavior to rank them according to the probability of being trustworthy. Our results are very promising, since they provide evidence that our predictive models separate positive from unlabeled instances with very high accuracy.
Treehugger or petrolhead?: identifying bias by comparing online news articles with political speeches BIBAFull-Text 547-548
  Ralf Krestel; Alex Wall; Wolfgang Nejdl
The Web is a very democratic medium of communication allowing everyone to express his or her opinion about any type of topic. This multitude of voices makes it more and more important to detect bias and help Internet users understand the background of information sources. Political bias of Web sites, articles, or blog posts is hard to identify straightaway. Manual content analysis conducted by experts is the standard way in political and social science to detect this bias. In this paper we present an automated approach relying on methods from information retrieval and corpus statistics to identify biased vocabulary use. As an example, we analyzed 15 years of parliamentary speeches of the German Bundestag and we investigated whether there is bias towards a political party in major national online newspapers and magazines. The results show that bias exists with respect to vocabulary use and it coincides with human judgement.
Towards optimizing the non-functional service matchmaking time BIBAFull-Text 549-550
  Kyriakos Kritikos; Dimitris Plexousakis
The Internet is moving fast into a new era in which millions of services and things will be available. As there will be many functionally-equivalent services for a specific user task, the non-functional aspect of services should be considered for filtering and choosing the appropriate ones. Related approaches to service discovery mainly concentrate on exploiting constraint-solving techniques to infer whether the user's non-functional requirements are satisfied by the service's non-functional capabilities. However, as the matchmaking time is proportional to the number of non-functional service descriptions, these approaches fail to fulfill the user request in a timely manner. To this end, two alternative techniques for improving the non-functional service matchmaking time have been developed. The first is generic, as it can handle non-functional service specifications containing n-ary constraints, while the second is only applicable to unary-constrained specifications. Both techniques were experimentally evaluated. The preliminary evaluation results show that the service matchmaking time is significantly improved without compromising matchmaking accuracy.
Measuring usefulness of context for context-aware ranking BIBAFull-Text 551-552
  Andrey Kustarev; Yury Ustinovsky; Pavel Serduykov
Most major search engines develop various types of personalisation of search results. Personalisation includes deriving a user's long-term preferences, query disambiguation, etc. User sessions provide a very powerful tool commonly used for these problems. In this paper we focus on personalisation based on context-aware reranking. We implement a machine learning framework to approach this problem and study the importance of different types of features. We stress that features concerning the temporal and contextual relatedness of queries, along with features relying on the user's actions, are the most important and play a crucial role in this type of personalisation.
TEM: a novel perspective to modeling content on microblogs BIBAFull-Text 553-554
  Himabindu Lakkaraju; Hyung-Il Ahn
In recent times, microblogging sites like Facebook and Twitter have gained a lot of popularity. Millions of users worldwide have been using these sites to post content that interests them and to voice their opinions on current events. In this paper, we present a novel non-parametric probabilistic model, the Temporally driven Theme Event Model (TEM), for analyzing the content on microblogs. We also describe an online inference procedure for this model that enables its usage on large-scale data. Experimentation carried out on real-world data extracted from Facebook and Twitter demonstrates the efficacy of the proposed approach.
Using proximity to predict activity in social networks BIBAFull-Text 555-556
  Kristina Lerman; Suradej Intagorn; Jeon-Hyung Kang; Rumi Ghosh
The structure of a social network contains information useful for predicting its evolution. We show that structural information also helps predict activity. People who are "close" in some sense in a social network are more likely to perform similar actions than more distant people. We use network proximity to capture the degree to which people are "close" to each other. In addition to standard proximity metrics used in the link prediction task, such as neighborhood overlap, we introduce new metrics that model different types of interactions that take place between people. We study this claim empirically using data about URL forwarding activity on the social media sites Digg and Twitter. We show that structural proximity of two users in the follower graph is related to similarity of their activity, i.e., how many URLs they both forward. We also show that given friends' activity, knowing their proximity to the user can help better predict which URLs the user will forward. We compare the performance of different proximity metrics on the activity prediction task and find that metrics that take into account the attention-limited nature of interactions in social media lead to substantially better predictions.
Finding influential seed successors in social networks BIBAFull-Text 557-558
  Cheng-Te Li; Hsun-Ping Hsieh; Shou-De Lin; Man-Kwan Shan
In a dynamic social network, nodes can be removed from the network for various reasons, which consequently affects the behavior of the network. In this paper, we tackle the challenge of finding a successor node for each removed seed node to maintain the influence spread in the network: given a social network and a set of seed nodes for influence maximization, who are the best successors to take over the job of initial influence propagation if some seeds are removed from the network? To tackle this problem, we present and discuss five neighborhood-based selection heuristics: degree, degree discount, overlapping, community bridge, and community degree. Experiments on a DBLP co-authorship network show the effectiveness of the devised heuristics.
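For context, the influence spread that such seed-selection heuristics try to preserve is commonly estimated by Monte-Carlo simulation of the independent cascade model. A minimal sketch follows; the toy graph, propagation probability, and function names are illustrative, not taken from the paper:

```python
import random

def ic_spread(adj, seeds, p=0.1, runs=200, seed=0):
    """Monte-Carlo estimate of expected influence spread under the standard
    independent cascade model: each newly activated node gets one chance to
    activate each inactive out-neighbour with probability p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)  # spread of this run, seeds included
    return total / runs

line = {i: [i + 1] for i in range(10)}  # directed path 0 -> 1 -> ... -> 10
print(ic_spread(line, seeds=[0], p=0.5))  # expected spread is about 2
```

A successor heuristic can then be compared by recomputing `ic_spread` with each candidate replacement seed and keeping the candidate with the highest estimate.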
Influence propagation and maximization for heterogeneous social networks BIBAFull-Text 559-560
  Cheng-Te Li; Shou-De Lin; Man-Kwan Shan
Influence propagation and maximization is a well-studied problem in social network mining. However, most previous works focus only on homogeneous social networks, where nodes and links are of a single type. This work aims at defining information propagation for heterogeneous social networks (containing multiple types of nodes and links). We propose to consider the individual behaviors of persons to model the influence propagation. Person nodes possess different influence probabilities for activating their friends according to their interaction behaviors. The proposed model consists of two stages. First, based on the heterogeneous social network, we create a human-based influence graph whose nodes are of human type and whose links carry weights that represent how special the target node is to the source node. Second, we propose two entropy-based heuristics to identify the disseminators in the influence graph that maximize the influence spread. Experimental results for the proposed method are promising.
Dynamic selection of activation targets to boost the influence spread in social networks BIBAFull-Text 561-562
  Cheng-Te Li; Man-Kwan Shan; Shou-De Lin
This paper aims to combine viral marketing with the idea of direct selling for influence maximization in a social network. In direct selling, producers can sell products directly to consumers without going through a cascade of wholesalers, which makes it possible to sell products in a more efficient and economical manner. Motivated by this idea, we propose a target-selecting independent cascade (TIC) model, in which, during influence propagation, each active node can give up attempting to influence some neighboring nodes, termed victims, who could be hard to affect, and instead try to activate some of its friends of friends, termed destinations, who could have higher potential to increase the influence spread. The question, then, is: given a social network and a set of seeds for influence propagation under the TIC model, how should targets (i.e., victims and destinations) be selected during propagation to boost the influence spread? We propose and evaluate three heuristics for the target selection. Experiments show that selecting targets based on the influence probability between nodes yields the highest boost of influence spread.
Regional subgraph discovery in social networks BIBAFull-Text 563-564
  Cheng-Te Li; Man-Kwan Shan; Shou-De Lin
This paper solves a region-based subgraph discovery problem. We are given a social network and some sample nodes that are supposed to belong to a specific region, and the goal is to obtain a subgraph that contains the sampled nodes together with other nodes in the same region. Such regional subgraph discovery can benefit region-based applications, including scholar search, friend suggestion, and viral marketing. To deal with this problem, we assume there is a hidden backbone connecting the query nodes directly or indirectly in their region. The idea is that individuals belonging to the same region tend to share similar interests and cultures. By modeling this fact through edge weights, we search the graph to extract the regional backbone with respect to the query nodes. Then we expand the backbone to derive the regional network. Experiments on a DBLP co-authorship network show that the proposed method can effectively discover the regional subgraph with high precision.
GPU-based minwise hashing BIBAFull-Text 565-566
  Ping Li; Anshumali Shrivastava; Christian A. Konig
Minwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise hashing and b-bit minwise hashing require an expensive preprocessing step for applying k (e.g., k=500) permutations on the entire data in order to compute k minimal values as the hashed data. In this paper, we developed a parallelization scheme using GPUs, which reduced the processing time by a factor of 20-80. Reducing the preprocessing time is highly beneficial in practice, for example, for duplicate web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers (when the test data are not preprocessed).
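As a sketch of the underlying technique, the following pure-Python code computes k-permutation minhash signatures and estimates Jaccard similarity from them; the random linear hash functions, parameter values, and sets are illustrative stand-ins, not the paper's GPU implementation:

```python
import random

def minhash_signature(s, k, seed=0):
    """k-permutation minwise-hashing signature of a set of ints.
    Random linear hash functions h(x) = (a*x + b) mod p stand in for
    true random permutations; p is a prime larger than the universe."""
    rng = random.Random(seed)
    p = 2_147_483_647  # Mersenne prime, assumed > any element
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [min((a * x + b) % p for x in s) for a, b in coeffs]

def estimate_jaccard(sig1, sig2):
    """The fraction of matching minima estimates Jaccard similarity."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

def truncate(sig, b=2):
    """b-bit variant: store only the lowest b bits of each minimum."""
    return [m & ((1 << b) - 1) for m in sig]

A, B = set(range(0, 800)), set(range(200, 1000))  # true Jaccard = 0.6
sa, sb = minhash_signature(A, 500), minhash_signature(B, 500)
print(round(estimate_jaccard(sa, sb), 2))  # close to 0.6
```

The loop over k hash functions for every element is the expensive preprocessing step that the paper parallelizes on GPUs.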
CloudSpeller: query spelling correction by using a unified hidden Markov model with web-scale resources BIBAFull-Text 567-568
  Yanen Li; Huizhong Duan; ChengXiang Zhai
Query spelling correction is an important component of modern search engines that can help users express an information need more accurately and thus improve search quality. In this work we propose and implement an end-to-end spelling correction system, CloudSpeller. The CloudSpeller system uses a hidden Markov model to effectively model the major types of spelling errors in a unified framework, integrating a large-scale lexicon constructed from Wikipedia, an error model trained from high-confidence correction pairs, and the Microsoft Web N-gram service. Our system achieves excellent performance on two search query spelling correction datasets, reaching F1 scores of 0.960 and 0.937 on the TREC and MSN datasets respectively.
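The core idea behind such correctors is a noisy-channel decomposition: score each candidate correction by an error-model (channel) probability times a language-model probability. A deliberately toy sketch follows; the tiny lexicon, the probabilities, and the use of string similarity as the channel model are illustrative stand-ins, not CloudSpeller's actual components:

```python
import difflib

LEXICON = {"spelling": 0.4, "speller": 0.3, "spline": 0.3}  # toy unigram LM

def correct(word):
    """Toy noisy-channel corrector: rank lexicon words by a channel score
    (string similarity as a stand-in for a trained error model) times a
    unigram language-model probability. An HMM-based system generalises
    this to whole queries, handling word splits and merges as well."""
    scored = []
    for cand, p_lm in LEXICON.items():
        p_channel = difflib.SequenceMatcher(None, word, cand).ratio()
        scored.append((p_channel * p_lm, cand))
    return max(scored)[1]  # candidate with the highest combined score

print(correct("speling"))  # -> spelling
```

In a full system the unigram language model would be replaced by a web-scale n-gram model and the channel score by a trained edit model.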
Sentiment classification via integrating multiple feature presentations BIBAFull-Text 569-570
  Yuming Lin; Jingwei Zhang; Xiaoling Wang; Aoying Zhou
In the bag-of-words framework, documents are converted into vectors according to predefined features together with weighting mechanisms. Since each feature presentation has its own characteristics, it is difficult to determine which one should be chosen for a specific domain, especially for users who are not familiar with the domain. This paper explores the integration of various feature presentations to improve classification accuracy. A general two-phase framework is proposed. In the first phase, we train multiple base classifiers with various vector spaces and use these classifiers to predict the class of test samples. In the second phase, the predicted results are integrated into the final class via stacking with an SVM. The experimental results demonstrate the effectiveness of our method.
Tuning parameters of the expected reciprocal rank BIBAFull-Text 571-572
  Yury Logachev; Lidia Grauer; Pavel Serdyukov
Several popular IR metrics are based on an underlying user model, and most of them are parameterized. Usually the parameters of these metrics are chosen on the basis of general considerations and are not validated by experiments with real users. In particular, the parameters of the Expected Reciprocal Rank (ERR) measure are the normalized parameters of the DCG metric, and the latter are chosen in an ad-hoc manner. We suggest two approaches for adjusting the parameters of the ERR model by analyzing real user behaviour: one based on a controlled experiment and another relying on search log analysis. We show that our approaches generate parameters that are largely different from the commonly used parameters of the ERR model.
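For reference, the standard ERR computation looks like the following sketch; the grade-to-probability mapping R_i = (2^g_i - 1) / 2^g_max is exactly the parameterization that tuning approaches like this one revisit (the example grades are illustrative):

```python
def err(grades, g_max=4):
    """Expected Reciprocal Rank: a user scans the ranking top-down and
    stops at rank r with probability R_r, contributing 1/r to the score.
    R_i = (2**g_i - 1) / 2**g_max maps a relevance grade to a stop
    probability; these R_i are the commonly used, tunable parameters."""
    score, p_continue = 0.0, 1.0
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / 2 ** g_max
        score += p_continue * r / rank
        p_continue *= 1 - r
    return score

print(round(err([4, 0, 1]), 4))  # -> 0.9388
```

Replacing the mapping from grades to stop probabilities with values estimated from real user behaviour, as the abstract proposes, leaves the rest of the computation unchanged.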
Conversations reconstruction in the social web BIBAFull-Text 573-574
  Juan Antonio Lossio Ventura; Hakim Hacid; Arnaud Ansiaux; Maria Laura Maag
We propose a socio-semantic approach for building conversations from social interactions following three steps: (i) content linkage, (ii) participants (users) linkage, and (iii) temporal linkage. Preliminary evaluations on a Twitter dataset show promising and interesting results.
Secure querying of recursive XML views: a standard xpath-based technique BIBAFull-Text 575-576
  Houari Mahfoud; Abdessamad Imine
Most state-of-the-art approaches for securing XML documents allow users to access data only through authorized views defined by annotating an XML grammar (e.g. a DTD) with a collection of XPath expressions. To prevent improper disclosure of confidential information, user queries posed on these views need to be rewritten into equivalent queries on the underlying documents, which avoids the overhead of view materialization and maintenance. A major concern here is that XPath query rewriting for recursive XML views is still an open problem. To overcome this, some authors have proposed rewriting approaches based on the non-standard language "Regular XPath", which is more expressive than XPath and makes rewriting possible under recursion. However, query rewriting under Regular XPath can be of exponential size, as it relies on an automaton model. Most importantly, Regular XPath remains a theoretical achievement: it is not commonly used in practice, as translation and evaluation tools are not available. In this work, we show that query rewriting is always possible for recursive XML views using only the expressive power of standard XPath. We propose a general approach for securely querying XML data under arbitrary security views (recursive or not) for a significant fragment of XPath, and we provide a linear rewriting algorithm that is efficient and scales well.
GoThere: travel suggestions using geotagged photos BIBAFull-Text 577-578
  Abdul Majid; Ling Chen; Gencai Chen; Hamid Turab Mirza; Ibrar Hussain
We propose a context and preference aware travel guide that suggests significant tourist destinations to users based on their preferences and current surrounding context using contextualized user-generated contents from the social media repository, i.e., Flickr.
Ad-hoc ride sharing application using continuous SPARQL queries BIBAFull-Text 579-580
  Debnath Mukherjee; Snehasis Banerjee; Prateep Misra
In the existing ride-sharing scenario, the ride taker has to cope with uncertainties, since the ride giver may be delayed or may not show up due to some exigency. A solution to this problem is discussed in this paper. The solution framework is based on gathering information from multiple streams, such as traffic status on the ride giver's routes and the ride giver's GPS coordinates. It also maintains a list of alternative ride givers so as to almost guarantee a ride for the ride taker. The solution uses a SPARQL-based continuous query framework that is capable of sensing fast-changing real-time situations, and it has reasoning capabilities for handling the ride taker's preferences. The paper introduces the concept of user-managed windows, which are shown to be required for this solution. Finally, we show that the performance of the application is enhanced by designing it with short incremental queries.
Sparse linear methods with side information for Top-N recommendations BIBAFull-Text 581-582
  Xia Ning; George Karypis
This paper focuses on developing effective algorithms that utilize side information for top-N recommender systems. A set of Sparse Linear Methods with Side information (SSLIM) is proposed, that utilize a regularized optimization process to learn a sparse item-to-item coefficient matrix based on historical user-item purchase profiles and side information associated with the items. This coefficient matrix is used within an item-based recommendation framework to generate a size-N ranked list of items for a user. Our experimental results demonstrate that SSLIM outperforms other methods in effectively utilizing side information and achieving performance improvement.
Sentiment analysis amidst ambiguities in YouTube comments on Yoruba language (Nollywood) movies BIBAFull-Text 583-584
  Sylvester Olubolu Orimaye; Saadat M. Alhashmi; Siew Eu-gene
Nollywood is the second largest movie industry in the world in terms of annual movie production. A dominant number of the movies are in the Yoruba language, spoken by over 20 million people across the globe. The number of Yoruba language movies uploaded to YouTube, and of their corresponding comments, is growing exponentially. However, YouTube comments made by native speakers on Yoruba movies combine English, Yoruba, and other commonly used "pidgin" Yoruba words. Since Yoruba is still a resource-constrained language, existing sentiment or subjectivity analysis algorithms perform poorly on YouTube comments made on Yoruba language movies because of these language ambiguities. In this work, we present an automatic sentiment analysis algorithm for YouTube comments on Yoruba language movies. The algorithm uses the SentiWordNet thesaurus and a lexicon of commonly used Yoruba sentiment words and phrases. In terms of precision and recall, the algorithm outperforms a state-of-the-art sentiment analysis technique by up to 20%.
C4PS: colors for privacy settings BIBAFull-Text 585-586
  Thomas Paul; Martin Stopczynski; Daniel Puscher; Melanie Volkamer; Thorsten Strufe
The ever increasing popularity of Facebook and other Online Social Networks has left a wealth of personal and private data on the web, aggregated and readily accessible for broad and automatic retrieval. Protection from both undesired recipients and harvesting by crawlers is implemented by access control, manually configured by the user and owner of the data. Several studies demonstrate that default settings cause an unnoticed over-sharing and that users have trouble understanding and configuring adequate privacy settings. We developed an improved interface for privacy settings in Facebook by mainly applying color coding for different groups, providing easy access to the privacy settings, and applying the principle of common practices. Using a lab study, we show that the new approach increases the usability significantly.
Extracting advertising keywords from URL strings BIBAFull-Text 587-588
  Santosh Raju; Raghavendra Udupa
Extracting advertising keywords from web-pages is important in keyword-based online advertising. Previous works have attempted to extract advertising keywords from the whole content of a web-page. However, in some scenarios, it is necessary to extract keywords from just the URL string itself. In this work, we propose an algorithm for extracting advertising keywords from the URL string alone. Our algorithm has applications in contextual and paid search advertising. We evaluate the effectiveness of our algorithm on publisher URLs and show that it produces very good quality keywords that are comparable with keywords produced by page based extractors.
Instrumenting a logic programming language to gather provenance from an information extraction application BIBAFull-Text 589-590
  Christine F. Reilly; Yueh-Hsuan Chiang; Jeffrey F. Naughton
Information extraction (IE) programs for the web consume and produce a lot of data. In order to better understand the program output, the developer and user often desire to know the details of how the output was created. Provenance can be used to learn about the creation of the output. We collect fine-grained provenance by leveraging ongoing work in the IE community to write IE programs in a logic programming language. The logic programming language exposes the semantics of the program, allowing us to gather fine-grained provenance during program execution. We discuss a case study using a web-based community information management system, then present results regarding the performance of queries over the provenance data gathered by our logic program interpreter. Our findings show that it is possible to gather useful fine-grained provenance during the execution of a logic based web information extraction program. Additionally, queries over this provenance information can be performed in a reasonable amount of time.
Lexical quality as a proxy for web text understandability BIBAFull-Text 591-592
  Luz Rello; Ricardo Baeza-Yates
We show that a recently introduced lexical quality measure is also valid for measuring textual Web accessibility. Our measure estimates the lexical quality of a site based on the occurrence, in English Web pages, of a large set of words with errors. We first compute the correlation of our measure with Web popularity measures to show that it gives independent information. Second, we carry out a user study using eye tracking to show that the lexical quality of a text is related to its understandability, one of the factors behind Web accessibility.
Latent contextual indexing of annotated documents BIBAFull-Text 593-594
  Christian Sengstock; Michael Gertz
In this paper we propose a simple and flexible framework to index context-annotated documents, e.g., documents with timestamps or georeferences, by contextual topics. A contextual topic is a distribution over document features with a particular meaning in the context domain, such as a repetitive event or a geographic phenomenon. Such a framework supports document clustering, labeling, and search, with respect to contextual knowledge contained in the document collection. To realize the framework, we introduce an approach to project documents into a context-feature space. Then, dimensionality reduction is used to extract contextual topics in this context-feature space. The topics can then be projected back onto the documents. We demonstrate the utility of our approach with a case study on georeferenced Wikipedia articles.
APOLLO: a general framework for populating ontology with named entities via random walks on graphs BIBAFull-Text 595-596
  Wei Shen; Jianyong Wang; Ping Luo; Min Wang
Automatically populating an ontology with named entities extracted from unstructured text has become a key issue for the Semantic Web. This issue naturally consists of two subtasks: (1) for an entity mention whose mapping entity does not exist in the ontology, attach it to the right category in the ontology (i.e., fine-grained named entity classification), and (2) for an entity mention whose mapping entity is contained in the ontology, link it with its mapping real-world entity in the ontology (i.e., entity linking). Previous studies focus on only one of the two subtasks. This paper proposes APOLLO, a general weakly supervised frAmework for POpuLating ontoLOgy with named entities. APOLLO leverages the rich semantic knowledge embedded in Wikipedia to resolve this task via random walks on graphs. An experimental study has been conducted to show the effectiveness of APOLLO.
Multiple spreaders affect the indirect influence on Twitter BIBAFull-Text 597-598
  Xin Shuai; Ying Ding; Jerome Busemeyer
Most studies on social influence have focused on direct influence, but an interesting question is whether indirect influence exists between two users who are not directly connected in the network, and what affects such influence. The theory of complex contagion suggests that more spreaders will enhance the indirect influence between two users. Our observation of the intensity of indirect influence, propagated by n parallel spreaders and quantified by retweeting probability on Twitter, shows that complex contagion is validated globally but violated locally. In other words, the retweeting probability increases non-monotonically, with some local drops.
Entity based translation language model BIBAFull-Text 599-600
  Amit Singh
Bridging the lexical gap between the user's question and the question-answer pairs in Q&A archives has been a major challenge for Q&A retrieval. State-of-the-art approaches address this issue by implicitly expanding queries with additional words using statistical translation models. In this work we extend the lexical word-based translation model to incorporate semantic concepts. We explore strategies to learn the translation probabilities between words and concepts using Q&A archives and Wikipedia. Experiments conducted on large-scale real data from Yahoo! Answers show that the proposed techniques are promising and merit further investigation.
Enabling accent resilient speech based information retrieval BIBAFull-Text 601-602
  Koushik Sinha; Geetha Manjunath; Raveesh R. Sharma; Viswanath Gangavaram; Pooja A; Deepak R. Murugaian
Voice interfaces to browsers and mobile applications are becoming popular as typing with touch screens is cumbersome. The main issue of practical speech based interfaces is how to overcome speech recognition errors. This problem is more severe when the users are non-native speakers of English due to differences in pronunciations. In this paper, we describe a novel, intelligent speech interface design approach for IR tasks that is significantly robust to accent variations. Our solution uses phonemic similarity based word spreading and semantic information based filtering to boost the accuracy of any ASR. We evaluated our solution with Google Voice as the ASR for a web question-answering system developed in-house and the results are very encouraging.
Dynamical information retrieval modelling: a portfolio-armed bandit machine approach BIBAFull-Text 603-604
  Marc Sloan; Jun Wang
The dynamic nature of document relevance is largely ignored by traditional Information Retrieval (IR) models, which assume that relevance scores for documents given an information need are static. In this paper, we formulate a general Dynamical Information Retrieval problem, in which we consider retrieval as a stochastic, controllable process. The ranking action continuously controls the retrieval system's dynamics, and an optimal ranking policy is found that maximizes overall user satisfaction during each period. By deriving the posterior probability of a document's evolving relevance from user clicks, we provide a plug-in framework for incorporating a number of click models, which can be combined with multi-armed bandit theory and the Portfolio Theory of IR to create a dynamic ranking rule that takes rank bias and click dependency into account. We verify the versatility of our algorithms in a number of experiments and demonstrate significant performance gains over strong baselines.
Detecting dynamic association among Twitter topics BIBAFull-Text 605-606
  Shuangyong Song; Qiudan Li; Hongyun Bao
Over the last few years, Twitter has increasingly become an important source of up-to-date topics about what is happening in the world. In this paper, we propose a dynamic topic association detection model to discover relations between Twitter topics, by which users can gain insight into richer information about topics of interest. The proposed model utilizes a time-constrained method to extract event-based spatio-temporal topic associations, and constructs a dynamic temporal map to represent the obtained result. Experimental results show the improvement of the proposed model over a static spatio-temporal method and a co-occurrence method.
Using community information to improve the precision of link prediction methods BIBAFull-Text 607-608
  Sucheta Soundarajan; John Hopcroft
Because network data is often incomplete, researchers consider the link prediction problem, which asks which non-existent edges in an incomplete network are most likely to exist in the complete network. Classical approaches compute the 'similarity' of two nodes, and conclude that highly similar nodes are most likely to be connected in the complete network. Here, we consider several such similarity-based measures, but supplement the similarity calculations with community information. We show that for many networks, the inclusion of community information improves the accuracy of similarity-based link prediction methods.
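Two of the classical similarity measures such methods start from can be sketched in a few lines; the tiny example graph and function names are illustrative, and a community-aware variant would additionally weight pairs that share a community label (this sketch shows only the baseline measures):

```python
def common_neighbors(adj, u, v):
    """Count neighbors shared by u and v in an adjacency-set graph."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Jaccard coefficient of the two neighbor sets."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# Tiny undirected example graph as adjacency sets.
adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c'},
    'c': {'a', 'b', 'd'},
    'd': {'a', 'c'},
}
print(common_neighbors(adj, 'b', 'd'))   # -> 2 ('a' and 'c' are shared)
print(round(jaccard(adj, 'b', 'd'), 2))  # -> 1.0
```

Link prediction then ranks all non-adjacent pairs by such a score and predicts the top-ranked pairs as missing edges.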
Open and decentralized platform for visualizing web mash-ups in augmented and mirror worlds BIBAFull-Text 609-610
  Vlad Stirbu; David Murphy; Yu You
Augmented reality applications are gaining popularity due to increased capabilities of modern mobile devices. However, existing applications are tightly integrated with backend services that expose content using proprietary interfaces. We demonstrate an architecture that allows visualization of web content in augmented and mirror world applications, based on open web protocols and formats. We describe two clients, one for creating virtual artifacts, web resources that bind together web content with location and a 3D model, and one that visualizes the virtual artifacts in the mirror world.
Actualization of query suggestions using query logs BIBAFull-Text 611-612
  Alisa Strizhevskaya; Alexey Baytin; Irina Galinskaya; Pavel Serdyukov
In this work we study actualization techniques for building an up-to-date query suggestion model from query logs. The performance of the proposed actualization algorithms was estimated on the real query flow of the Yandex search engine.
Query spelling correction using multi-task learning BIBAFull-Text 613-614
  Xu Sun; Anshumali Shrivastava; Ping Li
This paper explores the use of online multi-task learning for search query spelling correction, effectively transferring information across different and biased training datasets to improve spelling correction on each of them. Experiments were conducted on three query spelling correction datasets, including the well-known TREC benchmark data. Our experimental results demonstrate that the proposed method considerably outperforms existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. In contrast to existing methods, which typically require many training passes (e.g., more than 50), our algorithm can very closely approach the empirical optimum in around five passes.
Incorporating seasonal time series analysis with search behavior information in sales forecasting BIBAFull-Text 615-616
  Yuchen Tian; Yiqun Liu; Danqing Xu; Ting Yao; Min Zhang; Shaoping Ma
We consider the problem of predicting monthly auto sales in mainland China. First, we design an algorithm that uses click-through and query reformulation information to cluster related queries and count their frequencies on a monthly basis. By introducing an Exponentially Weighted Moving Average (EWMA) model, we measure the seasonal impact on the sales trend. The two features are combined using linear regression. Experiments show that our model achieves high accuracy and outperforms conventional forecasting models.
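The seasonal-smoothing step can be sketched with the basic EWMA recurrence s_t = alpha * x_t + (1 - alpha) * s_{t-1}; the sales figures and alpha below are illustrative, not the paper's data or parameters:

```python
def ewma(series, alpha=0.3):
    """Exponentially Weighted Moving Average:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}, seeded with s_0 = x_0."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical monthly sales figures; alpha=0.5 is an illustrative setting
sales = [100, 120, 90, 110]
trend = ewma(sales, alpha=0.5)
```

The smoothed series damps month-to-month noise, leaving a trend that can be fed, alongside query frequencies, into the linear regression the abstract mentions.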
Photo-TaPE: user privacy preferences in photo tagging BIBAFull-Text 617-618
  Vincent Toubiana; Vincent Verdot; Benoit Christophe; Mathieu Boussard
Although users routinely expose pictures on the Web, they may not want a link between their identity and pictures that they cannot modify or control access to. Photo tagging -- and more broadly face-recognition algorithms -- often escapes the users' control and creates links between private situations and their public profile. To address this issue, we designed a geo-location aided system that lets users declare their tagging preferences directly when a picture is taken. We present Photo-Tagging Preference Enforcement (Photo-TaPE), a system enforcing users' tagging preferences without revealing their identity. By improving face-recognition efficiency, Photo-TaPE can guarantee user tagging preferences in 67% of cases and significantly reduces the processing time of face-recognition algorithms.
Understanding human movement semantics: a point of interest based approach BIBAFull-Text 619-620
  Ionut Trestian; Kévin Huguenin; Ling Su; Aleksandar Kuzmanovic
The recent availability of human mobility traces has driven a new wave of research on human movement, with straightforward applications in wireless/cellular networks. In this paper we revisit the human mobility problem with new assumptions. We believe that human movement is not independent of the surrounding locations, i.e., the points of interest that people visit; most of the time people travel with specific goals in mind, visit specific points of interest, and frequently revisit favorite places. Using GPS mobility traces of a large number of users across two distinct geographical locations, we study the correlation between people's trajectories and the differently spread points of interest nearby.
Scalable multi stage clustering of tagged micro-messages BIBAFull-Text 621-622
  Oren Tsur; Adi Littman; Ari Rappoport
The growing popularity of microblogging, backed by services like Twitter, Facebook, Google+ and LinkedIn, raises the challenge of clustering short and extremely sparse documents. In this work we propose SMSC -- a scalable, accurate and efficient multi-stage clustering algorithm. Our algorithm leverages users' practice of adding tags to some messages by bootstrapping over virtual non-sparse documents. We experiment on a large corpus of tweets from Twitter, and evaluate results against a gold-standard classification validated by seven clustering evaluation measures (information-theoretic, paired and greedy). Results show that the algorithm presented is both accurate and efficient, significantly outperforming other algorithms. Under reasonable practical assumptions, our algorithm scales sublinearly in time.
Seeing the best and worst of everything on the web with a two-level, feature-rich affect lexicon BIBAFull-Text 623-624
  Tony Veale
Affect lexica are useful for sentiment analysis because they map words (or senses) onto sentiment ratings. However, few lexica explain their ratings, or provide sufficient feature richness to allow a selective "spin" to be placed on a word in context. Since an affect lexicon aims to capture the affect of a word or sense in its most stereotypical usage, it should be grounded in explicit stereotype representations of each word's most salient properties and behaviors. We show here how to acquire a large stereotype lexicon from Web content, and further show how to determine sentiment ratings for each entry in the lexicon, both at the level of properties and behaviors and at the level of stereotypes. Finally, we show how the properties of a stereotype can be segregated on demand, to place a positive or negative spin on a word in context.
Unified classification model for geotagging websites BIBAFull-Text 625-626
  Alexey Volkov; Pavel Serdyukov
The paper presents a novel approach to finding regional scopes (geotagging) of websites. It relies on a single binary classification model per region type to perform the multi-class classification and uses a variety of features of different natures that have not yet been used together for machine-learning-based regional classification of websites. The evaluation demonstrates the advantage of our "one model per region type" method over the traditional "one model per region" approach.
Selling futures online advertising slots via option contracts BIBAFull-Text 627-628
  Jun Wang; Bowei Chen
Many online advertising slots are sold through bidding mechanisms by publishers and search engines. Highly affected by the dual forces of supply and demand, the prices of advertising slots vary significantly over time. This in turn influences businesses whose major revenues are driven by online advertising, particularly publishers and search engines. To address the problem, we propose to sell future advertising slots via option contracts (also called ad options). An ad option gives its buyer the right to buy a future advertising slot at a pre-specified price. The pricing model of ad options is developed in order to reduce the volatility of the income of publishers or search engines. Our experimental results confirm the validity of ad options and the embedded risk management mechanisms.
Model news relatedness through user comments BIBAFull-Text 629-630
  Xuanhui Wang; Jian Bian; Yi Chang; Belle Tseng
Most previous work on news relatedness focuses on news article texts. In this paper, we study the benefit of user-generated comments for modeling news relatedness. Comments contain rich text information which is provided by commenters and rated by readers with thumbs-up or thumbs-down, but the quality of individual comments varies widely. We compare different ways of capturing relatedness by leveraging both text and user interaction information in comments. Our evaluation based on an editorial data set demonstrates that the text information in comments is very effective for modeling relatedness, while community ratings are quite predictive of comment quality.
A data-driven sketch of Wikipedia editors BIBAFull-Text 631-632
  Robert West; Ingmar Weber; Carlos Castillo
Who edits Wikipedia? We attempt to shed light on this question by using aggregated log data from Yahoo!'s browser toolbar in order to analyze Wikipedians' editing behavior in the context of their online lives beyond Wikipedia. We broadly characterize editors by investigating how their online behavior differs from that of other users; e.g., we find that Wikipedia editors search more, read more news, play more games, and, perhaps surprisingly, are more immersed in pop culture. Then we inspect how editors' general interests relate to the articles to which they contribute; e.g., we confirm the intuition that editors show more expertise in their active domains than average users. Our results are relevant as they illuminate novel aspects of what has become many Web users' prevalent source of information and can help in recruiting new editors.
A framework to represent and mine knowledge evolution from Wikipedia revisions BIBAFull-Text 633-634
  Xian Wu; Wei Fan; Meilun Sheng; Li Zhang; Xiaoxiao Shi; Zhong Su; Yong Yu
State-of-the-art knowledge representation on the semantic web employs a triple format (subject-relation-object). The limitation is that it can only represent static information; it cannot easily encode revisions of semantic web data and knowledge evolution. In reality, knowledge does not stay still but evolves over time. In this paper, we first introduce the concept of a "quintuple representation" by adding two new fields, state and time, where state has two values, either in or out, to denote that the referred knowledge takes effect or becomes expired at the given time. We then discuss a two-step statistical framework to mine knowledge evolution into the proposed quintuple representation. Utilized properly, the extracted quintuples can not only reveal knowledge changing history but also detect expired information. We evaluate the proposed framework on Wikipedia revisions, as well as on common web pages currently not in semantic web format.
Review spam detection via time series pattern discovery BIBAFull-Text 635-636
  Sihong Xie; Guan Wang; Shuyang Lin; Philip S. Yu
Online reviews play a crucial role in today's electronic commerce. Due to the pervasive spam reviews, customers can be misled to buy low-quality products, while decent stores can be defamed by malicious reviews. We observe that, in reality, a great portion (> 90% in the data we study) of the reviewers write only one review (singleton review). These reviews are so enormous in number that they can almost determine a store's rating and impression. However, existing methods ignore these reviewers. To address this problem, we observe that the normal reviewers' arrival pattern is stable and uncorrelated to their rating pattern temporally. In contrast, spam attacks are usually bursty and either positively or negatively correlated to the rating. Thus, we propose to detect such attacks via unusually correlated temporal patterns. We identify and construct multidimensional time series based on aggregate statistics, in order to depict and mine such correlation. Experimental results show that the proposed method is effective in detecting singleton review attacks. We discover that singleton review is a significant source of spam reviews and largely affects the ratings of online stores.
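The core signal -- a burst of review arrivals temporally correlated with the rating -- can be sketched by computing the Pearson correlation between a store's weekly review counts and mean ratings; the data, window size, and threshold below are invented for illustration, not the paper's method in full:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical weekly series for one store: a review burst in weeks 4-5
# coincides with a jump in mean rating, the pattern the paper flags.
counts  = [3, 2, 4, 30, 28, 3]
ratings = [3.1, 3.0, 3.2, 4.9, 4.8, 3.1]

r = pearson(counts, ratings)
suspicious = abs(r) > 0.8  # illustrative threshold, not the paper's
```

A normal store would show a stable arrival pattern uncorrelated with its ratings (|r| near 0); the correlated burst above is the anomaly the multidimensional time series is mined for.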
Combining classification with clustering for web person disambiguation BIBAFull-Text 637-638
  Jian Xu; Qin Lu; Zhengzhong Liu
Web Person Disambiguation is often conducted by clustering web documents to identify different namesakes for a given name. This paper presents a new key-phrase-based clustering method combined with a second-step re-classification that identifies outliers to improve clustering performance. For document clustering, the hierarchical agglomerative approach is conducted on a vector space model which uses key phrases as the main feature. Outliers in the clustering results are then identified through a centroid-based method. The outliers are then reclassified by an SVM classifier into more appropriate clusters, using a key-phrase-based string kernel model as its feature space. The re-classification uses the clustering result of the first step as its training data, avoiding the separate training data required by most classification algorithms. Experiments conducted on the WePS-2 dataset show that the algorithm based on key phrases is effective in improving WPD performance.
Exploiting various implicit feedback for collaborative filtering BIBAFull-Text 639-640
  Byoungju Yang; Sangkeun Lee; Sungchan Park; Sang-goo Lee
So far, many researchers have worked on recommender systems using users' implicit feedback, since it is difficult to collect explicit item preferences in most applications. Existing research generally uses a pseudo-rating matrix built by adding up the number of item consumptions; however, this naive approach may not capture user preferences correctly, in that many other important user activities are ignored. In this paper, we show that users' diverse implicit feedback can significantly improve recommendation accuracy. We classify various user behaviors (e.g., search item, skip, add to playlist, etc.) into positive or negative feedback groups and construct a more accurate pseudo-rating matrix. Our preliminary experimental results show the significant potential of our approach. They also call into question the previous practice of aggregating item usage counts into ratings.
The effect of links on networked user engagement BIBAFull-Text 641-642
  Elad Yom-Tov; Mounia Lalmas; Georges Dupret; Ricardo Baeza-Yates; Pinard Donmez; Janette Lehmann
In the online world, user engagement refers to the phenomena associated with being captivated by a web application and wanting to use it longer and more frequently. Nowadays, many providers operate multiple content sites, very different from each other. Due to their extremely varied content, these are usually studied and optimized separately. However, user engagement should be examined not only within individual sites, but also across sites, that is, across the entire content provider network. In previous work, we investigated networked user engagement by defining a global measure of engagement that captures the effect that sites have on the engagement of other sites within the same browsing session. Here, we look at the effect of links on networked user engagement, as links are commonly used by online content providers to increase user engagement.
Investigating bias in traditional media through social media BIBAFull-Text 643-644
  Arjumand Younus; Muhammad Atif Qureshi; Suneel Kumar Kingrani; Muhammad Saeed; Nasir Touheed; Colm O'Riordan; Pasi Gabriella
It is often the case that traditional media provide coverage of a news event on the basis of journalists' viewpoints -- a problem termed in the literature as media bias. On the other hand social media have given birth to an alternative paradigm of journalism known as "citizen journalism". We take advantage of citizen journalism to detect the bias in traditional media and propose a simple model for empirical measurement of media bias.
Enhancing naive Bayes with various smoothing methods for short text classification BIBAFull-Text 645-646
  Quan Yuan; Gao Cong; Nadia Magnenat Thalmann
Partly due to the proliferation of microblogs, short texts are becoming prominent. A huge number of short texts are generated every day, which calls for a method that can efficiently accommodate new data to incrementally adjust classification models. Naive Bayes meets such a need. We apply several smoothing models to Naive Bayes for question topic classification, as an example of short text classification, and study their performance. The experimental results on a large real-world question dataset show that the smoothing methods significantly improve the question classification performance of Naive Bayes. We also study the effect of training data size and question length on performance.
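One representative smoothing method (a sketch, not necessarily one of the paper's exact models) is Jelinek-Mercer interpolation of each class's word distribution with a background model, which keeps unseen-word probabilities non-zero for very short texts; the toy questions, labels, and lambda below are hypothetical:

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (tokens, label) pairs. Returns per-class counts,
    per-class token totals, and a background (corpus-wide) model."""
    cls_counts, cls_totals, background = {}, Counter(), Counter()
    for tokens, label in docs:
        cls_counts.setdefault(label, Counter()).update(tokens)
        cls_totals[label] += len(tokens)
        background.update(tokens)
    return cls_counts, cls_totals, background

def classify(tokens, cls_counts, cls_totals, background, lam=0.7):
    """Jelinek-Mercer smoothing: p(w|c) = lam*ML(w|c) + (1-lam)*p(w|bg).
    Class priors are assumed uniform for brevity; words absent from the
    whole corpus would need an extra floor in a real system."""
    n_bg = sum(background.values())
    best, best_lp = None, float('-inf')
    for label, counts in cls_counts.items():
        lp = sum(log(lam * counts[w] / cls_totals[label]
                     + (1 - lam) * background[w] / n_bg)
                 for w in tokens)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical short "questions" with topic labels
docs = [("how install python".split(), "tech"),
        ("best pizza recipe".split(), "food"),
        ("python error fix".split(), "tech")]
model = train(docs)
label = classify("python install".split(), *model)
```

Incremental adjustment, the property the abstract highlights, falls out naturally: new documents just update the counters without retraining from scratch.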
Filtering and ranking schemes for finding inclusion dependencies on the web BIBAFull-Text 647-648
  Erika Yumiya; Atsuyuki Morishima; Masami Takahashi; Shigeo Sugimoto; Hiroyuki Kitagawa
This paper addresses the problem of finding inclusion dependencies on the Web. In our approach, we enumerate pairs of HTML/XML elements that possibly represent inclusion dependencies and then rank the results for verification. This paper focuses on the challenges in the finding and ranking processes.
Exploiting shopping and reviewing behavior to re-score online evaluations BIBAFull-Text 649-650
  Rong Zhang; ChaoFeng Sha; Minqi Zhou; Aoying Zhou
Analysis of product reviews has attracted great attention from both academia and industry. Generally, the evaluation scores of reviews are used to generate the average scores of products and shops for future potential customers. However, in the real world, there is an inconsistency problem between evaluation scores and review content, and some customers do not give fair reviews. In this work, we focus on detecting the credibility of customers by analyzing online shopping and review behaviors, and then we re-score the reviews for products and shops. Finally, we evaluate our algorithm on a real data set from Taobao, the biggest e-commerce site in China.

SWDM'12 workshop 1

Information cascades in social media in response to a crisis: a preliminary model and a case study BIBAFull-Text 653-656
  Cindy Hui; Yulia Tyshchuk; William A. Wallace; Malik Magdon-Ismail; Mark Goldberg
The focus of this paper is on demonstrating how a model of the diffusion of actionable information can be used to study information cascades on Twitter that are in response to an actual crisis event, and its concomitant alerts and warning messages from emergency managers. We will: identify the types of information requested or shared during a crisis situation; show how messages spread among the users on Twitter including what kinds of information cascades or patterns are observed; and note what these patterns tell us about information flow and the users. We conclude by noting that emergency managers can use this information to either facilitate the spreading of accurate information or impede the flow of inaccurate or improper messages.
User community reconstruction using sampled microblogging data BIBAFull-Text 657-660
  Miki Enoki; Yohei Ikawa; Raymond Rudy
User community recognition in social media services is important for identifying hot topics or users' interests and concerns in a timely way when a disaster has occurred. In microblogging services, many short messages are posted every day, and some of them represent replies or forwarded messages between users. We extract such conversational messages to link the users into a user network and regard the strongly-connected components in the network as indicators of user communities. However, using all of the microblog data for user community extraction is too costly and requires too much storage space when decomposing strongly-connected components. In contrast, using sampled data may miss some user connections and thus divide one user community into pieces. In this paper, we propose a method for user community reconstruction using the lexical similarity of the messages and the users' link information between separate communities.
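The strongly-connected-component step the abstract mentions can be sketched with Kosaraju's algorithm over a directed reply graph; the graph below is a toy example, not the authors' data:

```python
def sccs(graph):
    """Kosaraju's algorithm: strongly-connected components of a directed
    graph given as {node: set(successors)}."""
    order, seen = [], set()

    def dfs(u):  # first pass: record finish order
        seen.add(u)
        for v in graph.get(u, ()):
            if v not in seen:
                dfs(v)
        order.append(u)

    for u in graph:
        if u not in seen:
            dfs(u)

    rev = {}  # transpose graph
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, set()).add(u)

    comps, assigned = [], set()
    for u in reversed(order):  # second pass on the transpose
        if u in assigned:
            continue
        stack, comp = [u], set()
        while stack:
            x = stack.pop()
            if x in assigned:
                continue
            assigned.add(x)
            comp.add(x)
            stack.extend(rev.get(x, ()))
        comps.append(comp)
    return comps

# Toy reply graph: a<->b<->c converse mutually; d replies to a one-way
graph = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}, 'd': {'a'}}
comps = sccs(graph)
communities = [c for c in comps if len(c) > 1]  # non-trivial SCCs as communities
```

On the full data this decomposition is what becomes expensive; the paper's contribution is stitching the fragments such components break into under sampling back together.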
Towards situational pattern mining from microblogging activity BIBAFull-Text 661-666
  Nathan Gnanasambandam; Keith Thompson; Ion Florie Ho; Sarah Lam; Sang Won Yoon
Many useful patterns can be derived from analyzing microblogging behavior at different scales (individual and social group). In this paper, we derive patterns relating to spatio-temporal traffic flow, visit regularity, content and social ties as they relate to an individual's activities in an urban environment (e.g., New York City). We also demonstrate, through an example, methods for reasoning about the activities, locations and group structures that may underlie microblogging messages in the aforementioned context of mining situational patterns. These individual and group situational patterns can be crucial when planning for disruptions and organized response.
Mining conversations of geographically changing users BIBAFull-Text 667-670
  Liam McNamara; Christian Rohner
In recent disaster events, social media has proven to be an effective communication tool for affected people. The corpus of generated messages contains valuable information about the situation, needs, and locations of victims. We propose an approach to extract significant aspects of user discussions to better inform responders and enable an appropriate response. The methodology combines location based division of users together with standard text mining (term frequency inverse document frequency) to identify important topics of conversation in a dynamic geographic network. We further suggest that both topics and movement patterns change during a disaster, which requires identification of new trends. When applied to an area that has suffered a disaster, this approach can provide 'sensemaking' through insights into where people are located, where they are going and what they communicate when moving.
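The location-based division plus TF-IDF step can be sketched by treating all messages from one region as a single document and scoring terms across regions, so that region-specific topics stand out; the region names and messages below are hypothetical:

```python
from collections import Counter
from math import log

def region_tfidf(region_docs):
    """region_docs: {region: [messages]}. Pools each region's messages into
    one document and returns per-region TF-IDF scores, where IDF is computed
    over regions rather than individual messages."""
    tf = {r: Counter(w for msg in msgs for w in msg.split())
          for r, msgs in region_docs.items()}
    n = len(region_docs)
    df = Counter()  # number of regions each term appears in
    for counts in tf.values():
        df.update(counts.keys())
    return {r: {w: c * log(n / df[w]) for w, c in counts.items()}
            for r, counts in tf.items()}

# Hypothetical disaster messages grouped by geographic region
regions = {
    "north": ["bridge collapsed near river", "bridge closed"],
    "south": ["water shortage reported", "need water supplies"],
}
scores = region_tfidf(regions)
top_north = max(scores["north"], key=scores["north"].get)
```

Terms used everywhere get IDF of zero, while locally frequent terms like "bridge" surface as the dominant topic of their region, which is the 'sensemaking' signal the abstract describes.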
Characterization of social media response to natural disasters BIBAFull-Text 671-674
  Seema Nagar; Aaditeshwar Seth; Anupam Joshi
Online social networking websites such as Twitter and Facebook often serve a breaking-news role for natural disasters: these websites are among the first to mention the news, and because they are visited regularly by millions of users, they also help communicate the news to a large mass of people. In this paper, we examine how news about these disasters spreads on the social network. In addition, we examine the countries of the tweeting users. We examine Twitter logs from the 2010 Philippines typhoon, the 2011 Brazil flood and the 2011 Japan earthquake. We find that although news about a disaster may be initiated in multiple places in the social network, it quickly finds a core community that is interested in the disaster, and has little chance to escape that community via social network links alone. We also find evidence that the world at large expresses concern about such large-scale disasters, and not just countries geographically proximate to the epicenter of the disaster. Our analysis has implications for the design of fund-raising campaigns through social networking websites.
Rumor spreading and inoculation of nodes in complex networks BIBAFull-Text 675-678
  Anurag Singh; Yatindra Nath Singh
Rumors can spread over the Internet or on social networks and can severely affect society during a disaster. The question one asks about this phenomenon is whether these rumors can be suppressed using suitable mechanisms. One possible solution is to inoculate a certain fraction of nodes against rumors. The inoculation can be done randomly or in a targeted fashion. In this paper, a small-world network model has been used to investigate the efficiency of inoculation. It has been found that if the average degree of the small-world network is small, then both inoculation methods are successful. When the average degree is large, neither method is able to stop rumor spreading. But if the acceptability of the rumor is reduced along with inoculation, rumor spreading can be stopped even in this case.
   The proposed hypothesis has been verified using simulation experiments.
Bursty event detection from text streams for disaster management BIBAFull-Text 679-682
  Sungjun Lee; Sangjin Lee; Kwanho Kim; Jonghun Park
In this paper, an approach to automatically identifying bursty events from multiple text streams is presented. We investigate the characteristics of bursty terms that appear in the documents generated from text streams, and incorporate those characteristics into a term weighting scheme that distinguishes bursty terms from other non-bursty terms. Experimental results based on the news corpus show that our approach outperforms the existing alternatives in extracting bursty terms from multiple text streams. The proposed research is expected to contribute to increasing the situational awareness of ongoing events particularly when a natural or economic disaster occurs.
Automatic sub-event detection in emergency management using social media BIBAFull-Text 683-686
  Daniela Pohl; Abdelhamid Bouchachia; Hermann Hellwagner
Emergency management is about assessing critical situations, followed by decision making as a key step. Clearly, information is crucial in this two-step process. Social (multi)media turns out to be an interesting source for collecting information about an emergency situation. In particular, situational information can be captured in the form of pictures, videos, or text messages. The present paper investigates the application of multimedia metadata to identify the set of sub-events related to an emergency situation. The metadata used is compiled from Flickr and YouTube during an emergency situation, and the identification of the events relies on clustering. Initial results presented in this paper show how social media data can be used to detect different sub-events in a critical situation.
Location inference using microblog messages BIBAFull-Text 687-690
  Yohei Ikawa; Miki Enoki; Michiaki Tatsubori
In order to sense and analyze disaster information from social media, microblogs have recently attracted attention as sources of social data. In this paper, we attempt to discover geolocation information from microblog messages to assess disasters. Since microblog services are timelier than other social media, understanding the geolocation of each microblog message is useful for quickly responding to a sudden disaster. Some microblog services provide a function for adding geolocation information to messages from mobile devices equipped with GPS. However, few users use this function, so most messages do not have geolocation information. Therefore, we attempt to discover the location where a message was generated from its textual content. The proposed method learns associations between a location and its relevant keywords from past messages, and guesses where a new message came from.
SocialEMIS: improving emergency preparedness through collaboration BIBAFull-Text 691-694
  Ouejdane Mejri; Pierluigi Plebani
The definition of the contingency plan during the preparedness phase holds a crucial role in emergency management. A proper emergency response, indeed, requires the implementation of a contingency plan that can be accurate only if different people with different skills are involved. The goal of this paper is to introduce SocialEMIS, a first prototype of a tool that supports the collaborative definition of contingency plans. Although the current implementation is now focused on the role of the emergency operators, the accuracy of the plan will also take advantage of information coming from the citizens in future releases. Moreover, the contingency plans defined with SocialEMIS represent a knowledge base for defining other contingency plans.
Emergency situation awareness from Twitter for crisis management BIBAFull-Text 695-698
  Mark A. Cameron; Robert Power; Bella Robinson; Jie Yin
This paper describes ongoing work with the Australian Government to detect, assess, summarise, and report messages of interest for crisis coordination published on Twitter. The developed platform and client tools, collectively termed the Emergency Situation Awareness -- Automated Web Text Mining (ESA-AWTM) system, demonstrate how relevant Twitter messages can be identified and utilised to inform situation awareness of an emergency incident as it unfolds.
   A description of the ESA-AWTM platform is presented, detailing how it may be used in real-life emergency management scenarios. These scenarios are focused on general use cases to provide: evidence of pre-incident activity; near-real-time notification of an incident occurring; first-hand reports of incident impacts; and a gauge of the community response to an emergency warning. Our tools have recently been deployed in a trial for use by crisis coordinators.
MECA: mobile edge capture and analysis middleware for social sensing applications BIBAFull-Text 699-702
  Fan Ye; Raghu Ganti; Raheleh Dimaghani; Keith Grueneberg; Seraphin Calo
In this paper, we propose and develop MECA, a common middleware infrastructure for data collection from mobile devices in an efficient, flexible, and scalable manner. It provides a high-level abstraction of phenomena such that applications can express diverse data needs in a declarative fashion. MECA coordinates data collection and primitive processing activities so that data can be shared among applications. It addresses the inefficiency of the current vertical integration approach. We showcase the benefits of MECA by means of a disaster management application.
The use of social media within the global disaster alert and coordination system (GDACS) BIBAFull-Text 703-706
  Beate Stollberg; Tom de Groeve
The Global Disaster Alert and Coordination System (GDACS) collects near real-time hazard information to provide global multi-hazard disaster alerting for earthquakes, tsunamis, tropical cyclones, floods and volcanoes. GDACS alerts are based on calculations from physical disaster parameters and are used by emergency responders. In 2011, the Joint Research Centre (JRC) of the European Commission started exploring if and how social media could be an additional valuable data source for international disaster response. The question is whether awareness of the situation after a disaster could be improved by the use of social media tools and data. To explore this, JRC developed a Twitter account and Facebook page for the dissemination of GDACS alerts, a Twitter parser for monitoring information, and a mobile application for information exchange. This paper presents the Twitter parser and the intermediate results of the data analysis, which show that parsing Twitter feeds (so-called tweets) can provide important information about the side effects of disasters, the perceived impact of a hazard, and the reaction of the affected population. The most important result is that impact information on collapsed buildings was detected through tweets within the first half hour after an earthquake occurred and before any mass media reported the collapse.
Evaluating the impact of incorporating information from social media streams in disaster relief routing BIBAFull-Text 707-708
  Ashlea Bennett Milburn; Clarence L. Wardell
In this paper, we describe a model that can be used to evaluate the impact of using imperfect information when routing supplies for disaster relief. Using two objectives, maximizing the population supported, and minimizing response time, we explore the potential tradeoffs (e.g. more information, but possibly less accurate) of using information from social media streams to inform routing and resource allocation decisions immediately after a disaster.
Tweeting about the tsunami?: mining Twitter for information on the Tohoku earthquake and tsunami BIBAFull-Text 709-710
  Akiko Murakami; Tetsuya Nasukawa
On 11th March 2011, a 9.0-magnitude megathrust earthquake occurred in the ocean near Japan. This was the first large-scale natural disaster in Japan since the broad adoption of social media tools (such as Facebook and Twitter). In particular, Twitter is suitable for broadcasting information, naturally making it the most frequently used social medium when disasters strike. This paper presents a topical analysis using text mining tools and shows the tools' effectiveness for analyzing social media data after a disaster. Though an ad hoc system without prepared resources was useful, an improved system with some syntactic pattern dictionaries showed better results.
Mass and social media corpus analysis after the 2011 great east Japan earthquake BIBAFull-Text 711-712
  Shosuke Sato; Michiaki Tatsubori; Fumihiko Imamura
In this paper, we outline our analysis of mass media and social media as used for disaster management. We looked at the differences among multiple sub-corpuses to find relatively unique keywords based on chronologies, geographic locations, or media types. We are currently analyzing a massive corpus collected from Internet news sources and Twitter after the Great East Japan Earthquake.
Social media and SMS in the Haiti earthquake BIBAFull-Text 713-714
  Julie Dugdale; Bartel Van de Walle; Corinna Koeppinghoff
We describe the first results of an empirical study of how social media and SMS were used in coordinating humanitarian relief after the Haiti earthquake of January 2010. Current information systems for crisis management increasingly incorporate information obtained from citizens via social media and SMS. This information proves particularly useful at the aggregate level. However, it has led to some problems: information overload and processing difficulties, variable speed of information delivery, managing volunteer communities, and the high risk of receiving inaccurate or incorrect information.
Social web in disaster archives BIBAFull-Text 715-716
  Michiaki Tatsubori; Hideo Watanabe; Akihiro Shibayama; Shosuke Sato; Fumihiko Imamura
Preserving social Web datasets is a crucial part of research work for disaster management based on information from social media. This paper describes the Michinoku Shinrokuden disaster archive project, mainly dedicated to archiving data from the 2011 Great East Japan Earthquake and its aftermath. Social websites should of course be part of this archive. We discuss issues in archiving social websites for the disaster management research communities and introduce our vision for Michinoku Shinrokuden.

XperienceWeb'12 Workshop 2

Extraction of onomatopoeia used for foods from food reviews and its application to restaurant search BIBAFull-Text 719-728
  Ayumi Kato; Yusuke Fukazawa; Tomomasa Sato; Taketoshi Mori
Onomatopoeia is widely used in reviews of food and restaurants. In this paper, we propose and evaluate a method to automatically extract onomatopoeia, including previously unknown ones, from food review sites. The evaluation shows that we can extract onomatopoeia for specific foods with more than 46% precision; among the 62 extracted onomatopoeia, we found 18 unknown ones, i.e. not registered in an existing onomatopoeia dictionary. In addition, we propose a system that presents the user with a list of onomatopoeia specific to a restaurant she is interested in. The evaluation results indicate that an intuitive restaurant search can be done via a list of onomatopoeia, and that such lists are helpful for selecting food or restaurants.
Solution mining for specific contextualised problems: towards an approach for experience mining BIBAFull-Text 729-738
  Christian Severin Sauer; Thomas Roth-Berghofer
In this paper we describe the task of automated mining for solutions to highly specific problems. We do so under the premise of mapping the split view on context, introduced by Brézillon and Pomerol, onto three different levels of abstraction of a problem domain. This is done to integrate the notion of activity or focus, and its influence on the context, into the mining for a solution. We assume that a problem's context describes key characteristics that serve as decisive criteria in mining successful solutions for it. We further detail the process by which a chain of sub-problems and their foci add up to a meta-problem solution, and how this can be used to mine for such solutions. Through a guiding example we introduce the basic steps of the solution mining process and common aspects we deem worth closer analysis in upcoming research on solution mining. We further examine the possible integration of these newly established outlines for automatic solution mining for highly specific problems into SEASALTexp, an architecture currently under development for explanation-aware extraction and case-based processing of experiences from Internet communities. We thereby gained first insights into issues that occur when integrating automatic solution mining.
Extraction of procedural knowledge from the web: a comparison of two workflow extraction approaches BIBAFull-Text 739-745
  Pol Schumacher; Mirjam Minor; Kirstin Walter; Ralph Bergmann
User-generated Web content includes large amounts of procedural knowledge (also called how-to knowledge). This paper presents a comparison of two methods for extracting procedural knowledge from the Web. Both methods create workflow representations automatically from text, with the aim of reusing the Web experience through reasoning methods. Two variants of the workflow extraction process are introduced and evaluated in experiments with cooking recipes as a sample domain. The first variant is a term-based approach that integrates standard information extraction methods from the GATE system. The second variant is a frame-based approach implemented by means of the SUNDANCE system. The expert assessment of the extraction results clearly shows that the more sophisticated frame-based approach outperforms the term-based approach to automated workflow extraction.
Collecting, reusing and executing private workflows on social network platforms BIBAFull-Text 747-750
  Sebastian Görg; Ralph Bergmann; Mirjam Minor; Sarah Gessinger; Siblee Islam
We propose a personal workflow management service, as part of a social network, that enables private users to construct personal workflows according to their specific needs and to keep track of workflow execution. Unlike traditional workflows, such personal workflows aim at supporting processes that contain personal tasks and data. Our proposal includes a process-oriented case-based reasoning approach to support private users in obtaining an appropriate personal workflow through the sharing and reuse of respective experience.
Contextual trace-based video recommendations BIBAFull-Text 751-754
  Raafat Zarka; Amélie Cordier; Elöd Egyed-Zsigmond; Alain Mille
People like creating their own videos by mixing various contents. Many applications allow us to generate video clips by merging different media such as video clips, photos, text and sounds. Some of these applications enable us to combine online content with our own resources. Given the large amount of content available, the problem is to quickly find content that truly meets our needs. This is where recommender systems come in. In this paper, we propose an approach for contextual video recommendations based on Trace-Based Reasoning.
Learning from users' querying experience on intranets BIBAFull-Text 755-764
  Ibrahim Adepoju Adeyanju; Dawei Song; M-Dyaa Albakour; Udo Kruschwitz; Anne De Roeck; Maria Fasli
Query recommendation is becoming a common feature of web search engines, especially those for Intranets, where the context is more restrictive. This is because of its utility in helping users find relevant information in less time by using the most suitable query terms. Queries for recommendation are typically selected by mining web documents or the search logs of previous users. We propose integrating these approaches by combining two models, namely the concept hierarchy, typically built from an Intranet's documents, and the query flow graph, typically built from search logs. However, we build our concept hierarchy model from terms extracted from a subset (training set) of the search logs, since these are more representative of the user view of the domain than concepts extracted from the collection. We then continually adapt the model by incorporating query refinements from another subset (test set) of the user search logs. This process amounts to learning from, or reusing, previous users' querying experience to recommend queries for a new but similar user query. The adaptation weights are extracted from a query flow graph built with the same logs. We evaluated our hybrid model using documents crawled from the Intranet of an academic institution and its search logs. The hybrid model was then compared to a concept hierarchy model and a query flow graph built from the same collection and search logs, respectively. We also tested various strategies for combining information in the search logs with respect to the frequency of clicked documents after query refinement. Our hybrid model significantly outperformed the concept hierarchy model and the query flow graph when tested over two different periods of the academic year. We intend to further validate our experiments with documents and search logs from another institution and to devise better strategies for selecting queries for recommendation from the hybrid model.
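The query flow graph used above can be sketched compactly: nodes are queries, and a directed edge from q1 to q2 is weighted by how often users refined q1 into q2 within a session. A minimal illustration (the session data and the frequency-based ranking scheme are assumptions for the example, not details from the paper):

```python
from collections import defaultdict

def build_query_flow_graph(sessions):
    """Count q1 -> q2 refinements over consecutive queries in each session."""
    edges = defaultdict(int)
    for session in sessions:
        for q1, q2 in zip(session, session[1:]):
            if q1 != q2:
                edges[(q1, q2)] += 1
    return edges

def recommend(edges, query, k=3):
    """Rank follow-up queries for `query` by normalized edge weight."""
    out = {q2: w for (q1, q2), w in edges.items() if q1 == query}
    total = sum(out.values())
    ranked = sorted(out, key=out.get, reverse=True)[:k]
    return [(q, out[q] / total) for q in ranked]

sessions = [
    ["exam timetable", "exam timetable 2012", "exam results"],
    ["exam timetable", "exam timetable 2012"],
    ["library hours", "library opening hours"],
]
g = build_query_flow_graph(sessions)
print(recommend(g, "exam timetable"))
```

The refinement "exam timetable" to "exam timetable 2012" occurs in two sessions, so it dominates the recommendation list for that query.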

CQA'12 Workshop 3

Exploiting user profile information for answer ranking in cQA BIBAFull-Text 767-774
  Zhi-Min Zhou; Man Lan; Zheng-Yu Niu; Yue Lu
Answer ranking is very important for cQA services due to the high variance in the quality of answers. Most existing work in this area focuses on using various features or employing machine learning techniques to address this problem. Only a few studies have involved user profile information in this particular task. In this work, we assume a close relationship between user profile information and answer quality, on the grounds that a user profile summarizes the user's behavior and history. We thus investigated the effectiveness of three categories of user profile information, i.e. engagement-related, authority-related and level-related, for answer ranking in cQA. Unlike previous work, we only employed information that is easy to extract without any limitations, such as user privacy. Experimental results on Yahoo! Answers manner questions showed that our system using user profile information achieved comparable or even better results than the state-of-the-art baseline system. Moreover, we found that whether a user has a profile picture in the cQA community contributed more to the answer ranking task than any other information.
Analyzing and predicting question quality in community question answering services BIBAFull-Text 775-782
  Baichuan Li; Tan Jin; Michael R. Lyu; Irwin King; Barley Mak
Users ask and answer questions in community question answering (CQA) services to seek information and share knowledge. As a result, myriad questions and answers appear in CQA services. Accordingly, many studies have explored answer quality so as to provide a preliminary screening for better answers. However, to our knowledge, less attention has so far been paid to question quality in CQA. Knowing question quality enables us to find and recommend good questions and to identify bad ones that hinder the CQA service. In this paper, we conduct two studies to investigate the question quality issue. The first study analyzes the factors of question quality and finds that the interaction between askers and topics accounts for the differences in question quality. Based on this finding, in the second study we propose a Mutual Reinforcement-based Label Propagation (MRLP) algorithm to predict question quality. We experiment with Yahoo! Answers data, and the results demonstrate the effectiveness of our algorithm in distinguishing high-quality questions from low-quality ones.
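The abstract does not spell out the MRLP update rule, but the mutual-reinforcement idea behind label propagation on an asker-question graph can be sketched as follows. In this toy version (the clamped-seed update and data layout are assumptions for illustration, not the paper's algorithm), asker reputation and question quality reinforce each other until convergence:

```python
def mrlp(question_asker, seed_labels, iters=50):
    """Alternately update asker reputation and question quality.

    question_asker: {question_id: asker_id}
    seed_labels:    {question_id: 0.0 or 1.0} for the labelled subset
    """
    q_score = {q: seed_labels.get(q, 0.5) for q in question_asker}
    for _ in range(iters):
        # asker score = mean quality of that asker's questions
        a_sum, a_cnt = {}, {}
        for q, a in question_asker.items():
            a_sum[a] = a_sum.get(a, 0.0) + q_score[q]
            a_cnt[a] = a_cnt.get(a, 0) + 1
        a_score = {a: a_sum[a] / a_cnt[a] for a in a_sum}
        # unlabelled questions inherit their asker's score; seeds stay clamped
        for q, a in question_asker.items():
            q_score[q] = seed_labels.get(q, a_score[a])
    return q_score

qa = {"q1": "alice", "q2": "alice", "q3": "bob", "q4": "bob"}
scores = mrlp(qa, {"q1": 1.0, "q3": 0.0})
print(scores["q2"], scores["q4"])
```

Alice's unlabelled question q2 converges towards her labelled high-quality question, while Bob's q4 converges towards his labelled low-quality one.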
A classification-based approach to question routing in community question answering BIBAFull-Text 783-790
  Tom Chao Zhou; Michael R. Lyu; Irwin King
Community-based Question and Answering (CQA) services have brought users to a new era of knowledge dissemination by allowing users to ask questions and to answer other users' questions. However, due to the rapid increase in posted questions and the lack of an effective way to find interesting ones, there is a serious gap between posted questions and potential answerers. This gap may degrade a CQA service's performance as well as reduce users' loyalty to the system. To bridge the gap, we present a new approach to Question Routing, which aims at routing questions to participants who are likely to provide answers. We treat question routing as a classification task, and develop a variety of local and global features that capture different aspects of questions, users, and their relations. Our experimental results, obtained from an evaluation over the Yahoo! Answers dataset, demonstrate the high feasibility of question routing. We also perform a systematic comparison of how different types of features contribute to the final results and show that question-user relationship features play a key role in improving the overall performance.
Finding expert users in community question answering BIBAFull-Text 791-798
  Fatemeh Riahi; Zainab Zolaktaf; Mahdi Shafiei; Evangelos Milios
Community Question Answering (CQA) websites provide a rapidly growing source of information in many areas. This rapid growth, while offering new opportunities, puts forward new challenges. In most CQA implementations there is little effort in directing new questions to the right group of experts. This means that experts are not provided with questions matching their expertise, and therefore new matching questions may be missed and not receive a proper answer. We focus on finding experts for a newly posted question. We investigate the suitability of two statistical topic models for solving this issue and compare these methods against more traditional Information Retrieval approaches. We show that for a dataset constructed from the Stackoverflow website, these topic models outperform other methods in retrieving a candidate set of best experts for a question. We also show that the Segmented Topic Model gives consistently better performance compared to the Latent Dirichlet Allocation Model.
QAque: faceted query expansion techniques for exploratory search using community QA resources BIBAFull-Text 799-806
  Atsushi Otsuka; Yohei Seki; Noriko Kando; Tetsuji Satoh
Recently, query suggestions have become quite useful in web search. Most provide additional or corrected terms based on the initial query entered by the user. However, query suggestions often recommend queries that differ from the user's search intention owing to differing contexts. In such cases, faceted query expansions are quite efficient. In this paper, we propose faceted query expansion methods using the resources of Community Question Answering (CQA), a social network service (SNS) in which users share knowledge. On a CQA site, users post questions in a suitable category, and others answer them within that category framework. Thus, the CQA "category" forms one "facet" of the query expansion. In addition, the time of year when a question was posted plays an important role in understanding its context; such seasonality forms another "facet" of the query expansion. We implement two-dimensional faceted query expansion methods based on the results of Latent Dirichlet Allocation (LDA) analysis of CQA resources. The question articles from which expansions are derived are shown to users so that they can choose appropriate terms. Our evaluations using actual, long-term CQA resources, such as "Yahoo! CHIEBUKURO", demonstrate that many CQA questions are posted periodically and in bursts.
Socio-semantic conversational information access BIBAFull-Text 807-814
  Saurav Sahay; Ashwin Ram
We develop an innovative approach to delivering relevant information using a combination of socio-semantic search and filtering approaches. The goal is to facilitate timely and relevant information access through the medium of conversations, by mixing past community-specific conversational knowledge and web information access to recommend and connect users and information. Conversational Information Access is a socio-semantic search and recommendation activity whose goal is to interactively engage people in conversations through agent-supported recommendations. It is useful because people engage in online social discussions, unlike solitary search; the agent brings in relevant information as well as identifying relevant users, and participants provide feedback during the conversation that the agent uses to improve its recommendations.
Why do you ask this? BIBAFull-Text 815-822
  Giovanni Gardelli; Ingmar Weber
We use Yahoo! Toolbar data to gain insights into why people use Q&A sites. For this purpose we look at tens of thousands of questions asked on both Yahoo! Answers and Wiki Answers. We analyze both the pre-question behavior of users and their general online behavior. Using an existing approach (Harper et al.), we classify questions into "informational" vs. "conversational". Finally, for a subset of users on Yahoo! Answers we also integrate age and gender into our analysis. Our results indicate that there is a one-dimensional spectrum of users ranging from "social users" to "informational users". In terms of demographics, we found that both younger and female users are more "social" on this scale, with older and male users being more "informational".
   Concerning the pre-question behavior, users who first issue a question-related query, and especially those who do not click any web results, are more likely to issue informational questions than users who do not search before. Questions asked shortly after the registration of a new user on Yahoo! Answers tend to be social and have a lower probability of being preceded by a web search than other questions.
   Finally, we observed evidence both for and against topical congruence between a user's questions and his web queries.
Understanding user intent in community question answering BIBAFull-Text 823-828
  Long Chen; Dell Zhang; Mark Levene
Community Question Answering (CQA) services, such as Yahoo! Answers, are specifically designed to address the innate limitations of Web search engines by helping users obtain information from a community. Understanding the user intent of questions would enable a CQA system to identify similar questions, find relevant answers, and recommend potential answerers more effectively and efficiently. In this paper, we propose to classify questions into three categories according to their underlying user intent: subjective, objective, and social. In order to identify the user intent of a new question, we build a predictive model through machine learning based on both text and metadata features. Our investigation reveals that these two types of features are conditionally independent and that each of them is sufficient for prediction. They can therefore be exploited as two views in co-training -- a semi-supervised learning framework -- to make use of a large amount of unlabelled questions, in addition to the small set of manually labelled questions, for enhanced question classification. The preliminary experimental results show that co-training works significantly better than simply pooling these two types of features together.
Churn prediction in new users of Yahoo! answers BIBAFull-Text 829-834
  Gideon Dror; Dan Pelleg; Oleg Rokhlenko; Idan Szpektor
One of the important targets of community-based question answering (CQA) services, such as Yahoo! Answers, Quora and Baidu Zhidao, is to maintain and even increase the number of active answerers, that is, the users who provide answers to open questions. The reasoning is that they are the engine behind satisfied askers, which is the overall goal of CQA. Yet, this task is not an easy one. Indeed, our empirical observation shows that many users provide just one or two answers and then leave. In this work we try to detect answerers who are about to quit, a task known as churn prediction, but unlike prior work, we focus on new users. To address this task, we extract a variety of features to model the behavior of Yahoo! Answers users over the first week of their activity, including personal information, rate of activity, and social interaction with other users. Several classifiers trained on the data show that there is a statistically significant signal for discriminating between users who are likely to churn and those who are not. A detailed feature analysis shows that the two most important signals are the total number of answers given by the user, closely related to the user's motivation, and attributes related to the amount of recognition given to the user, measured in counts of best answers, thumbs up and positive responses by the asker.

EMAIL'12 Workshop 4

Email between private use and organizational purpose BIBAFull-Text 837-840
  Uwe V. Riss
Emails have become an eminent source of personal and organizational information. They are not only used for personal communication but also for the management of information and the coordination of activities within organizations. Email traffic also reflects the social networks existing in organizations. However, the central problem we still face is how to tap this rich source appropriately. The main problems in this respect are the personal character of emails (their privacy) and the mainly unstructured character of their contents. Since these two features are essential success factors for the use of email, they cannot simply be ignored. Meanwhile, various approaches exist to recover this hidden treasure and make the contained information available to information and process management; for example, semantic and mining technologies play a prominent role in this effort. The paper gives an overview of different strategies for making organizational use of emails, also touching on the role of privacy.
Emails as graph: relation discovery in email archive BIBAFull-Text 841-846
  Michal Laclavík; Stefan Dlugolinský; Martin Seleng; Marek Ciglan; Ladislav Hluchý
In this paper, we present an approach for representing an email archive in the form of a network, capturing the communication among users and the relations among entities extracted from the textual parts of the email messages. We showcase the method on the Enron email corpus, from which we extract various entities and a social network. The extracted named entities (NEs), such as people, email addresses and telephone numbers, are organized in a graph along with the emails in which they were found. The edges in the graph indicate relations between NEs and represent co-occurrence in the same email part, paragraph, sentence or composite NE. We study the mathematical properties of the graphs so created and describe our hands-on experience with the processing of such structures. The Enron graph corpus contains a few million nodes and is large enough for experimenting with various graph-querying techniques, e.g. graph traversal or spreading activation. Due to its size, the use of traditional graph processing libraries can be problematic, as they keep the whole structure in memory. We describe our experience with the management of such data and with relation discovery among the extracted entities. This experience might be valuable for practitioners and highlights several research challenges.
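The co-occurrence graph described above can be sketched with toy extractors: entities found in the same message part become linked, with edge weights counting co-occurrences. The regular expressions below are deliberately simplistic stand-ins for the paper's named entity extraction, and the messages are invented examples:

```python
import re
from collections import defaultdict
from itertools import combinations

# Toy extractors for two of the entity types mentioned in the paper
EMAIL_RE = re.compile(r"[\w.]+@[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")

def cooccurrence_graph(messages):
    """Edge weight = number of messages in which two entities co-occur."""
    edges = defaultdict(int)
    for text in messages:
        entities = set(EMAIL_RE.findall(text)) | set(PHONE_RE.findall(text))
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges

msgs = [
    "Call Jeff at 555-1234 or mail jeff@enron.com",
    "Forwarded: jeff@enron.com, cc kate@enron.com",
]
g = cooccurrence_graph(msgs)
print(g[("555-1234", "jeff@enron.com")])
```

The same idea extends to finer-grained co-occurrence (paragraph or sentence level) by splitting each message before extraction.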
Interpreting contact details out of e-mail signature blocks BIBAFull-Text 847-850
  Gaëlle Recourcé
This paper describes a fully automated process of address-book enrichment by means of information extraction from e-mail signature blocks. The main issues we tackle are signature block detection, named entity tagging, mapping to a specific person, standardizing the details, and auto-updating the address book. We adopted a symbolic approach for the NLP modules. We describe how the process was designed to handle multiple types of errors (human or computer-driven) while aiming at a 100% precision rate. Last, we tackle the question of automatic updating confronted with users' rights over their own data.
Context-sensitive business process support based on emails BIBAFull-Text 851-856
  Thomas Burkhart; Dirk Werth; Peter Loos
In many companies, a majority of business processes take place via email communication. Large enterprises can operate enterprise systems for successful business process management. However, these systems are not appropriate for SMEs, which are the most common enterprise type in Europe. Thus, the European research project Commius addresses the special needs of SMEs and the characteristics of email communication, namely its high flexibility and unstructured nature. Commius turns the existing email system into a structured process management framework. Each incoming email is autonomously matched to the corresponding business process and enhanced with proactive annotations. These context-sensitive annotations include recommendations for the most suitable next process steps. An underlying, self-adjusting recommendation model ensures the most appropriate recommendations by observing actual user behavior. This implies that the proposed process course is in no way obligatory: to provide a high degree of flexibility, any deviation from the given process structure is allowed.
Full-text search in email archives using social evaluation, attached and linked resources BIBAFull-Text 857-860
  Vojtech Juhász
Emails are important tools for communication and cooperation; they contain large amounts of information and connections to knowledge and data sources. It is therefore very important to improve the efficiency of their processing. This paper describes an email search system that integrates full-text search with social search while also processing attached and linked resources. The project described in this paper is still in progress; some proposed parts of the system are not yet implemented or proven. The proposed equation for determining the social importance of an email still has to be tuned during the last phases of development and the evaluation phase. The parts already implemented include content extraction from email messages and from attached and linked resources, as well as textual search and social relation extraction. The next phase of development includes tuning the social evaluation and integrating it with textual search.

AdMIRe'12 Workshop 6

Context-aware music recommender systems: workshop keynote abstract BIBKFull-Text 865-866
  Francesco Ricci
Keywords: Music recommender systems, context awareness, mobile services, tags
Data gathering for a culture specific approach in MIR BIBAFull-Text 867-868
  Xavier Serra
In this paper we describe the data gathering work done within a large research project, CompMusic, which emphasizes a culture specific approach in the automatic description of several world music repertoires. Currently we are focusing on the Hindustani (North India), Carnatic (South India) and Turkish-makam (Turkey) music traditions. The selection and organization of the data to be processed for the characterization of each of these traditions is of the utmost importance.
Music retagging using label propagation and robust principal component analysis BIBAFull-Text 869-876
  Yi-Hsuan Yang; Dmitry Bogdanov; Perfecto Herrera; Mohamed Sordo
The emergence of social tagging websites such as Last.fm has provided new opportunities for learning computational models that automatically tag music. Researchers typically obtain music tags from the Internet and use them to construct machine learning models. Nevertheless, such tags are usually noisy and sparse. In this paper, we present a preliminary study that aims at refining (retagging) social tags by exploiting the content similarity between tracks and the semantic redundancy of the track-tag matrix. The evaluated algorithms include a graph-based label propagation method that is often used in semi-supervised learning and a robust principal component analysis (PCA) algorithm that has led to state-of-the-art results in matrix completion. The results indicate that robust PCA with content similarity constraint is particularly effective; it improves the robustness of tagging against three types of synthetic errors and boosts the recall rate of music auto-tagging by 7% in a real-world setting.
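Robust PCA does not fit in a short sketch, but the graph-based label propagation baseline evaluated above can be illustrated: each track's tag scores are repeatedly mixed with a similarity-weighted average of its content-wise neighbours' scores. The similarity matrix, mixing weight and tag data below are toy assumptions for illustration, not the paper's setup:

```python
def propagate_tags(tags, sim, iters=10, alpha=0.8):
    """Repeatedly mix each track's tag scores with a similarity-weighted
    average of its neighbours' tag scores (semi-supervised smoothing)."""
    tracks = list(tags)
    cur = {t: dict(v) for t, v in tags.items()}
    for _ in range(iters):
        nxt = {}
        for t in tracks:
            norm = sum(sim.get((t, u), 0.0) for u in tracks if u != t) or 1.0
            neigh = {}
            for u in tracks:
                if u == t:
                    continue
                w = sim.get((t, u), 0.0) / norm
                if w == 0.0:
                    continue
                for tag, s in cur[u].items():
                    neigh[tag] = neigh.get(tag, 0.0) + w * s
            # keep alpha of the track's own tags, blend in the rest
            nxt[t] = {tag: alpha * cur[t].get(tag, 0.0)
                           + (1 - alpha) * neigh.get(tag, 0.0)
                      for tag in set(cur[t]) | set(neigh)}
        cur = nxt
    return cur

tags = {"a": {"rock": 1.0}, "b": {}, "c": {"jazz": 1.0}}
sim = {("a", "b"): 1.0, ("b", "a"): 1.0}
out = propagate_tags(tags, sim, iters=5)
print(sorted(out["b"]))
```

Track "b" is untagged but content-similar to "a", so it acquires the "rock" tag while staying free of the unrelated "jazz" tag; this is the sparsity-filling effect retagging aims for.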
Mining microblogs to infer music artist similarity and cultural listening patterns BIBAFull-Text 877-886
  Markus Schedl; David Hauger
This paper aims at leveraging microblogs to address two challenges in music information retrieval (MIR), similarity estimation between music artists and inferring typical listening patterns at different granularity levels (city, country, global). From two collections of several million microblogs, which we gathered over ten months, music-related information is extracted and statistically analyzed. We propose and evaluate four co-occurrence-based methods to compute artist similarity scores. Moreover, we derive and analyze culture-specific music listening patterns to investigate the diversity of listening behavior around the world.
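As an illustration of a co-occurrence-based similarity score (the paper evaluates four such methods; this Jaccard variant is one plausible instance, not necessarily one of theirs): two artists are similar to the extent that the sets of microblogs mentioning them overlap.

```python
def artist_similarity(tweets, a, b):
    """Jaccard overlap of the tweet sets mentioning each artist."""
    in_a = {i for i, t in enumerate(tweets) if a in t.lower()}
    in_b = {i for i, t in enumerate(tweets) if b in t.lower()}
    union = in_a | in_b
    return len(in_a & in_b) / len(union) if union else 0.0

tweets = [
    "listening to Madonna and Lady Gaga on repeat",
    "Lady Gaga live in Vienna!",
    "Madonna's new single is out",
]
print(artist_similarity(tweets, "madonna", "lady gaga"))
```

Restricting the tweet set to one city or country before scoring gives the culture-specific listening patterns the paper analyzes at different granularity levels.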
Melody, bass line, and harmony representations for music version identification BIBAFull-Text 887-894
  Justin Salamon; Joan Serrà; Emilia Gómez
In this paper we compare the use of different musical representations for the task of version identification (i.e. retrieving alternative performances of the same musical piece). We automatically compute descriptors representing the melody and bass line using a state-of-the-art melody extraction algorithm, and compare them to a harmony-based descriptor. The similarity of descriptor sequences is computed using a dynamic programming algorithm based on nonlinear time series analysis which has been successfully used for version identification with harmony descriptors. After evaluating the accuracy of individual descriptors, we assess whether performance can be improved by descriptor fusion, for which we apply a classification approach, comparing different classification algorithms. We show that both melody and bass line descriptors carry useful information for version identification, and that combining them increases version detection accuracy. Whilst harmony remains the most reliable musical representation for version identification, we demonstrate how in some cases performance can be improved by combining it with melody and bass line descriptions. Finally, we identify some of the limitations of the proposed descriptor fusion approach, and discuss directions for future research.
Power-law distribution in encoded MFCC frames of speech, music, and environmental sound signals BIBAFull-Text 895-902
  Martín Haro; Joan Serrà; Álvaro Corral; Perfecto Herrera
Many sound-related applications use Mel-Frequency Cepstral Coefficients (MFCC) to describe audio timbral content. Most of the research efforts dealing with MFCCs have been focused on the study of different classification and clustering algorithms, the use of complementary audio descriptors, or the effect of different distance measures. The goal of this paper is to focus on the statistical properties of the MFCC descriptor itself. For that purpose, we use a simple encoding process that maps a short-time MFCC vector to a dictionary of binary code-words. We study and characterize the rank-frequency distribution of such MFCC code-words, considering speech, music, and environmental sound sources. We show that, regardless of the sound source, MFCC code-words follow a shifted power-law distribution. This implies that there are a few code-words that occur very frequently and many that happen rarely. We also observe that the inner structure of the most frequent code-words has characteristic patterns. For instance, close MFCC coefficients tend to have similar quantization values in the case of music signals. Finally, we study the rank-frequency distributions of individual music recordings and show that they present the same type of heavy-tailed distribution as found in the large-scale databases. This fact is exploited in two supervised semantic inference tasks: genre and instrument classification. In particular, we obtain similar classification results as the ones obtained by considering all frames in the recordings by just using 50 (properly selected) frames. Beyond this particular example, we believe that the fact that MFCC frames follow a power-law distribution could potentially have important implications for future audio-based applications.
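The rank-frequency analysis above can be illustrated on synthetic Zipf-like data: sort code-word frequencies in decreasing order and estimate the power-law exponent. A crude log-log least-squares fit is used here for brevity; rigorous analyses of heavy-tailed data use maximum-likelihood estimators instead, and the synthetic data stands in for real MFCC code-words:

```python
import math
from collections import Counter

def rank_frequency(codewords):
    """Code-word frequencies sorted in decreasing order: f(1) >= f(2) >= ..."""
    return sorted(Counter(codewords).values(), reverse=True)

def powerlaw_exponent(freqs):
    """Negated least-squares slope of log f(r) vs log r --
    a crude exponent estimate for illustration only."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic Zipf-like data: frequency of the r-th code-word ~ 1000 / r
data = []
for rank in range(1, 51):
    data += ["w%d" % rank] * (1000 // rank)

freqs = rank_frequency(data)
print(round(powerlaw_exponent(freqs), 2))
```

On this data the recovered exponent is close to 1, the classic Zipf value; real MFCC code-word distributions are shifted power laws, so the fit would be applied after the shift.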
Creating a large-scale searchable digital collection from printed music materials BIBAFull-Text 903-908
  Andrew Hankinson; John Ashley Burgoyne; Gabriel Vigliensoni; Ichiro Fujinaga
In this paper we present our work towards developing a large-scale web application for digitizing, recognizing (via optical music recognition), correcting, displaying, and searching printed music texts. We present the results of a recently completed prototype implementation of our workflow process, from document capture to presentation on the web. We discuss a number of lessons learned from this prototype. Finally, we present some open-source Web 2.0 tools developed to provide essential infrastructure components for making searchable printed music collections available online. Our hope is that these experiences and tools will help in creating next-generation globally accessible digital music libraries.
The million song dataset challenge BIBAFull-Text 909-916
  Brian McFee; Thierry Bertin-Mahieux; Daniel P. W. Ellis; Gert R. G. Lanckriet
We introduce the Million Song Dataset Challenge: a large-scale, personalized music recommendation challenge, where the goal is to predict the songs that a user will listen to, given both the user's listening history and full information (including meta-data and content analysis) for all songs. We explain the taste profile data, our goals and design choices in creating the challenge, and present baseline results using simple, off-the-shelf recommendation algorithms.
Towards minimal test collections for evaluation of audio music similarity and retrieval BIBAFull-Text 917-924
  Julián Urbano; Markus Schedl
Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.
Combining usage and content in an online music recommendation system for music in the long-tail BIBAFull-Text 925-930
  Marcos Aurélio Domingues; Fabien Gouyon; Alípio Mário Jorge; José Paulo Leal; João Vinagre; Luís Lemos; Mohamed Sordo
In this paper we propose a hybrid music recommender system, which combines usage and content data. We describe an online evaluation experiment performed in real time on a commercial music web site specialised in content from the very long tail of music. We compare it against two stand-alone recommenders, the first based on usage data and the second on content data. The results show that the proposed hybrid recommender has advantages over the usage- and content-based systems, namely a higher user absolute acceptance rate, a higher user activity rate and higher user loyalty.
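One common way to hybridize two recommenders is a weighted fusion of their scores; the following is a minimal sketch of that idea, where the item names, score dictionaries and the fixed weight `alpha` are hypothetical, and the paper's actual combination scheme may differ.

```python
def hybrid_scores(usage_scores, content_scores, alpha=0.5):
    """Blend two recommenders' per-item scores; alpha weights usage vs. content."""
    items = set(usage_scores) | set(content_scores)
    return {i: alpha * usage_scores.get(i, 0.0)
               + (1 - alpha) * content_scores.get(i, 0.0)
            for i in items}

usage = {"song_a": 0.9, "song_b": 0.2}     # e.g. co-listening similarity
content = {"song_b": 0.8, "song_c": 0.6}   # e.g. audio-feature similarity
ranked = sorted(hybrid_scores(usage, content).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Such a fusion lets long-tail items with little usage data still be recommended on the strength of their content score alone.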
Adapting similarity on the MagnaTagATune database: effects of model and feature choices BIBAFull-Text 931-936
  Daniel Wolff; Tillman Weyde
Predicting users' tastes in music has become crucial for competitive music recommendation systems, and perceived similarity plays an influential role in this. MIR currently turns towards making recommendation systems adaptive to user preferences and context. Here, we consider the particular task of adapting music similarity measures to user voting data. This work builds on and responds to previous publications based on the MagnaTagATune dataset. We have reproduced the similarity dataset presented by Stober and Nürnberger at AMR 2011 to enable a comparison of approaches. On this dataset, we compare their two-level approach, which defines similarity measures on individual facets and combines them in a linear model, to the Metric Learning to Rank (MLR) algorithm, which adapts a similarity measure operating directly on low-level features to the user data. We compare the different algorithms, features and parameter spaces with regard to minimising constraint violations. Furthermore, the effectiveness of the MLR algorithm in generalising to unknown data is evaluated on this dataset. We also explore the effects of feature choice. Here, we find that the binary genre data shows little correlation with the similarity data, but combined with audio features it clearly improves generalisation.

MultiAPro'12 workshop 7

User profile integration made easy: model-driven extraction and transformation of social network schemas BIBAFull-Text 939-948
  Martin Wischenbart; Stefan Mitsch; Elisabeth Kapsammer; Angelika Kusel; Birgit Pröll; Werner Retschitzegger; Wieland Schwinger; Johannes Schönböck; Manuel Wimmer; Stephan Lechner
User profile integration from multiple social networks is indispensable for gaining a comprehensive view on users. Although current social networks provide access to user profile data via dedicated APIs, they fail to provide accurate schema information, which aggravates the integration of user profiles, and not least the adaptation of applications in the face of schema evolution. To alleviate these problems, this paper presents, firstly, a semi-automatic approach to extract schema information from instance data. Secondly, transformations of the derived schemas to different technical spaces are utilized, thereby allowing, amongst other benefits, the application of established integration tools and methods. Finally, as a case study, schemas are derived for Facebook, Google+, and LinkedIn. The resulting schemas are analyzed (i) for completeness and correctness according to the documentation, and (ii) for semantic overlaps and heterogeneities amongst each other, building the basis for future user profile integration.
Multi-application profile updates propagation: a semantic layer to improve mapping between applications BIBAFull-Text 949-958
  Nadia Bennani; Max Chevalier; Elöd Egyed-Zsigmond; Gilles Hubert; Marco Viviani
In the field of multi-application personalization, several techniques have been proposed to support user modeling for user data management across different applications. Many of them are based on data reconciliation techniques, often implying the concepts of static ontologies and generic user data models. None of them has sufficiently investigated two main issues related to user modeling: (1) profile definition, to allow every application to build its own view of users while promoting the sharing of these profiles, and (2) profile evolution over time, to avoid data inconsistency and the subsequent loss of income for web-site users and companies. In this paper, we propose separate solutions for each issue. We propose a flexible user modeling system that does not impose any fixed user model to which different applications must conform, but is instead based on the concept of mappings among applications (and mapping functions among their user attributes). We focus in particular on the management of user profile data propagation as a way to reduce the amount of inconsistent user profile information across several applications.
   A second goal of this paper is to illustrate, in this context, the benefit obtained by the integration of a Semantic Layer that can help application designers to automatically identify potential user attribute mappings between applications.
   This paper thus presents work in progress in which two complementary approaches are integrated towards a common goal: managing multi-application user profiles in a semi-automatic manner.
Personalised placement in networked video BIBAFull-Text 959-968
  Jeremy D. Foss; Benedita Malheiro; Juan-Carlos Burguillo
Personalised video can be achieved by inserting objects into a video play-out according to the viewer's profile. Content which has been authored and produced for general broadcast can take on additional commercial service features when personalised either for individual viewers or for groups of viewers participating in entertainment, training, gaming or informational activities. Although several scenarios and use-cases can be envisaged, we are focussed on the application of personalised product placement. Targeted advertising and product placement are currently garnering intense interest in the commercial networked media industries. Personalisation of product placement is a relevant and timely service for next generation online marketing and advertising and for many other revenue generating interactive services. This paper discusses the acquisition and insertion of media objects into a TV video play-out stream where the objects are determined by the profile of the viewer. The technology is based on MPEG-4 standards using object based video and MPEG-7 for metadata. No proprietary technology or protocol is proposed. To trade the objects into the video play-out, a Software-as-a-Service brokerage platform based on intelligent agent technology is adopted. Agencies, libraries and service providers are represented in a commercial negotiation to facilitate the contractual selection and usage of objects to be inserted into the video play-out.
A user profile modelling using social annotations: a survey BIBAFull-Text 969-976
  Manel Mezghani; Corinne Amel Zayani; Ikram Amous; Faiez Gargouri
As social networks grow in terms of users, resources, and interactions, users may become lost or unable to find useful information. Social elements such as annotations (tags), which are becoming more and more popular, can help avoid this disorientation. Representing a user based on these social annotations has shown its utility in building an accurate user profile that can be used for recommendation purposes. In this paper, we give a state of the art of the characteristics of the social user and of techniques that model and update a tag-based profile. We show how to treat social annotations and the utility of modelling tag-based profiles for recommendation purposes.
Towards an interoperable device profile containing rich user constraints BIBAFull-Text 977-986
  Cédric Dromzée; Sébastien Laborie; Philippe Roose
Currently, multimedia documents can be accessed at any time and anywhere with a wide variety of mobile devices, e.g., laptops, smartphones, tablets. Obviously, platform heterogeneity, users' preferences and context variations require documents to be adapted according to execution constraints, e.g., audio content may not be played while a user is participating in a meeting. Current context modeling languages do not handle such real-life user constraints. They generally list multiple information values that adaptation processes must interpret in order to deduce such high-level constraints implicitly. This paper overcomes this limitation by proposing a novel service-based context modeling approach in which context information is linked according to explicit high-level constraints. In order to validate our proposal, we have used Semantic Web technologies, specifying RDF profiles and experimenting with their usage on several platforms.

LSNA'12 workshop 8

From network mining to large scale business networks BIBAFull-Text 989-996
  Daniel Ritter
The vision of Large Scale Network Analysis (LSNA) rests on the large amounts of network data produced by social media applications like Facebook and Twitter, as well as by the domain of biological networks, together with their needs for network data extraction and analysis. This raises data management challenges that are addressed by the biological, data mining, and linked (web) data management communities. So far, mainly these domains have been considered when identifying research topics and measuring approaches and progress. We argue that an important domain has been neglected: Business Network Management (BNM), which covers business and (technical) integration data that is implicitly linked and available in enterprises. Not only do enterprises need visibility into their business networks, they need ad-hoc analysis capabilities on them. In this paper, we introduce BNM as a domain that comes with large-scale network data. We discuss how linked business data can be made explicit through what we call Network Mining (NM) from dynamic, heterogeneous enterprise environments, how it can be combined into a (cross-)enterprise linked business data network, and we describe its different facets with respect to large network analysis, highlighting challenges and opportunities.
Role-dynamics: fast mining of large dynamic networks BIBAFull-Text 997-1006
  Ryan Rossi; Brian Gallagher; Jennifer Neville; Keith Henderson
To understand the structural dynamics of a large-scale social, biological or technological network, it may be useful to discover behavioral roles representing the main connectivity patterns present over time. In this paper, we propose a scalable non-parametric approach to automatically learn the structural dynamics of the network and individual nodes. Roles may represent structural or behavioral patterns such as the center of a star, peripheral nodes, or bridge nodes that connect different communities. Our novel approach learns the appropriate structural role dynamics for any arbitrary network and tracks the changes over time. In particular, we uncover the specific global network dynamics and the local node dynamics of a technological, communication, and social network. We identify interesting node and network patterns such as stationary and non-stationary roles, spikes/steps in role-memberships (perhaps indicating anomalies), increasing/decreasing role trends, among many others. Our results indicate that the nodes in each of these networks have distinct connectivity patterns that are non-stationary and evolve considerably over time. Overall, the experiments demonstrate the effectiveness of our approach for fast mining and tracking of the dynamics in large networks. Furthermore, the dynamic structural representation provides a basis for building more sophisticated models and tools that are fast for exploring large dynamic networks.
A fast algorithm to find all high degree vertices in power law graphs BIBAFull-Text 1007-1016
  Colin Cooper; Tomasz Radzik; Yiannis Siantos
Sampling from large graphs is an area which is of great interest, particularly with the recent emergence of huge structures such as Online Social Networks. These often contain hundreds of millions of vertices and billions of edges. The large size of these networks makes it computationally expensive to obtain structural properties of the underlying graph by exhaustive search. If we can estimate these properties by taking small but representative samples from the network, then size is no longer a problem.
   In this paper we develop an analysis of random walks, a commonly used method of sampling from networks. We present a method of biasing the random walk to acquire a complete sample of high degree vertices of social networks, or similar graphs. The preferential attachment model is a common method to generate graphs with a power law degree sequence. For this model, we prove that this sampling method is successful with high probability. We also make experimental studies of the method on various real world networks.
   For t-vertex graphs G(t) generated by a preferential attachment process, we analyze a biased random walk which makes transitions along undirected edges {x,y} with probability proportional to [d(x)d(y)]^b, where d(x) is the degree of vertex x and b > 0 is a constant parameter. Let S(a) be the set of all vertices of degree at least t^a in G(t). We show that for some b ≈ 2/3, if the biased random walk starts at an arbitrary vertex of S(a), then with high probability the set S(a) can be discovered completely in Õ(t^(1-(4/3)a+δ)) steps, where δ is a very small positive constant and the Õ notation ignores poly-log t factors.
   The preferential attachment process generates graphs with power-law exponent 3, so the above example is a special case of the following result. For graphs with degree-sequence power-law exponent c > 2 generated by a generalized preferential attachment process, a random walk with transitions along undirected edges {x,y} proportional to (d(x)d(y))^((c-2)/2) discovers the set S(a) completely in Õ(t^(1-a(c-2)+δ)) steps with high probability. The cover time of the graph is Õ(t).
   Our results say that if we search preferential attachment graphs with a bias b = (c-2)/2 matched to the power-law exponent c then, (i) we can find all high degree vertices quickly, and (ii) the time to discover all vertices is not much higher than in the case of a simple random walk. We conduct experimental tests on generated networks and real-world networks, which confirm these two properties.
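The biased walk described above can be sketched as follows. The toy graph and step budget are illustrative assumptions; the sketch uses the fact that the factor d(x)^b is common to all moves out of x, so each neighbour y need only be weighted by d(y)^b.

```python
import random

def biased_walk(adj, degree, start, b, steps, seed=0):
    """Random walk where the move x -> y is chosen with probability
    proportional to (d(x) * d(y)) ** b; since d(x) is shared by all
    candidate moves from x, weighting neighbours by d(y) ** b suffices."""
    rng = random.Random(seed)
    seen = {start}
    x = start
    for _ in range(steps):
        nbrs = adj[x]
        weights = [degree[y] ** b for y in nbrs]
        x = rng.choices(nbrs, weights=weights)[0]
        seen.add(x)
    return seen

# Toy graph: hub vertex 0 has the highest degree, so the bias
# pulls the walk towards it.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
degree = {v: len(ns) for v, ns in adj.items()}
seen = biased_walk(adj, degree, start=3, b=2/3, steps=50)
print(seen)
```

On a real power-law graph one would check how quickly `seen` comes to contain all vertices of degree at least t^a.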
Harnessing user library statistics for research evaluation and knowledge domain visualization BIBAFull-Text 1017-1024
  Peter Kraker; Christian Körner; Kris Jack; Michael Granitzer
Social reference management systems provide a wealth of information that can be used for the analysis of science. In this paper, we examine whether user library statistics can produce meaningful results with regards to science evaluation and knowledge domain visualization. We are conducting two empirical studies, using a sample of library data from Mendeley, the world's largest social reference management system. Based on the occurrence of references in users' libraries, we perform a large-scale impact factor analysis and an exploratory co-readership analysis. Our preliminary findings indicate that the analysis of user library statistics can produce accurate, timely, and content-rich results. We find that there is a significant relationship between the impact factor and the occurrence of references in libraries. Using a knowledge domain visualization based on co-occurrence measures, we are able to identify two areas of topics within the emerging field of technology-enhanced learning.
MenuMiner: revealing the information architecture of large web sites by analyzing maximal cliques BIBAFull-Text 1025-1034
  Matthias Keller; Martin Nussbaumer
The foundation of almost all web sites' information architecture is a hierarchical content organization. Thus information architects put much effort in designing taxonomies that structure the content in a comprehensible and sound way. The taxonomies are obvious to human users from the site's system of main and sub menus. But current methods of web structure mining are not able to extract these central aspects of the information architecture. This is because they cannot interpret the visual encoding to recognize menus and their rank as humans do. In this paper we show that a web site's main navigation system can not only be distinguished by visual features but also by certain structural characteristics of the HTML tree and the web graph. We have developed a reliable and scalable solution that solves the problem of extracting menus for mining the information architecture. The novel MenuMiner-algorithm allows retrieving the original content organization of large-scale web sites. These data are very valuable for many applications, e.g. the presentation of search results. In an experiment we applied the method for finding site boundaries within a large domain. The evaluation showed that the method reliably delivers menus and site boundaries where other current approaches fail.
Large scale microblog mining using distributed MB-LDA BIBAFull-Text 1035-1042
  Chenyi Zhang; Jianling Sun
In the information explosion era, large-scale data processing and mining is a hot issue. As microblogging grows more popular, microblog services have become information providers on a web scale, so research on microblogs has begun to focus more on content mining than on the user relationship analysis that dominated before. Although traditional text mining methods have been studied extensively, no algorithm has been designed specifically for microblog data, which contain structured social-network information in addition to plain text. In this paper, we introduce a novel probabilistic generative model, MicroBlog-Latent Dirichlet Allocation (MB-LDA), which takes both contactor relevance relations and document relevance relations into consideration to improve topic mining in microblogs. Through Gibbs sampling for approximate inference, MB-LDA can discover not only the topics of microblogs, but also the topics focused on by contactors. When faced with large datasets, traditional techniques on a single node become impractical within limited resources, so we present a distributed MB-LDA in the MapReduce framework to process large-scale microblogs with high scalability. Furthermore, we apply a performance model to optimize the execution time by tuning the number of mappers and reducers. Experimental results on a real dataset show that MB-LDA outperforms the LDA baseline and that distributed MB-LDA offers an effective solution to topic mining for large-scale microblogs.
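As a rough illustration of the inference machinery involved, here is a collapsed-Gibbs sampler for plain LDA. This is a deliberate simplification: it omits MB-LDA's contactor and document relevance relations, and the toy corpus is invented.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA (a simplification of MB-LDA,
    which additionally conditions on contactor/document relevance)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]  # topic assignments
    ndk = [[0] * K for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]         # topic-word counts
    nk = [0] * K                                       # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                # remove the word's current assignment, resample, restore
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw

docs = [["apple", "banana", "apple"], ["cpu", "gpu", "cpu"],
        ["banana", "apple"], ["gpu", "cpu", "gpu"]]
ndk, nkw = lda_gibbs(docs, K=2)
print(ndk)
```

The distributed variant in the paper parallelizes essentially this loop across mappers, with reducers merging the count tables.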
k-Centralities: local approximations of global measures based on shortest paths BIBAFull-Text 1043-1050
  Jürgen Pfeffer; Kathleen M. Carley
Many centrality measures have been developed to analyze different aspects of importance. Some of the most popular ones (e.g. betweenness centrality, closeness centrality) are based on the calculation of shortest paths, a characteristic that limits their applicability to larger networks. In this article we elaborate on the idea of bounded-distance shortest-path calculations. We state criteria for k-centrality measures and introduce a single algorithm for calculating both betweenness- and closeness-based centralities. We also present normalizations for these measures. We show that k-centrality measures are good approximations of the corresponding centrality measures while achieving a tremendous gain in calculation time, with linear complexity O(n) for networks with constant average degree. This allows researchers to approximate shortest-path-based centrality measures for networks with millions of nodes, or at high frequency in dynamically changing networks.
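One way to realize bounded-distance betweenness is to truncate the BFS phase of Brandes' algorithm at depth k, so only shortest paths of length at most k contribute. This sketch illustrates that general idea and is not necessarily the paper's exact algorithm or normalization.

```python
from collections import deque

def k_betweenness(adj, k):
    """Brandes-style betweenness restricted to shortest paths of length <= k."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            if dist[v] == k:                        # truncate BFS at depth k
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}               # dependency accumulation
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Path graph 0-1-2-3-4: with k=3 the middle node scores highest.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
bc = k_betweenness(adj, k=3)
print(bc)
```

Because each source's BFS stops at depth k, the per-source cost depends only on the size of the k-neighbourhood, which is constant for constant average degree.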
Building a role search engine for social media BIBAFull-Text 1051-1060
  Vanesa Junquero-Trabado; David Dominguez-Sal
A social role is a set of characteristics that describe the behavior of individuals and the interactions between them within a social context. In this paper, we describe the architecture of a search engine for detecting roles in a social network. Our approach, based on indexed clusters, lets users define roles interactively during a search session and retrieve the users matching a role in milliseconds. We found that role selection strategies based on selecting people who deviate from average standards provide flexible query expressions and high-quality results.

SWCS'12 workshop 9

Wikidata: a new platform for collaborative data collection BIBAFull-Text 1063-1064
  Denny Vrandecic
This year, Wikimedia starts to build a new platform for the collaborative acquisition and maintenance of structured data: Wikidata. Wikidata's prime purpose is to be used within the other Wikimedia projects, like Wikipedia, to provide well-maintained, high-quality data. The nature and requirements of the Wikimedia projects require the development of a few novel, or at least unusual, features for Wikidata: Wikidata will be a secondary database, i.e. instead of containing facts it will contain references for facts. It will be fully internationalized. It will contain inconsistent and contradictory facts, in order to represent the diversity of knowledge about a given entity.
User assistance for collaborative knowledge construction BIBAFull-Text 1065-1074
  Pierre-Antoine Champin; Amélie Cordier; Elise Lavoué; Marie Lefevre; Hala Skaf-Molli
In this paper, we study tools for providing assistance to users in distributed spaces. More precisely, we focus on the activity of collaborative construction of knowledge, supported by a network of distributed semantic wikis. Assisting the users in such an activity is made necessary mainly by two factors: the inherent complexity of the tools supporting that activity, and the collaborative nature of the activity, involving many interactions between users. In this paper we focus on the second aspect. For this, we propose to build an assistance tool based on users interaction traces. This tool will provide a contextualized assistance by leveraging the valuable knowledge contained in traces. We discuss the issue of assistance in our context and we show the different types of assistance that we intend to provide through three scenarios. We highlight research questions raised by this preliminary study.
Knowledge continuous integration process (K-CIP) BIBAFull-Text 1075-1082
  Hala Skaf-Molli; Emmanuel Desmontils; Emmanuel Nauer; Gérôme Canals; Amélie Cordier; Marie Lefevre; Pascal Molli; Yannick Toussaint
Social semantic web creates read/write spaces where users and smart agents collaborate to produce knowledge readable by humans and machines. An important issue concerns ontology evolution and evaluation in man-machine collaboration: how can ontologies be changed in a social semantic space whose applications currently use them through queries? In this paper, we propose to implement a continuous knowledge integration process named K-CIP. We take advantage of man-machine collaboration to transform people's feedback into tests. This paper presents how K-CIP can be deployed to allow fruitful man-machine collaboration in the context of the Wikitaaable system.
Linking justifications in the collaborative semantic web applications BIBAFull-Text 1083-1090
  Rakebul Hasan; Fabien Gandon
Collaborative Semantic Web applications produce ever-changing interlinked Semantic Web data. Applications that utilize these data to obtain their results should provide explanations of how the results are obtained, in order to ensure the effectiveness and increase the user acceptance of these applications. Justifications providing meta-information about why a conclusion has been reached enable the generation of such explanations. We present an encoding approach for justifications in a distributed environment, focusing on collaborative platforms. We discuss the usefulness of linking justifications across the Web. We introduce a vocabulary for encoding justifications in a distributed environment and provide examples of our encoding approach.
Synchronizing semantic stores with commutative replicated data types BIBAFull-Text 1091-1096
  Luis Daniel Ibáñez; Hala Skaf-Molli; Pascal Molli; Olivier Corby
Social semantic web technologies have led to huge amounts of data and information becoming available. Producing knowledge from this information is challenging, and major efforts, like DBpedia, have been made to make it a reality. Linked data provides interconnections between these sources, extending the scope of knowledge production. Knowledge construction between decentralized sources on the web follows a co-evolution scheme, in which knowledge is generated collaboratively and continuously. Sources are also autonomous, meaning that they can use and publish only the information they want. Updating sources under these constraints is intimately related to the problem of synchronization and of consistency among all the replicas being managed. Recently, a new family of algorithms called Commutative Replicated Data Types (CRDTs) has emerged for ensuring eventual consistency in highly dynamic environments. In this paper, we define SU-Set, a CRDT for RDF-Graph that supports SPARQL Update 1.1 operations.
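SU-Set itself is not spelled out in this abstract, but the commutativity idea behind set CRDTs can be illustrated with a minimal observed-remove set; the class and element names below are hypothetical, not taken from the paper.

```python
import uuid

class ORSet:
    """Observed-remove set: each add creates a unique tag, and a remove
    deletes only the tags it has observed, so replicas converge no matter
    in which order they apply each other's operations."""
    def __init__(self):
        self.entries = set()      # live (element, tag) pairs
        self.tombstones = set()   # removed (element, tag) pairs

    def add(self, element):
        op = (element, uuid.uuid4().hex)
        self.apply_add(op)
        return op                 # broadcast to other replicas

    def remove(self, element):
        ops = {e for e in self.entries if e[0] == element}
        self.apply_remove(ops)
        return ops                # broadcast to other replicas

    def apply_add(self, op):
        if op not in self.tombstones:   # handles remove-before-add delivery
            self.entries.add(op)

    def apply_remove(self, ops):
        self.entries -= ops
        self.tombstones |= ops

    def value(self):
        return {element for element, _ in self.entries}

# Two replicas apply each other's operations in different orders and converge.
a, b = ORSet(), ORSet()
op1 = a.add("triple1")
b.apply_add(op1)
rm = b.remove("triple1")
op2 = a.add("triple2")
a.apply_remove(rm)
b.apply_add(op2)
print(a.value(), b.value())
```

An SU-Set for RDF would store triples as the elements and map SPARQL Update's INSERT/DELETE operations onto such commutative adds and removes.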
Building consensus via a semantic web collaborative space BIBAFull-Text 1097-1106
  George Anadiotis; Konstantinos Kafentzis; Iannis Pavlopoulos; Adam Westerski
In this paper we outline the design and implementation of the eDialogos Consensus process and platform to support wide-scale collaborative decision making. We present the design space and choices made and perform a conceptual alignment of the domains this space entails, based on the use of the eDialogos Consensus ontology as a crystallization point for platform design and implementation as well as interoperability with existing solutions. We also present a metric for calculating agreement on the issues under debate in the platform, incorporating argumentation structure and user feedback.
Improving Wikipedia with DBpedia BIBAFull-Text 1107-1112
  Diego Torres; Pascal Molli; Hala Skaf-Molli; Alicia Diaz
DBpedia is the semantic mirror of Wikipedia. DBpedia extracts information from Wikipedia and stores it in a semantic knowledge base. This semantic feature allows complex semantic queries, which can infer new relations that are missing in Wikipedia. This is an interesting source of knowledge for enriching Wikipedia content. But what is the best way to add these new relations following the Wikipedia conventions? In this paper, we propose a path indexing algorithm (PIA) that takes the result set of a DBpedia query and discovers the best representative path in Wikipedia. We evaluate the algorithm with real data sets from DBpedia.
Man-machine collaboration to acquire cooking adaptation knowledge for the TAAABLE case-based reasoning system BIBAFull-Text 1113-1120
  Amélie Cordier; Emmanuelle Gaillard; Emmanuel Nauer
This paper shows how humans and machines can better collaborate to acquire adaptation knowledge (AK) in the framework of a case-based reasoning (CBR) system whose knowledge is encoded in a semantic wiki. Automatic processes, such as the CBR reasoning process itself, or specific tools for acquiring AK, are integrated as wiki extensions. These tools and processes are purposely combined to collect AK. Users remain at the center of our approach, as in a classical wiki, but they now benefit from automatic tools helping them to feed the wiki. In particular, the CBR system, which is currently only a consumer of the knowledge encoded in the semantic wiki, will also be used to produce knowledge for the wiki. A use case in the cooking domain exemplifies this man-machine collaboration.
Community: issues, definitions, and operationalization on the web BIBAFull-Text 1121-1130
  Guo Zhang; Elin K. Jacob
This paper addresses the concepts of community and online community and discusses the physical, functional, and symbolic characteristics of a community that have formed the basis for traditional definitions. It applies a four-dimensional perspective of space and place (i.e., shape, structure, context, and experience) as a framework for refining the definition of traditional offline communities and for developing a definition of online communities that can be effectively operationalized. The methods and quantitative measures of social network analysis are proposed as appropriate tools for investigating the nature and function of communities because they can be used to quantify the typically subjective social phenomena generally associated with communities.

MSND'12 workshop 10

Business session "social media and news" BIBAFull-Text 1135-1136
  Jochen Spangenberg
The workshop also includes a business session that will focus on aspects of Social Media in the News domain. Panelists with expertise in innovation management, news provision, journalism and market developments will discuss some of the challenges of and opportunities for the news sector with regard to Social Media. This part of the workshop is organized and brought to you by the SocialSensor project.
Graph embedding on spheres and its application to visualization of information diffusion data BIBAFull-Text 1137-1144
  Kazumi Saito; Masahiro Kimura; Kouzou Ohara; Hiroshi Motoda
We address the problem of visualizing the structure of undirected graphs that have a value associated with each node into a K-dimensional Euclidean space, in such a way that 1) the length of each node's point vector equals the value assigned to that node and 2) connected nodes are placed as close as possible to each other while unconnected nodes are placed as far apart as possible. The problem is reduced to K-dimensional spherical embedding with a proper objective function. The existing spherical embedding method can handle only bipartite graphs and cannot be used for this purpose. Other graph embedding methods, e.g., multi-dimensional scaling and spring-force embedding, cannot handle the value constraint and thus are not applicable either. We propose a very efficient algorithm based on a power iteration that employs double-centering operations. We apply the method to visualize the information diffusion process over a social network by assigning each node's activation time as its value, and compare the results with other visualization methods. The results on four real-world networks indicate that the proposed method can visualize diffusion dynamics that the other methods cannot, and shows the role of important nodes, e.g., mediators, more naturally than the other methods.
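The abstract does not give the algorithm's details, but the core idea (a block power iteration over a double-centered adjacency matrix, with each node's coordinates rescaled so its vector length equals its assigned value) can be sketched as follows; `spherical_embed` and its parameters are hypothetical names, not the authors' implementation:

```python
import numpy as np

def spherical_embed(adj, values, k=2, iters=200, seed=0):
    """Illustrative sketch: embed nodes so that embedding norms equal
    `values` while connected nodes tend to lie close together."""
    n = adj.shape[0]
    # Double-centering: subtract row/column means so the leading
    # eigenvectors capture graph structure rather than a constant shift.
    h = np.eye(n) - np.ones((n, n)) / n
    b = h @ adj @ h
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, k))
    for _ in range(iters):
        # Orthogonal (block) power iteration toward the dominant
        # k-dimensional invariant subspace of the centered matrix.
        x, _ = np.linalg.qr(b @ x)
    # Rescale each node's coordinate vector to its assigned value,
    # enforcing the "vector length = node value" constraint exactly.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None) * values[:, None]
```

For information-diffusion visualization, `values` would hold each node's activation time, so early adopters sit near the origin and late ones on outer shells.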
A predictive model for the temporal dynamics of information diffusion in online social networks BIBAFull-Text 1145-1152
  Adrien Guille; Hakim Hacid
Today, online social networks have become powerful tools for the spread of information. They facilitate the rapid and large-scale propagation of content, and the consequences of a piece of information -- whether favorable to someone or not, true or false -- can then assume considerable proportions. It is therefore essential to provide means to analyze the phenomenon of information dissemination in such networks. Many recent studies have addressed the modeling of the information diffusion process, from a topological point of view and in a theoretical perspective, but we still know little about the factors involved in it. Under the assumption that the dynamics of the spreading process at the macroscopic level are explained by interactions at the microscopic level between pairs of users and by the topology of their interconnections, we propose a practical solution that aims to predict the temporal dynamics of diffusion in social networks. Our approach is based on machine learning techniques and on the inference of time-dependent diffusion probabilities from a multidimensional analysis of individual behaviors. Experimental results on a real dataset extracted from Twitter show the effectiveness of the proposed approach and yield interesting recommendations for future investigation.
Targeting online communities to maximise information diffusion BIBAFull-Text 1153-1160
  Václav Belák; Samantha Lam; Conor Hayes
In recent years, many companies have started to utilise online social communities as a means of communicating with and targeting their employees and customers. Such online communities include discussion fora driven by the conversational activity of users. For example, users may respond to certain ideas as a result of the influence of their neighbours in the underlying social network. We analyse such influence in order to target communities rather than individual actors, because information is usually shared with the community and not just with individual users. In this paper, we study information diffusion across communities and argue that some communities are more suitable for maximising spread than others. To this end, we develop a set of novel measures of cross-community influence and show that targeting with these measures outperforms other strategies on 51 weeks of data from the largest Irish online discussion system, Boards.ie.
Identifying communicator roles in Twitter BIBAFull-Text 1161-1168
  Ramine Tinati; Leslie Carr; Wendy Hall; Jonny Bentwood
Twitter has redefined the way social activities can be coordinated: it has been used for mobilizing people during natural disasters, for studying health epidemics and, recently, as a communication platform during social and political change. As a large-scale system, the volume of data transmitted per day presents Twitter users with a problem: how can valuable content be distilled from the back chatter, how can the providers of valuable information be promoted, and ultimately how can influential individuals be identified? To tackle this, we have developed a model based upon the Twitter message exchange which enables us to analyze conversations around specific topics and identify key players in a conversation. A working implementation of the model helps categorize Twitter users into specific roles based on their dynamic communication behavior rather than on an analysis of their static friendship network. This provides a method of identifying users who are potentially producers or distributors of valuable knowledge.
File diffusion in a dynamic peer-to-peer network BIBAFull-Text 1169-1172
  Alice Albano; Jean-Loup Guillaume; Bénédicte Le Grand
Many studies have been made on diffusion in the field of epidemiology, and in the last few years, the development of social networking has induced new types of diffusion. In this paper, we focus on file diffusion in a dynamic peer-to-peer network using the eDonkey protocol. On this network, we observe a linear behavior of the actual file diffusion. This result is interesting because most diffusion models exhibit exponential behaviors. We propose a new model of diffusion, based on the SI (Susceptible/Infected) model, which produces results close to the linear behavior of the observed diffusion. We then justify the linearity of this model and study its behavior in more detail.
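The SI model the authors build on is simple to simulate. The sketch below is a generic discrete-time SI process, not the paper's modified model; note that on a sparse, path-like topology even plain SI spreads linearly, which hints at how topology can yield the linear growth observed on eDonkey (`simulate_si` and its parameters are illustrative names):

```python
import random

def simulate_si(edges, seeds, beta=0.3, steps=30, rng=None):
    """Discrete-time SI diffusion: at each step, every infected node
    infects each still-susceptible neighbour with probability beta.
    Returns the infection curve (number infected after each step)."""
    rng = rng or random.Random(0)
    neigh = {}
    for u, v in edges:
        neigh.setdefault(u, set()).add(v)
        neigh.setdefault(v, set()).add(u)
    infected = set(seeds)
    curve = [len(infected)]
    for _ in range(steps):
        new = {v for u in infected for v in neigh.get(u, ())
               if v not in infected and rng.random() < beta}
        infected |= new
        curve.append(len(infected))
    return curve
```

On a path graph with beta = 1, the infection front advances one node per step, so the curve grows linearly rather than exponentially.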
Community cores in evolving networks BIBAFull-Text 1173-1180
  Massoud Seifi; Jean-Loup Guillaume
Community structure is a key property of complex networks. Many algorithms have been proposed to automatically detect communities in static networks, but few studies have considered the detection and tracking of communities in an evolving network. Tracking the evolution of a given community over time requires a clustering algorithm that produces stable clusters; however, most community detection algorithms are very unstable and therefore unusable for evolving networks. In this paper, we apply the methodology proposed in [seifi2012] to detect what we call community cores in evolving networks. We show that cores are much more stable than "classical" communities and that they overcome the disadvantages of stabilized methods.
Watch me playing, i am a professional: a first study on video game live streaming BIBAFull-Text 1181-1188
  Mehdi Kaytoue; Arlei Silva; Loïc Cerf; Wagner Meira, Jr.; Chedy Raïssi
"Electronic-sport" (E-Sport) is now established as a new entertainment genre. More and more players enjoy streaming their games, which attract even more viewers. In fact, in a recent social study, casual players were found to prefer watching professional gamers rather than playing the game themselves. Within this context, advertising provides a significant source of revenue to the professional players, the casters (displaying other people's games) and the game streaming platforms. For this paper, we crawled, during more than 100 days, the most popular among such specialized platforms: Twitch.tv. Thanks to these gigabytes of data, we propose a first characterization of a new Web community, and we show, among other results, that the number of viewers of a streaming session evolves in a predictable way, that audience peaks of a game are explainable and that a Condorcet method can be used to sensibly rank the streamers by popularity. Last but not least, we hope that this paper will bring to light the study of E-Sport and its growing community. They indeed deserve the attention of industrial partners (for the large amount of money involved) and researchers (for interesting problems in social network dynamics, personalized recommendation, sentiment analysis, etc.).
Supervised rank aggregation approach for link prediction in complex networks BIBAFull-Text 1189-1196
  Manisha Pujari; Rushed Kanawati
In this paper we propose a new topological approach for link prediction in dynamic complex networks. The proposed approach applies a supervised rank aggregation method, which functions as follows: first, we rank the unlinked node pairs in a network at instant t according to different topological measures (node characteristics aggregation, neighborhood-based measures, distance-based measures, etc.). Each measure provides its own ranking. Observing the network at instant t+1, where some new links appear, we weight each topological measure according to its performance in predicting these observed new links. These learned weights are then used in a modified version of classical computational social choice algorithms (such as Borda, Kemeny, etc.) to obtain a model for predicting new links. We show the effectiveness of this approach through experiments on co-authorship networks extracted from the DBLP bibliographic database. The results are also compared with the outcome of classical supervised machine-learning-based link prediction approaches applied to the same datasets.
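As a hedged illustration of the aggregation step, a weighted Borda count gives each candidate (e.g., an unlinked node pair) points proportional to its position in each measure's ranking, scaled by that measure's learned weight; `weighted_borda` is an illustrative name, and the paper's modified algorithms may differ in detail:

```python
from collections import defaultdict

def weighted_borda(rankings, weights):
    """Aggregate several rankings of the same candidates, one weight
    per ranking (weights would be learned from performance at t+1).
    A candidate in position p of an n-item ranking earns (n - 1 - p)
    Borda points, multiplied by that ranking's weight."""
    scores = defaultdict(float)
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            scores[cand] += w * (n - 1 - pos)
    # Highest aggregate score first: most likely future links on top.
    return sorted(scores, key=scores.get, reverse=True)
```

Here, a measure that predicted well on the training snapshot dominates the aggregate order, while uninformative measures contribute little.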
Predicting information diffusion on social networks with partial knowledge BIBAFull-Text 1197-1204
  Anis Najar; Ludovic Denoyer; Patrick Gallinari
Models of information diffusion and propagation over large social media usually rely on a Closed World Assumption: information can only propagate over the network's relational structure, it cannot come from external sources, and the network structure is assumed to be fully known by the model. These assumptions are unrealistic for many propagation processes extracted from social websites. We address the problem of predicting information propagation when the network diffusion structure is unknown, without making any closed world assumption. Instead of modeling a diffusion process, we propose to directly predict the final propagation state of the information over a whole user set. We describe a general model that learns to predict which users are the most likely to be contaminated by the information, given an initial state of the network. Different instances are proposed and evaluated on artificial datasets.
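A minimal sketch of this idea of predicting the final state directly, under the assumption of one logistic unit per user trained on past cascades (the paper's actual model instances may differ; all names here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_propagation_model(x_init, y_final, lr=0.5, epochs=3000):
    """Learn a weight matrix W so that sigmoid(x @ W) approximates the
    probability that each user ends up contaminated, given the initial
    state vector x. No graph structure is used, only observed cascades.
    x_init, y_final: (n_cascades, n_users) binary matrices."""
    n_users = x_init.shape[1]
    w = np.zeros((n_users, n_users))
    for _ in range(epochs):
        # Batch gradient descent on the per-user logistic loss.
        p = sigmoid(x_init @ w)
        w -= lr * x_init.T @ (p - y_final) / len(x_init)
    return w

def predict_final_state(w, x0, threshold=0.5):
    """Predict the final contaminated set from an initial state x0."""
    return sigmoid(x0 @ w) >= threshold
```

The model thus learns correlations such as "whenever user 0 starts a cascade, user 2 ends up contaminated" without ever seeing the diffusion links themselves.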
Collective attention and the dynamics of group deals BIBAFull-Text 1205-1212
  Mao Ye; Thomas Sandholm; Chunyan Wang; Christina Aperjis; Bernardo A. Huberman
We present a study of the group purchasing behavior of daily deals in Groupon and LivingSocial and formulate a predictive dynamic model of collective attention for group buying behavior. Using large data sets from both Groupon and LivingSocial we show how the model is able to predict the success of group deals as a function of time.
   We find that Groupon deals are easier to predict accurately earlier in the deal lifecycle than LivingSocial deals because the total number of deal purchases saturates more quickly. One possible explanation for this is that the incentive to socially propagate a deal is based on an individual threshold in LivingSocial, whereas in Groupon it is based on a collective threshold which is reached very early. Furthermore, the personal benefit of propagating a deal is greater in LivingSocial.
Social networking trends and dynamics detection via a cloud-based framework design BIBAFull-Text 1213-1220
  Athena Vakali; Maria Giatsoglou; Stefanos Antaris
Social networking media generate huge content streams, which both academia and developers strive to leverage into unbiased, powerful indications of users' opinions and interests. Here, we present Cloud4Trends, a framework for collecting and analyzing user-generated content from microblogging and blogging applications, both separately and jointly, focused on certain geographical areas, towards the identification of the most significant topics using trend analysis techniques. The cloud computing paradigm appears key to making such applications viable, since the massive volumes of data produced daily demand a scalable and powerful infrastructure. Cloud4Trends constitutes an efficient cloud-based approach to the online trend tracking problem based on Web 2.0 sources. A detailed system architecture model is also proposed, largely based on a set of service modules developed within the VENUS-C research project to facilitate the deployment of research applications on cloud infrastructures.
Effects of the recession on public mood in the UK BIBAFull-Text 1221-1226
  Thomas Lansdall-Welfare; Vasileios Lampos; Nello Cristianini
Large-scale analysis of social media content allows for real-time discovery of macro-scale patterns in public opinion and sentiment. In this paper we analyse a collection of 484 million tweets generated by more than 9.8 million users from the United Kingdom over the past 31 months, a period marked by economic downturn and some social tensions. Our findings, besides corroborating our choice of method for the detection of public mood, also present intriguing patterns that can be explained in terms of events and social changes. On the one hand, the time series we obtain show that periodic events such as Christmas and Halloween evoke similar mood patterns every year. On the other hand, we see that a significant increase in negative mood indicators coincides with the government's announcement of cuts to public spending, and that this effect still persists. We also detect events such as the riots of summer 2011, as well as a possible calming effect coinciding with the run-up to the royal wedding.
Improving news ranking by community tweets BIBAFull-Text 1227-1232
  Xin Shuai; Xiaozhong Liu; Johan Bollen
Users frequently express their information needs by means of short and general queries that are difficult for ranking algorithms to interpret correctly. However, users' social contexts can offer important additional information about their information needs which can be leveraged by ranking algorithms to provide augmented, personalized results. Existing methods mostly rely on users' individual behavioral data such as clickstream and log data, but as a result suffer from data sparsity and privacy issues. Here, we propose a Community Tweets Voting Model (CTVM) to re-rank Google and Yahoo news search results on the basis of open, large-scale Twitter community data. Experimental results show that CTVM outperforms baseline rankings from Google and Yahoo for certain online communities. We propose an application scenario of CTVM and provide an agenda for further research.
TwitterEcho: a distributed focused crawler to support open research with Twitter data BIBAFull-Text 1233-1240
  Matko Bošnjak; Eduardo Oliveira; José Martins; Eduarda Mendes Rodrigues; Luís Sarmento
Modern social network analysis relies on vast quantities of data to infer new knowledge about human relations and communication. In this paper we describe TwitterEcho, an open source Twitter crawler for supporting this kind of research, which is characterized by a modular distributed architecture. Our crawler enables researchers to continuously collect data from particular user communities, while respecting Twitter's imposed limits. We present the core modules of the crawling server, some of which were specifically designed to focus the crawl on the Portuguese Twittosphere. Additional modules can be easily implemented, thus changing the focus to a different community. Our evaluation of the system shows high crawling performance and coverage.
"Making sense of it all": an attempt to aid journalists in analysing and filtering user generated content BIBAFull-Text 1241-1246
  Sotiris Diplaris; Symeon Papadopoulos; Ioannis Kompatsiaris; Nicolaus Heise; Jochen Spangenberg; Nic Newman; Hakim Hacid
This position paper explores how journalists can embrace new ways of content provision and authoring by aggregating and analyzing content gathered from Social Media. Current challenges in the news media industry are reviewed and a new system for capturing emerging knowledge from Social Media is described. Novel features that assist professional journalists in processing vast amounts of Social Media information are presented, with reference to the technical requirements of the system. First implementation steps are also discussed, with a particular focus on event detection and user influence identification.