Companion Proceedings of the 2014 International Conference on the World Wide Web

Fullname: Companion Proceedings of the 23rd International Conference on World Wide Web
Editors: Chin-Wan Chung; Andrei Broder; Kyuseok Shim; Torsten Suel
Location: Seoul, Korea
Dates: 2014-Apr-07 to 2014-Apr-11
Standard No: ISBN: 978-1-4503-2745-9; hcibib: WWW14-2
  1. WWW 2014-04-07 Volume 2
    1. WWW 2014 PhD/doctoral presentations
    2. WWW 2014 demonstrations
    3. WWW 2014 tutorials
    4. WWW 2014 posters
    5. WWW 2014 websci track
    6. WWW 2014 developers' track
    7. WWW 2014 industry track
    8. Modeling social media: mining big data in social media and the web (MSM 2014)
    9. Public health in the digital age: social media, crowdsourcing and participatory systems (2nd PHDA 2014)
    10. Simplifying complex networks for practitioners 2014 workshop
    11. 2014 social news on the web workshop
    12. Social recommender systems (SRS2014) workshop
    13. Temporal web analytics workshop (TempWeb'14)
    14. Theory and practice of social machines 2014 workshop
    15. Vertical search relevance 2014 workshop
    16. Web APIs and RESTful design 2014 workshop
    17. Web intelligence and communities workshop (WI&C 2014)
    18. Web observatory workshop (WOW2014)
    19. Web-based education technologies workshop (WebET 2014)
    20. WebQuality 2014 workshop
    21. Big graph mining 2014 workshop
    22. 2014 big scholarly data: towards the web of scholars workshop
    23. Connecting online & offline life workshop (COOL 2014)
    24. Data extraction and object search 2014 workshop
    25. 2014 large scale network analysis workshop (LSNA'14)

WWW 2014-04-07 Volume 2

WWW 2014 PhD/doctoral presentations

Quality assurance in crowdsourcing via matrix factorization based task routing (pp. 3-8)
  Hyun Joon Jung
We investigate a method of crowdsourced task routing based on matrix factorization. Starting from a preliminary analysis of real crowdsourced data, we explore how to route crowdsourcing tasks via matrix factorization (MF), which efficiently estimates missing values in a worker-task matrix. Our preliminary results show the benefits of task routing over random assignment and the strength of probabilistic MF over baseline methods.
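As a rough illustration of the idea in this abstract, the following Python sketch fills in a sparse worker-task matrix with plain regularized matrix factorization trained by stochastic gradient descent, then routes a task to the worker with the highest predicted score. It is not the author's implementation: the function names, hyperparameters, toy data, and the choice of plain (rather than probabilistic) MF are all assumptions made for illustration.

```python
import random

def predict(workers, tasks, w, t):
    """Predicted score of worker w on task t (dot product of latent factors)."""
    return sum(a * b for a, b in zip(workers[w], tasks[t]))

def factorize(observed, n_workers, n_tasks, rank=2, lr=0.05, reg=0.01, epochs=1000, seed=0):
    """Fit worker and task factors to the observed cells by SGD."""
    rng = random.Random(seed)
    workers = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_workers)]
    tasks = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_tasks)]
    for _ in range(epochs):
        for (w, t), score in observed.items():
            err = score - predict(workers, tasks, w, t)
            for k in range(rank):
                w_k, t_k = workers[w][k], tasks[t][k]
                workers[w][k] += lr * (err * t_k - reg * w_k)
                tasks[t][k] += lr * (err * w_k - reg * t_k)
    return workers, tasks

def rmse(workers, tasks, observed):
    """Root-mean-square error over the observed cells only."""
    squared = [(score - predict(workers, tasks, w, t)) ** 2
               for (w, t), score in observed.items()]
    return (sum(squared) / len(squared)) ** 0.5

def route(workers, tasks, task, n_workers):
    """Pick the worker with the highest predicted score for the task."""
    return max(range(n_workers), key=lambda w: predict(workers, tasks, w, task))

# Hypothetical observations: (worker, task) -> 1.0 if the answer was correct, else 0.0.
observed = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 0.0, (1, 2): 0.0, (2, 1): 1.0, (2, 2): 1.0}
```

Calling `factorize(observed, 3, 3)` and then `route(workers, tasks, task=0, n_workers=3)` returns the index of the worker predicted to do best on task 0, including workers who have never attempted it.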
Spatial semantic search in location-based web services (pp. 9-14)
  Jeong-Hoon Park
As GPS-enabled mobile devices have advanced, location-based services (LBS) have become one of the most active subjects in Web-based services. Major Web-based services such as Google Picasa, Twitter, Facebook, and Flickr employ LBS as one of their main features. Consequently, users generate a large number of geotagged documents in these services. Recently, there have been studies on spatial keyword search, which aims to find a set of documents by evaluating spatial relevance and keyword relevance. It combines spatial search and keyword search, each of which has been studied for a long time.
   In this paper, we address the spatial semantic search problem, which is to find the top-k relevant sets of documents under spatial and semantic constraints. To devise an effective solution, we propose a hybrid index strategy, a ranking model and an efficient search algorithm. In addition, we present the current status of our research and discuss remaining challenges and future work.
Time-aware topic-based contextualization (pp. 15-20)
  Nam Khanh Tran
In the past, various studies have aimed at the capacity to perceive and comprehend language in articles or human communications. Recently, researchers have focused on higher semantic levels, that is, on what humans need in order to understand the contents of articles. While humans can smoothly interpret documents when they know the documents' context, they have difficulty when that context is lost or has changed. In this PhD proposal, we address three novel research questions: detecting uninterpretable pieces in documents, retrieving contextual information, and constructing compact context for the documents. We then propose approaches to these tasks and discuss related issues.
Entity linking on graph data (pp. 21-26)
  Minghe Yu
With the emergence of massive information networks, graph data have become ubiquitous in various applications. Although many graph processing problems have been studied recently, entity linking on graph data, which finds vertex pairs from two graphs that refer to the same entity, has not received enough attention from academia and industry. Two main research challenges arise in this problem. The first is how to determine whether two vertices refer to the same entity, which is rather hard for graph data, especially uncertain data, e.g., social networks. The second is to link the vertices efficiently. As existing graph data are rather large, it is very important to devise efficient algorithms to achieve high performance. To address these challenges, in this paper we propose a similarity-based method that takes the vertex pairs with similarity larger than a given threshold as linked entities. We extend existing textual similarity and structural similarity to evaluate similarity between vertices from different graphs. To achieve high quality, we also combine them into a hybrid similarity. We also discuss new algorithms to link entities efficiently. We conduct experimental studies on real datasets, and the results show that our hybrid method achieves high performance and outperforms the baseline approaches.
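The threshold-based linking described in this abstract can be sketched in Python as follows. The abstract does not specify the similarity functions, so this sketch assumes Jaccard overlap of label tokens as the textual similarity, Jaccard overlap of neighbor names as the structural similarity, and a weighted average (`alpha`) as the hybrid; the vertex layout and all names are hypothetical. The all-pairs loop is deliberately naive and does not reflect the efficient algorithms the paper discusses.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def hybrid_similarity(v1, v2, alpha=0.5):
    """Weighted mix of textual (label tokens) and structural (neighbors) similarity."""
    textual = jaccard(v1["label"].lower().split(), v2["label"].lower().split())
    structural = jaccard(v1["neighbors"], v2["neighbors"])
    return alpha * textual + (1 - alpha) * structural

def link_entities(graph_a, graph_b, threshold=0.6, alpha=0.5):
    """Return (id_a, id_b, score) for every cross-graph pair scoring above the threshold."""
    links = []
    for id_a, v_a in graph_a.items():
        for id_b, v_b in graph_b.items():
            score = hybrid_similarity(v_a, v_b, alpha)
            if score > threshold:
                links.append((id_a, id_b, score))
    return links

# Hypothetical toy graphs: vertex id -> label and set of neighbor names.
graph_a = {
    "a1": {"label": "Jane Doe", "neighbors": {"acme", "bob"}},
    "a2": {"label": "John Smith", "neighbors": {"carol"}},
}
graph_b = {
    "b1": {"label": "jane doe", "neighbors": {"acme", "bob", "dave"}},
    "b2": {"label": "Alice Jones", "neighbors": {"erin"}},
}
links = link_entities(graph_a, graph_b)
```

Here only the pair `a1`/`b1` is linked: its labels match exactly and two of its three distinct neighbors are shared, giving a hybrid score of 5/6.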
Understanding, leveraging and improving human navigation on the web (pp. 27-32)
  Philipp Singer
Navigating websites represents a fundamental activity of users on the Web. Modeling this activity, i.e., understanding how predictable human navigation is and whether regularities can be detected, has been of interest to researchers for nearly two decades. This is crucial for improving the Web experience of users by, e.g., enhancing interfaces or information network structures. This thesis aims to shed light on human navigational patterns by trying to understand, leverage and improve human navigation on the Web. One main goal of this thesis is the construction of a versatile framework for modeling human navigational data with Markov chains and for detecting the appropriate Markov chain order using several advanced inference methods. It allows us to investigate memory and structure in human navigation patterns. Furthermore, we are interested in whether pragmatic human navigational data can be leveraged, e.g., for calculating semantic relatedness between concepts. Finally, we want to find ways of enhancing human navigation models. Concretely, we plan to incorporate prior knowledge about the semantic relatedness between concepts into our Markov chain models, as it is known that humans navigate the Web intuitively instead of randomly. Our experiments will be conducted on a variety of distinct navigational data, including both goal-oriented and free-form navigation scenarios. We not only look at navigational paths over websites, but also abstract away to navigational paths over topics in order to gain insights into cognitive patterns.
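To make the Markov chain modeling in this abstract concrete, here is a minimal Python sketch that estimates transition probabilities of a chain of a given order from navigation paths by maximum likelihood. It only fits a model of a fixed order; the order-selection inference methods the thesis mentions are not shown. The function name and the toy click paths are assumptions.

```python
from collections import Counter, defaultdict

def fit_markov_chain(paths, order=1):
    """Maximum-likelihood transition probabilities for a Markov chain of the given order.

    A state is the tuple of the last `order` pages visited.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(order, len(path)):
            state = tuple(path[i - order:i])
            counts[state][path[i]] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

# Hypothetical navigation paths (sequences of visited pages).
paths = [
    ["home", "news", "sports"],
    ["home", "news", "tech"],
    ["home", "about"],
]
first_order = fit_markov_chain(paths, order=1)
second_order = fit_markov_chain(paths, order=2)
```

With these paths, `first_order[("home",)]` assigns probability 2/3 to `news` and 1/3 to `about`, while `second_order[("home", "news")]` splits evenly between `sports` and `tech`.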
Entity-centric summarization: generating text summaries for graph snippets (pp. 33-38)
  Shruti Chhabra
In recent times, the focus of the information retrieval community has shifted from traditional keyword-based retrieval to techniques utilizing the semantics in the text. Since such techniques require an understanding of relationships between entities, efforts are ongoing to organize the Web into large entity-relationship graphs. These graphs can be leveraged to answer complex relationship queries. However, most of the research has focused on extracting structural information between entities, such as a path, Steiner tree, or subgraph. Little attention has been paid to the comprehension of these structural results, which is necessary for the user to understand the relationships encapsulated in these structures. In this doctoral proposal, we pursue the idea of entity-centric summarization and propose a novel framework to produce entity-centric summaries that describe the relationships among input entities. We discuss the inherent challenges associated with each module in the framework and present an evaluation plan. Results from our preliminary experiments are encouraging and substantiate the feasibility of the summarization problem.
Enhancing web activities with information visualization (pp. 39-44)
  Eduardo Graells-Garrido
Many activities people perform on the Web are biased, including activities like reading news, searching for information and connecting with people. Sometimes these biases are inherent in social behavior (like homophily), and sometimes they are external as they affect the system (like media bias). In this thesis proposal, we describe our approach to use information visualization to enhance Web activities performed by regular people (i.e., non-experts). We understand enhancing as reducing bias effects and generating an engaging response from users. Our methodology is based on case studies. We select a Web activity, identify the biases that affect it, and evaluate how the biases affect a population from online social networks using web mining techniques; we then design a visualization following an interactive and playful design approach to diminish the previously identified biases. We propose to evaluate the effect of our visualization designs in user studies by comparing them with state-of-the-art techniques considering a playful experiences framework.
Dynamic communities formation through semantic tags (pp. 45-50)
  Saman Kamran
Taggers in social tagging systems play the main role in giving identities to objects. Tagged objects also represent their taggers' perception of them and can in turn define the identities of their taggers. Consequently, the identities assigned to objects and taggers affect the quality of their categorization and of the communities formed around them. Tags in social semantic tagging systems have formal definitions because they are mapped to concepts defined in ontologies. Semantic tags can not only improve the quality of tag assignments by solving some common tag ambiguity problems of classic folksonomy systems (in particular, polysemy and synonymy), but also provide metadata on top of the social relations, based on the contributions of taggers around semantic tags. These metadata may be exploited to form dynamic communities, which address the lack of a commonly agreed and evolving meaning of tags in social semantic tagging systems. This paper proposes an approach to form dynamic communities of related taggers around the tagged objects. Because our perceptions in each specific area of knowledge evolve over time, the goal of our approach is also to evolve the represented knowledge in semantic tagging systems dynamically according to the latest perception of the related users.
Strategic foundation of computational social science (pp. 51-56)
  Yunkyu Sohn
For decades, scholars of various disciplines have fretted over strategic interactions, presenting theoretical insights and empirical observations. Despite the central role played by strategic interactions in creating value in the Internet environment, our ability to understand them scientifically and to manage them in practice has remained limited. While engineering communities suffer from not having enough theoretical resources to formalize such phenomena, economics and the social sciences lack adequate technology to properly operationalize their theoretical insights, thereby demanding an integrative solution. This project aims to develop a rational-choice-theory-driven framework for computational social science, focusing on social interactions on the Internet. We present general approaches and a few examples of ongoing research that suggest theoretical foundations, validate the predictions in a controlled environment, and verify the results on actual platforms.
Fine-grained data partitioning framework for distributed database systems (pp. 57-62)
  Ning Xu
With the increasing size of web data and the widely adopted parallel cloud computing paradigm, distributed databases and other distributed data processing systems, for example Pregel and GraphLab, use data partitioning technology to divide large data sets. By default, these systems use hash partitioning to randomly assign data to partitions, which leads to huge network traffic between partitions. Fine-grained partitioning can better allocate data and minimize the number of nodes involved in a transaction or job while also balancing the workload across data nodes. In this paper, we propose a novel prototype system, Lute, to provide a highly efficient fine-grained partitioning scheme for these distributed systems. Lute maintains a lookup table for each partitioned data set that maps a key to a set of partition ID(s). We use a novel lookup table technology that makes reading and writing the lookup table cheap. Lute also provides transaction support and highly concurrent writes with Multi-Version Concurrency Control (MVCC).
   We implemented a prototype distributed DBMS on PostgreSQL and used Lute as middleware to provide fine-grained partitioning support. Extensive experiments conducted on a cluster demonstrate the advantage of the proposed approach. The evaluation results show that, in comparison with other state-of-the-art lookup table solutions, our approach can significantly improve throughput, by about 20% to 70% on the TPC-C benchmark.
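The contrast between hash partitioning and a lookup table that maps each key to a set of partition IDs can be sketched in Python as below. This is not Lute's design: the class, its methods, the CRC32 hash fallback for unmapped keys, and the sample keys are assumptions, and transactions and MVCC are left out entirely.

```python
import zlib

class LookupTablePartitioner:
    """Fine-grained placement: an explicit key -> partition-ID-set table,
    with hash partitioning as an assumed fallback for unmapped keys."""

    def __init__(self, n_partitions):
        self.n_partitions = n_partitions
        self.table = {}  # key -> set of partition IDs

    def assign(self, key, partition_ids):
        """Pin a key to one or more partitions (more than one means replication)."""
        self.table[key] = set(partition_ids)

    def locate(self, key):
        """Partitions holding the key; falls back to a hash of the key."""
        if key in self.table:
            return set(self.table[key])
        return {zlib.crc32(str(key).encode("utf-8")) % self.n_partitions}

    def partitions_for(self, keys):
        """All partitions a transaction touching these keys must contact."""
        touched = set()
        for key in keys:
            touched |= self.locate(key)
        return touched

# Hypothetical placement: co-locate a customer with one of its orders.
partitioner = LookupTablePartitioner(n_partitions=4)
partitioner.assign("customer:1", {2})
partitioner.assign("order:9", {2})
```

A transaction that touches `customer:1` and `order:9` now involves a single partition, whereas with pure hash placement the two keys could land on different nodes.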
Systematic SLA data management (pp. 63-68)
  Katerina Stamou
The cloud computing paradigm emerged with service-oriented principles. In the cloud setting, organizations outsource their IT equipment and manage their business processes through virtual services that are typically exchanged over HTTP. Service Level Agreements (SLAs) depict the status of running services. SLAs represent operational contracts that allow providers to estimate their service availability according to their resource capacity. The SLA data schema and content are operationally defined by the type, volume and relations of service elements that organizations operate on their physical resources. The current lack of uniform SLA standardization leads to semantic and operational differences between SLAs that are produced and consumed by different organizations. Such differences prohibit common business SLA practices in the cloud computing domain. Our research introduces systematic SLA data management to describe the formalization, storage and processing of SLAs over distributed computing environments. Services in scope are framed within the cloud computing context.

WWW 2014 demonstrations

LASER: a living analytics experimentation system for large-scale online controlled experiments (pp. 71-74)
  Kwan Hui Lim; Ee-Peng Lim; Palakorn Achananuparp; Adrian Vu; Agus Trisnajaya Kwee; Feida Zhu
Tracking user browsing data and measuring the effectiveness of website design and web services are important to businesses that want to attract today's consumers, who spend much more time online than before. Instead of using randomized controlled experiments, the existing approach simply tracks user browsing behaviors before and after a change is made to website design or web services and evaluates the differences. To address the effects caused by hidden factors (e.g., promotion activities on the website) and to give a fair comparison of different website designs, we propose the LASER system, a unified experimentation platform that enables randomized online controlled experiments to be easily conducted with minimal human effort and modifications to the experimented websites. More importantly, the LASER system manages the various aspects of online controlled experiments, namely the selection of participants into groups, exposure of different user interface features or recommendation algorithms to these groups, measuring their responses, and summarizing the results in a visual manner.
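Two of the steps listed in this abstract, placing participants into groups and summarizing their responses, can be illustrated with a short Python sketch. Hashing the user ID together with an experiment name is a common way to get stable, pseudo-random assignment; it is an assumption here, not a description of how LASER assigns participants. The function names and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict

def assign_group(user_id, experiment, groups=("control", "treatment")):
    """Deterministically place a user in a group by hashing (experiment, user)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return groups[int(digest, 16) % len(groups)]

def mean_response(observations):
    """Average response per group from (group, value) pairs."""
    by_group = defaultdict(list)
    for group, value in observations:
        by_group[group].append(value)
    return {group: sum(values) / len(values) for group, values in by_group.items()}

# Hypothetical responses: 1 if the user clicked, 0 otherwise.
observations = [("control", 1), ("control", 0), ("treatment", 1), ("treatment", 1)]
summary = mean_response(observations)
```

Because the assignment depends only on the experiment name and the user ID, a returning user always sees the same variant, and different experiments split the population independently.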
iFeel: a system that compares and combines sentiment analysis methods (pp. 75-78)
  Matheus Araújo; Pollyanna Gonçalves; Meeyoung Cha; Fabrício Benevenuto
Sentiment analysis methods are used to detect polarity in thoughts and opinions of users in online social media. As businesses and companies are interested in knowing how social media users perceive their brands, sentiment analysis can help better evaluate their product and advertisement campaigns. In this paper, we present iFeel, a Web application that allows one to detect sentiments in any form of text including unstructured social media data. iFeel is free and gives access to seven existing sentiment analysis methods: SentiWordNet, Emoticons, PANAS-t, SASA, Happiness Index, SenticNet, and SentiStrength. With iFeel, users can also combine these methods and create a new Combined-Method that achieves high coverage and F-measure. iFeel provides a single platform to compare the strengths and weaknesses of various sentiment analysis methods with a user friendly interface such as file uploading, graphical visualizing, and weight tuning.
Infrastructure support for evaluation as a service (pp. 79-82)
  Jimmy Lin; Miles Efron
How do we conduct large-scale community-wide evaluations for information retrieval if we are unable to distribute the document collection? This was the challenge we faced in organizing a task on searching tweets at the Text Retrieval Conference (TREC), since Twitter's terms of service forbid redistribution of tweets. Our solution, which we call "evaluation as a service", was to provide an API through which the collection can be accessed for completing the evaluation task. This paper describes the infrastructure underlying the service and its deployment at TREC 2013. We discuss the merits of the approach and potential applicability to other evaluation scenarios.
Why not, WINE? (pp. 83-86)
  Sourav S. Bhowmick; Aixin Sun; Ba Quan Truong
Despite considerable progress in recent years on Tag-based Social Image Retrieval (TagIR), state-of-the-art TagIR systems fail to provide a systematic framework for end users to ask why certain images are not in the result set of a given query and to obtain an explanation for such missing results. However, such why-not questions are natural when expected images are missing from the query results returned by a TagIR system. In this demonstration, we present a system called WINE (Why-not questIon aNswering Engine), which takes the first step to systematically answer the why-not questions posed by end users on TagIR systems. It is based on three explanation models, namely result reordering, query relaxation, and query substitution, that enable us to explain a variety of why-not questions. Our answer includes not only the reason why desired images are missing from the results but also suggestions on how the search query can be altered so that the user can view these missing images in sufficient number.
CrowdFill: a system for collecting structured data from the crowd (pp. 87-90)
  Hyunjung Park; Jennifer Widom
CrowdFill is a system for collecting structured data from the crowd. Unlike a typical microtask-based approach, CrowdFill shows an entire partially-filled table to all participating workers; workers collaboratively complete the table by filling in empty cells, as well as upvoting and downvoting data entered by other workers, using CrowdFill's intuitive data entry interface. CrowdFill ensures data entry is leading to a final table that satisfies prespecified constraints, and its compensation scheme encourages workers to submit useful, high-quality work. We demonstrate how CrowdFill collects structured data from the crowd, from the perspective of a user as well as from the perspective of workers.
Online behavioral genome sequencing from usage logs: decoding the search behaviors (pp. 91-94)
  Yang Song; Weiwei Cui; Shixia Liu; Kuansan Wang
We present a system that analyzes user interests based on online behaviors recorded in large-scale usage logs. We surmise that user interests can be characterized by a large collection of features we call the behavioral genes that can be deduced from both their explicit and implicit online behaviors. It is the goal of this research to sequence the entire behavioral genome for the online population, namely, to identify the pertinent behavioral genes and uncover their relationships in explaining and predicting user behaviors, so that high-quality user profiles can be created and online services can be better customized using these profiles. Within the scope of this paper, we demonstrate the work using the partial genome derived from web search logs. Our demo system is supported by an open access web service we are releasing and sharing with the research community. The main functions of the web service are: (1) calculating query similarities based on their lexical, temporal and semantic scores, (2) clustering a group of user queries into tasks with the same search and browse intent, and (3) inferring user topical interests by providing a probability distribution over a search taxonomy.
Easy access to the Freebase dataset (pp. 95-98)
  Hannah Bast; Florian Bäurle; Björn Buchhold; Elmar Haußmann
We demonstrate a system for fast and intuitive exploration of the Freebase dataset. This required solving several non-trivial problems, including: entity scores for proper ranking and name disambiguation, a unique meaningful name for every entity and every type, extraction of canonical binary relations from multi-way relations (which in Freebase are modeled via so-called mediator objects), computing the transitive hull of selected relations, and identifying and merging duplicates. Our contribution is two-fold. First, we provide for download an up-to-date version of the Freebase data, enriched and simplified as just sketched. Second, we offer a user interface for exploring and searching this data set. The data set, the user interface and a demo video are available from http://freebase-easy.cs.uni-freiburg.de.
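One of the preprocessing steps named in this abstract, computing the transitive hull of a relation, can be sketched in Python with a graph traversal per source entity. The function name and the sample containment pairs are assumptions for illustration; the abstract does not describe how the authors compute the hull or for which relations.

```python
def transitive_hull(pairs):
    """Return every (a, c) such that c is reachable from a via one or more pairs."""
    successors = {}
    for a, b in pairs:
        successors.setdefault(a, set()).add(b)
    hull = set()
    for start in successors:
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            hull.add((start, node))
            stack.extend(successors.get(node, ()))
    return hull

# Hypothetical "contained by" facts.
contained_by = {("Freiburg", "Germany"), ("Germany", "Europe")}
hull = transitive_hull(contained_by)
```

The hull adds the implied fact `("Freiburg", "Europe")` to the two stated ones, so a query for everything contained in Europe can be answered with a single lookup.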
Collaborative adaptive case management with linked data (pp. 99-102)
  Sebastian Heil; Stefan Wild; Martin Gaedke
An increasing share of today's work is knowledge work. Adaptive Case Management (ACM) assists knowledge workers in handling this collaborative, emergent and unpredictable type of work. Finding suitable workers for specific functions still relies on manual assessment and assignment by persons in charge, which does not scale well. In this paper we discuss a tool for ACM to facilitate this expert finding leveraging existing Web technology. We propose a method to automatically recommend a set of eligible workers utilizing linked data, enriched user profile data from distributed social networks and information gathered from case descriptions. This semantic recommendation method detects similarities between case requirements and worker profiles. The algorithm traverses distributed social graphs to retrieve a ranked list of suitable contributors to a case according to adaptable metrics. For this purpose, we introduce a vocabulary to specify case requirements and a vocabulary to describe skill sets and personal attributes of workers. The semantic recommendation method is demonstrated by a prototypical implementation using a WebID-based distributed social network.
EVIN: building a knowledge base of events (pp. 103-106)
  Erdal Kuzey; Gerhard Weikum
We present EVIN: a system that extracts named events from news articles, reconciles them into canonicalized events, and organizes them into semantic classes to populate a knowledge base. EVIN exploits different kinds of similarity measures among news, referring to textual contents, entity occurrences, and temporal ordering. These similarities are captured in a multi-view attributed graph. To distill canonicalized events, EVIN coarsens the graph by iterative merging based on a judiciously designed loss function. To infer semantic classes of events, EVIN uses statistical language models. EVIN provides a GUI that allows users to query the constructed knowledge base of events, and to explore it in a visual manner.
Event registry: learning about world events from news (pp. 107-110)
  Gregor Leban; Blaz Fortuna; Janez Brank; Marko Grobelnik
Event Registry is a system that can analyze news articles and identify the world events mentioned in them. The system is able to identify groups of articles that describe the same event. It can identify groups of articles in different languages that describe the same event and represent them as a single event. From the articles in each event it can then extract the event's core information, such as event location, date, who is involved and what it is about. Extracted information is stored in a database. A user interface is available that allows users to search for events using extensive search options, to visualize and aggregate the search results, to inspect individual events and to identify related events.
Enhancing media enrichment by semantic extraction (pp. 111-114)
  Michael Krug; Fabian Wiedemann; Martin Gaedke
The opportunities of the Internet combined with new devices and technologies change the end users' habits in media consumption. While end users often search for related information to the currently watched TV show by themselves, we propose to improve this user experience by automatically enriching media using semantic extraction. In our recent work we focused on how to apply media enrichment to distributed screens. Based on the findings we made from our recent prototype we identify several problems and describe how we deal with them. We illustrate a way to achieve cross-platform real-time synchronization using several transport protocols. We propose the usage of sessions to handle multi-user, multi-screen scenarios and introduce techniques for new interaction and customization patterns. We extend our recent approach with the extraction of keywords from given subtitles by utilizing statistical algorithms and natural language processing technologies, which are then used to discover and display related content from the Web. The prototype presented in this paper reflects the improvements of our work. We discuss next research steps and define challenges for further research.
Databugger: a test-driven framework for debugging the web of data (pp. 115-118)
  Dimitris Kontokostas; Patrick Westphal; Sören Auer; Sebastian Hellmann; Jens Lehmann; Roland Cornelissen
Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowd-sourced or extracted data of often relatively low quality. We present Databugger, a framework for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. Databugger ensures a basic level of quality by accompanying vocabularies, ontologies and knowledge bases with a number of test cases. The formalization behind the tool employs SPARQL query templates, which are instantiated into concrete quality test queries. The test queries can be instantiated automatically based on a vocabulary or manually based on the data semantics. One of the main advantages of our approach is that domain-specific semantics can be encoded in the data quality test cases, making it possible to discover data quality problems beyond conventional quality heuristics.
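The step of instantiating a SPARQL query template into a concrete test query can be illustrated with Python's standard `string.Template`. The template below, its placeholder names, and the example property URI are hypothetical and are not taken from Databugger's own pattern library; the sketch also assumes that a test passes when the instantiated query returns no rows.

```python
from string import Template

# Hypothetical range-check template: select resources whose value for a
# property falls outside the allowed interval.
RANGE_CHECK = Template(
    "SELECT ?s WHERE { ?s <$property> ?value . "
    "FILTER (?value < $low || ?value > $high) }"
)

def instantiate(template, **bindings):
    """Fill every placeholder; raises KeyError if a binding is missing."""
    return template.substitute(**bindings)

query = instantiate(
    RANGE_CHECK,
    property="http://example.org/ontology/height",
    low=0,
    high=3,
)
```

The same template can be bound to any numeric property, either by hand or by iterating over the properties a vocabulary declares, which mirrors the manual and automatic instantiation the abstract mentions.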
Semantic mashup with the online IDE WikiNEXT (pp. 119-122)
  Pavel Arapov; Michel Buffa; Amel Ben Othmane
The proposed demonstration queries DBPedia.org, gets the results and uses them to populate wiki pages with semantic annotations using RDFaLite. These annotations are persisted in an RDF store, and we will show how this data can be reused by other applications, e.g., for a semantic mashup that displays all collected metadata about cities on a single map page. It has been developed using WikiNEXT, a mix between a semantic wiki and a web-based IDE. The tool is online and open source; screencasts are available on YouTube (look for "WikiNext").
SemFacet: semantic faceted search over Yago (pp. 123-126)
  Marcelo Arenas; Bernardo Cuenca Grau; Evgeny Kharlamov; Sarunas Marciuska; Dmitriy Zheleznyakov; Ernesto Jimenez-Ruiz
In this paper we demonstrate SemFacet, a proof-of-concept prototype for our semantic faceted search approach. SemFacet is implemented on top of the Yago knowledge base, powered by the OWL 2 RL triple store RDFox and the full-text search engine Lucene. SemFacet has provided very encouraging results. Via logical reasoning, SemFacet can automatically (i) extract facets, and (ii) update the faceted query interface with facets relevant for the current stage of the user's query construction session. SemFacet supports faceted queries that are much more expressive than those of traditional faceted search applications; in particular, SemFacet allows users to (i) relate several collections of documents, and (ii) change the focus of queries (and, thus, SemFacet provides control over the documents in the query output to be displayed on the screen). Our approach is fully declarative: the same backend implementation can be used to power faceted search over any application, provided that metadata and knowledge are represented in RDF and OWL 2.
A demonstration of query-oriented distribution and replication techniques for dynamic graph data (pp. 127-130)
  Alan G. Labouseur; Paul W. Olsen; Kyuseo Park; Jeong-Hyon Hwang
Evolving networks can be modeled as series of graphs that represent those networks at different points in time. Our G* system enables efficient storage and querying of these graph snapshots by taking advantage of their commonalities. In extending G* for scalable and robust operation, we found the classic challenges of data distribution and replication to be imbued with renewed significance. If multiple graph snapshots are commonly queried together, traditional techniques that distribute data over all servers or create identical data replicas result in inefficient query execution. We propose to verify, using live demonstrations, the benefits of our graph snapshot distribution and replication techniques. Our distribution technique adjusts the set of servers storing each graph snapshot in a manner optimized for popular queries. Our replication technique maintains each snapshot replica on a different number of servers, making available the most efficient replica configurations for different types of queries.
YouTube4Two: socializing video platform for two co-present people (pp. 131-134)
  Alessio Bellino; Giorgio De Michelis; Flavio De Paoli
YouTube4Two is an application that exploits the YouTube media library (through its API) to demonstrate a new style of social interaction. Two co-present people can share a video and act autonomously to navigate the related-video and comment lists, and search for videos. The novelty is that they can use their own smartphones, connected via the Internet, to control the shared application. The application has been designed according to the responsive-web-design (RWD) principle to pass smoothly from a desktop interface (controlled by mouse and keyboard) to a smartphone interface (with touch control). YouTube4Two introduces the multi-device responsive Web design (MD-RWD) style, which extends the RWD style by separating displayed content (on a shared screen) from displayed control commands (on personal smartphones) to support shared control over an application.
Cross domain communication in the web of things: a new context for the old problem BIBAFull-Text 135-138
  Nam Giang; Minkeun Ha; Daeyoung Kim
Cross domain communication has long been discussed in the field of web-based applications, especially for mashups in which a single web app combines resources from different locations. This issue becomes more important in the Web of Things context, where every physical resource is exposed to the Web and mashed up by other web applications. In this paper we demonstrate a use case in which cross domain communication is applied in the Web of Things using the HTML5 Cross Document Messaging API (HTML5CDM). In addition, we contribute an advanced implementation of HTML5CDM that brings a RESTful communication model to HTML5CDM and supports better concurrent message exchange, which we believe will be of much benefit to web developers. Finally, a time/space evaluation that measures the CPU and memory usage of the developed HTML5CDM library is carried out, and the results demonstrate our implementation's practicability.
iHUB: an information and collaborative management platform for life sciences BIBAFull-Text 139-142
  David E. Salt; Mourad Ouzzani; Eduard C. Dragut; Peter Baker; Srivathsava Rangarajan
We describe ionomicshub, iHUB for short, a large scale cyber-infrastructure to support end-to-end research that aims to improve our understanding of how plants take up, transport and store their nutrient and toxic elements.
Measuring the effectiveness of multi-channel marketing campaigns using online chatter BIBAFull-Text 143-146
  Haggai Roitman; Gilad Barkai; David Konopnicki; Aya Soffer
Measuring the effectiveness of marketing campaigns across different channels is one of the most challenging tasks for today's brand marketers. Such measurement usually relies on a combination of key performance indicators (KPIs), used for assessing various aspects of marketing outcomes. Recently, with the availability of social-media sources, new options for collecting KPIs have become available and numerous social-media monitoring tools and services have emerged. Yet, given the vast media spectrum, which goes beyond social-media channels, existing solutions fail to generalize well, and the curation of marketing performance KPIs for most marketing channels still relies on labor-intensive means such as surveys and questionnaires. To address these challenges, we propose to demonstrate a novel solution we have developed at IBM: the Multi-channel Marketing Monitoring Platform (M3P for short). M3P is better tailored for the marketing performance domain, where online chatter is harnessed for effective collection of meaningful marketing KPIs across all possible channels. We describe M3P's main challenges and review some of its novel KPIs. We then describe the M3P solution, focusing on its KPI extraction process. Finally, we describe the planned demonstration using a real-world marketing use case.
Me-link: link me to the media -- fusing audio and visual cues for robust and efficient mobile media interaction BIBAFull-Text 147-150
  Chun-Yen Yeh; Yu-Ming Hsu; Hsinfu Huang; Hong-Wun Jheng; Yu-Chuan Su; Tzu-Hsuan Chiu; Winston Hsu
In this demo, we present a scalable mobile video recognition system, named "Me-link," based on progressive fusion of light-weight audio-visual features. With our system, users only have to point the mobile camera at the video they are interested in. The system captures the frames and sounds, then retrieves relevant information immediately. As the users hold the mobile device longer, the system progressively aggregates the cues temporally and returns more accurate results. We also consider noisy real-world environments, where users may not get clear visual or audio signals. In the aggregation step of audio and visual cues, our system automatically detects the available channel for the final ranking. On the server side, users can upload videos with information via a website. We also link the streaming signals so that users can get real-time broadcasts with "Me-link".
PRISM: a system for weighted multi-color browsing of fashion products BIBAFull-Text 151-154
  Donggeun Yoo; Kyunghyun Paeng; Sunggyun Park; Jungin Lee; Seungwook Paek; Sung-Eui Yoon; In So Kweon
Multiple color search technology helps users find fashion products in a more intuitive manner. Although fashion product images can be represented not only by a set of dominant colors but also by the relative ratio of colors, current online fashion shopping malls often provide rather simple color filters. In this demo, we present PRISM (Perceptual Representation of Image SiMilarity), a weighted multi-color browsing system for fashion product retrieval. Our system combines widely accepted backend web service stacks with various computer vision techniques, including product area parsing and a compact yet effective multi-color description. Finally, we demonstrate the benefits of the PRISM system via a web service in which users freely browse fashion products.
Online abusive users analytics through visualization BIBAFull-Text 155-158
  Anna Cinzia Squicciarini; Jules Dupont; Ruyan Chen
In this demo, we present Abuse User Analytics (AuA), an analytical framework aiming to provide key information about the behavior of online social network users. AuA efficiently processes data from users' discussions, and renders information about users' activities in an easy-to-understand graphical fashion with the goal of identifying deviant or abusive activities. Using animated graphics, AuA visualizes users' degree of abusiveness, measured by several key metrics, over user-selected time intervals. It is therefore possible to visualize how users' activities lead to complex interaction networks, and to highlight the degenerative connections among users and within certain threads.
AIDR: artificial intelligence for disaster response BIBAFull-Text 159-162
  Muhammad Imran; Carlos Castillo; Ji Lucas; Patrick Meier; Sarah Vieweg
We present AIDR (Artificial Intelligence for Disaster Response), a platform designed to perform automatic classification of crisis-related microblog communications. AIDR enables humans and machines to work together to apply human intelligence to large-scale data at high speed. The objective of AIDR is to classify messages that people post during disasters into a set of user-defined categories of information (e.g., "needs", "damage", etc.). For this purpose, the system continuously ingests data from Twitter, processes it (i.e., using machine learning classification techniques) and leverages human participation (through crowdsourcing) in real time. AIDR has been successfully tested to classify informative vs. non-informative tweets posted during the 2013 Pakistan Earthquake. Overall, we achieved a classification quality (measured using AUC) of 80%. AIDR is available at http://aidr.qcri.org/.
LiveCities: revealing the pulse of cities by location-based social networks venues and users analysis BIBAFull-Text 163-166
  Alberto Del Bimbo; Andrea Ferracani; Daniele Pezzatini; Federico D'Amato; Martina Sereni
It would be very difficult even for a resident to characterise the social dynamics of a city and to reveal to foreigners the evolving activity patterns which occur in its various areas. To address this problem, the large amount of data produced by location-based social networks (LBSNs) can be exploited and combined effectively with user profiling techniques. The key idea we introduce in this demo is to improve the classification of city areas and venues using semantics extracted both from places and from the online profiles of people who frequent those places. We present the results of our methodology in LiveCities, a web application which shows the hidden character of several Italian cities through clustering and information visualisation paradigms. In particular, we give in-depth insights into the city of Florence, IT, for which the majority of the data in our dataset have been collected. The system provides personal recommendations of areas and venues matching user interests and allows the free exploration of urban social dynamics in terms of people's lifestyles, business, demographics, transport etc., with the objective of uncovering the real 'pulse' of the city. We conducted a qualitative validation through an online questionnaire with 28 residents of Florence to understand the shared perception of city areas by its inhabitants and to check whether their mental maps align with our results. Our evaluation shows how also considering contextual semantics, such as people's interest profiles, in venue categorisation can improve clustering algorithms and give good insights into the endemic characteristics and behaviours of the detected areas.
CityBeat: real-time social media visualization of hyper-local city data BIBAFull-Text 167-170
  Chaolun Xia; Raz Schwartz; Ke Xie; Adam Krebs; Andrew Langdon; Jeremy Ting; Mor Naaman
With the increasing volume of location-annotated content from various social media platforms like Twitter, Instagram and Foursquare, we now have real-time access to people's daily documentation of local activities, interests and attention. In this demo paper, we present CityBeat, a real-time visualization of hyper-local social media content for cities. The main objective of CityBeat is to provide users -- with a specific focus on journalists -- with information about the city's goings-on, and alert them to unusual activities. The system collects a stream of geo-tagged photos as input, uses time series analysis and classification techniques to detect hyper-local events, and computes trends and statistics. The demo includes a visualization of this information that is designed to be installed on a large screen in a newsroom, as an ambient display.
Help yourself: a virtual self-assist system BIBAFull-Text 171-174
  Subhabrata Mukherjee; Sachindra Joshi
In this work, we describe an unsupervised framework for creating self-assist systems which can serve as virtual call center agents to guide the customer in performing different domain-dependent tasks (like troubleshooting a problem, changing settings etc.). We describe a framework for creating an intent graph from a corpus of knowledge articles from a given domain which is used in creating the dialogue system. To the best of our knowledge, this is the first work in creating virtual self-assist agents.
Exploring the web of spatial data with facete BIBAFull-Text 175-178
  Claus Stadler; Michael Martin; Sören Auer
The majority of data (including data published on the Web as Linked Open Data) has a spatial dimension. However, the efficient, user friendly exploration of spatial data remains a major challenge. We present Facete, a web-based exploration and visualization application enabling the spatial-faceted browsing of data with a spatial dimension. Facete implements a novel spatial data exploration paradigm based on the following three key components: First, a domain independent faceted filtering module, which operates directly on SPARQL and supports nested facets. Second, an algorithm that efficiently detects spatial information related to those resources that satisfy the facet selection. The detected relations are used for automatically presenting data on a map. And third, a workflow for making the map display interact with data sources that contain large amounts of geometric information. We demonstrate Facete in large-scale, real world application scenarios.
SocRoutes: safe routes based on tweet sentiments BIBAFull-Text 179-182
  Jaewoo Kim; Meeyoung Cha; Thomas Sandholm
Location-based services, and in particular personal navigation systems, have become increasingly popular with the widespread use of GPS technology in smart devices. Existing navigation systems are designed to suggest routes based on the shortest distance or the fastest time to a target. In this paper, we propose a new type of route navigation based on regional context -- primarily sentiments. Our system, called SocRoutes, aims to find a safer, friendlier, and more enjoyable route based on sentiments inferred from real-time, geotagged messages from Twitter. SocRoutes tailors routes by avoiding places with extremely negative sentiments, thereby potentially finding a safer and more enjoyable route with marginal increase in total distance compared to the shortest path. The system supports three types of traveling modes: walking, bicycling, and driving. We validated the idea based on crime history data from the City of Chicago Portal in December 2012, and sentiments extracted from geotagged tweets during the same time. We discovered that there was a significant correlation between regional Twitter posting sentiments and crime rate, in particular for high-crime and highly negative sentiment areas. We also demonstrated that SocRoutes, by solely utilizing social media sentiments, can find routes that bypass crime hotspots.

WWW 2014 tutorials

Scalability and efficiency challenges in large-scale web search engines BIBAFull-Text 185-186
  Ricardo Baeza-Yates; B. Barla Cambazoglu
The main goals of a web search engine are quality, efficiency, and scalability. In this tutorial, we focus on the last two goals, providing a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, the tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in these components are presented at four different granularities: at the level of a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points at open research problems and provides recommendations to researchers who are new to the field.
Concept-level sentiment analysis: a world wide web conference 2014 tutorial BIBAFull-Text 187-188
  Erik Cambria
The WWW'14 tutorial on Concept-Level Sentiment Analysis aims to provide its participants with the means to efficiently design models, techniques, tools, and services for concept-level sentiment analysis and their commercial realizations. The tutorial draws on insights resulting from the recent IEEE Intelligent Systems special issues on Concept-Level Opinion and Sentiment Analysis and the IEEE CIM special issue on Computational Intelligence for Natural Language Processing. The tutorial includes a hands-on session to illustrate how to build a concept-level opinion-mining engine step-by-step, from semantic parsing to concept-level reasoning.
E-commerce product search: personalization, diversification, and beyond BIBAFull-Text 189-190
  Atish Das Sarma; Nish Parikh; Neel Sundaresan
The focus of this tutorial is e-commerce product search. Several challenges appear in this context, from both a research standpoint and an application standpoint. We present various approaches adopted in the industry, review well-known research techniques developed over the last decade, draw parallels to traditional web search highlighting the new challenges in this setting, and dig deep into some of the algorithmic and technical approaches developed. A specific approach considered here, one that advances theoretical techniques and illustrates practical impact, is that of quickly identifying the most suitable results from a large database. Settings span cold-start users and advanced users for whom personalization is possible. In this context, top-k and skylines are discussed as they form a key approach that spans the web, data mining, and database communities. These present powerful tools for search across multi-dimensional items with clear preferences within each attribute, as in product search as opposed to regular web search.
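As an illustration of the two notions named in the abstract above, the following is a minimal sketch of a skyline query and a weighted top-k query over multi-attribute products. It is not taken from the tutorial: the function names, the quadratic skyline scan, the linear scoring function, and the convention that lower values are preferred in every attribute are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Item = Tuple[float, ...]  # one product, e.g. (price, delivery_days); lower is better


def dominates(p: Item, q: Item) -> bool:
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(items: Sequence[Item]) -> List[Item]:
    """Items not dominated by any other item (naive O(n^2) scan)."""
    return [p for p in items if not any(dominates(q, p) for q in items)]


def top_k(items: Sequence[Item], weights: Sequence[float], k: int) -> List[Item]:
    """The k items with the lowest weighted sum of attributes (assumed scoring)."""
    def score(p: Item) -> float:
        return sum(w * v for w, v in zip(weights, p))
    return sorted(items, key=score)[:k]


# Hypothetical catalogue of (price, delivery_days) pairs.
products = [(10, 5), (12, 3), (15, 6), (9, 9)]
sky = skyline(products)
best_two = top_k(products, weights=(1.0, 1.0), k=2)
```

The skyline needs no weights and returns every product that is a best trade-off for some preference, whereas top-k commits to one scoring function and returns a fixed number of results.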
Online learning and linked data: lessons learned and best practices BIBAFull-Text 191-192
  John Domingue; Alexander Mikroyannidis; Stefan Dietze
Following the latest developments in online learning and Linked Data, the scope of this tutorial will be two-fold: 1. New online learning methods will be taught for supporting the teaching of Linked Data. Additionally, the lessons learned and the best practices derived from designing and delivering a Linked Data curriculum by the EUCLID project will be discussed. 2. Ways in which Linked Data principles and technologies can be used to support online learning and create innovative educational services will be explained, based on the experience gained in developing existing Linked Data applications for online learning. We will in particular rely on the data catalogue, use cases and applications considered by the LinkedUp project.
Towards a social media analytics platform: event detection and user profiling for Twitter BIBAFull-Text 193-194
  Manish Gupta; Rui Li; Kevin Chen-Chuan Chang
Microblog data differs significantly from traditional text data along a variety of dimensions. Microblog data contains short documents and SMS-like language, and is full of code mixing. Though a lot of it is mere social babble, it also contains fresh news coming from human sensors at an enormous rate. Given such interesting characteristics, the world wide web community has recently witnessed a large number of research tasks for microblogging platforms. Event detection on Twitter is one of the most popular such tasks, with a large number of applications. The proposed tutorial on social analytics for Twitter will contain three parts. In the first part, we will discuss research efforts towards detection of events from Twitter using both the tweet content and other external sources. We will also discuss various applications for which event detection mechanisms have been put to use. Merely detecting events is not enough. Applications require that the detector be able to provide a good description of the event as well. In the second part, we will focus on describing events using the best phrase, event type, event timespan, and credibility. In the third part, we will discuss user profiling for Twitter with a special focus on user location prediction. We will conclude with a summary and thoughts on future directions.
Tutorial on social recommender systems BIBAFull-Text 195-196
  Ido Guy
In recent years, with the proliferation of the social web, users have been exposed to an intensively growing social overload. Social recommender systems aim to address this overload and are becoming an integral part of virtually any leading website, playing a key role in its success. In this tutorial, we will review the broad domain of social recommender systems, the underlying techniques and methodologies; the data in use, recommended entities, and target population; evaluation techniques; applications; and open issues and challenges.
The mobile semantic web BIBAFull-Text 197-198
  Shonali Krishnaswamy; Yuan-Fang Li
The combination of the versatility of smart devices and the capabilities of semantic technologies forms a great foundation for a mobile Semantic Web that will contribute to further realising the true potential of both disciplines. Motivated by a service discovery and matchmaking example, this tutorial provides an overview of background knowledge in ontology languages, basic reasoning problems, and how they are applicable in the mobile environment. It aims at presenting a timely survey of state-of-the-art development and challenges on mobile ontology reasoning, focusing on the reasoning and optimization techniques developed in the mTableaux framework. Finally, the tutorial closes with a summary of important research problems and an outlook of future research directions in this area.
Social spam, campaigns, misinformation and crowdturfing BIBAFull-Text 199-200
  Kyumin Lee; James Caverlee; Calton Pu
This tutorial will introduce peer-reviewed research work on information quality in social systems. Specifically, we will address new threats such as social spam, campaigns, misinformation and crowdturfing, and give an overview of modern techniques to improve information quality by revealing and detecting malicious participants (e.g., social spammers, content polluters and crowdturfers) and low-quality content.
Re-using media on the web BIBFull-Text 201-202
  Lyndon Nixon; Vasileios Mezaris; Raphael Troncy
Entity resolution in the web of data BIBAFull-Text 203-204
  Kostas Stefanidis; Vasilis Efthymiou; Melanie Herschel; Vassilis Christophides
This tutorial provides an overview of the key research results in the area of entity resolution that are relevant to addressing the new challenges in entity resolution posed by the Web of data, in which real-world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both for increasing dataset linking and for searching the Web of data for entities and their relations.
Computational models for social influence analysis: [extended abstract] BIBAFull-Text 205-206
  Jie Tang; Jimeng Sun
Social influence occurs when one's opinions, emotions, or behaviors are affected by others, intentionally or unintentionally. In this article, we survey recent research progress on social influence analysis. In particular, we first give a brief overview of related background knowledge, and then discuss what social influence is. We try to answer this question in terms of homophily and the process of influence and selection. After that, we focus on describing computational models for social influence, including models for influence probability learning and influence diffusion. Finally, we discuss potential applications of social influence.
Trust in social computing BIBAFull-Text 207-208
  Jiliang Tang; Huan Liu
The rapid development of social media exacerbates the information overload and credibility problems. Trust, which indicates with whom we can safely share information and from whom we can accept information, plays an important role in helping users collect relevant and reliable information in social media. Trust has become a research topic of increasing importance and of practical significance. In this tutorial, we illustrate properties and representation models of trust, elucidate trust measurements with representative algorithms, and demonstrate real-world applications where trust is explicitly used. As a new dimension of the trust study, we discuss the concept of distrust and its roles in trust measurements and applications.
Learning to efficiently rank on big data BIBAFull-Text 209-210
  Lidan Wang; Jimmy Lin; Donald Metzler; Jiawei Han
Ranking in response to user queries is a central problem in information retrieval, data mining, and machine learning. In the era of "Big data", traditional effectiveness-centric ranking techniques tend to get more and more costly (requiring additional hardware and energy costs) to sustain reasonable ranking speed on large data. The mentality of combating big data by throwing in more hardware/machines will quickly become highly expensive since data is growing at an extremely fast rate oblivious to any cost concerns from us. "Learning to efficiently rank" offers a cost-effective solution to ranking on large data (e.g., billions of documents). That is, it addresses a critically important question -- whether it is possible to improve ranking effectiveness on large data without incurring (too much) additional cost?

WWW 2014 posters

How effectively can we form opinions? BIBAFull-Text 213-214
  AmirMahdi Ahmadinejad; Sina Dehghani; MohammadTaghi Hajiaghayi; Hamid Mahini; Saeed Seddighin; Sadra Yazdanbod
People make decisions and express their opinions according to their communities. An appropriate idea for controlling the diffusion of an opinion is to find influential people, and employ them to spread the desired opinion. We investigate an influencing problem in which individuals' opinions are affected by their friends according to the model of Friedkin and Johnsen [4]. Our goal is to design efficient algorithms for finding opinion leaders such that changing their opinions has a great impact on the overall opinion of the society.
   We define a set of problems such as maximizing the sum of individual opinions or maximizing the number of individuals whose opinions are above a threshold. We discuss the complexity of the defined problems and design optimal algorithms for the non-NP-hard variants of the problems. Furthermore, we run simulations on real-world social network data and show that our proposed algorithm outperforms classical algorithms such as degree-based, closeness-based, and PageRank-based algorithms.
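For readers unfamiliar with the opinion model mentioned in the abstract above, the following is a minimal sketch of one common formulation of Friedkin-Johnsen dynamics, in which each person repeatedly mixes a fixed internal opinion with the weighted average of friends' expressed opinions. It is not the poster's algorithm: the function name, the single shared susceptibility parameter, the fixed round count, and the toy network are assumptions made for the example.

```python
from typing import List, Sequence


def friedkin_johnsen(
    weights: Sequence[Sequence[float]],  # weights[i][j]: influence of j on i (assumed non-negative)
    internal: Sequence[float],           # fixed internal opinions s_i
    susceptibility: float = 0.5,         # assumed shared weight on friends' opinions
    rounds: int = 200,                   # assumed number of synchronous updates
) -> List[float]:
    """Iterate x_i <- (1 - a) * s_i + a * (weighted mean of friends' x_j)."""
    expressed = list(internal)
    for _ in range(rounds):
        updated = []
        for i, row in enumerate(weights):
            total = sum(row)
            if total == 0:
                # No friends: the expressed opinion stays at the internal one.
                updated.append(internal[i])
                continue
            social = sum(w * x for w, x in zip(row, expressed)) / total
            updated.append((1 - susceptibility) * internal[i] + susceptibility * social)
        expressed = updated
    return expressed


# Hypothetical network: persons 0 and 1 are mutual friends, person 2 is isolated.
w = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
opinions = friedkin_johnsen(w, internal=[0.0, 1.0, 0.7])
total_opinion = sum(opinions)
```

Under this formulation, an objective such as "maximize the sum of individual opinions" corresponds to choosing which internal opinions to change so that `total_opinion` at equilibrium is as large as possible.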
ComPAS: maximizing data availability with replication in ad-hoc social networks BIBAFull-Text 215-216
  Ahmedin Mohammed Ahmed; Qiuyuan Yang; Nana Yaw Asabere; Tie Qiu; Feng Xia
Although existing replica allocation protocols perform well in most cases, some challenges still need to be addressed to further improve their performance. The success of such protocols for Ad-hoc Social Networks (ASNETs) depends on the performance of data accessibility and on the easy consistency management of available replicas. We contribute to this line of research with a replication protocol for maximizing data availability. Essentially, we propose ComPAS, a community-partitioning aware replica allocation method. Its goals include integrating social relationships when placing copies of the data in the community, to achieve better efficiency and consistency by keeping the replica read cost, relocation cost and traffic as low as possible.
Discovering and learning sensational episodes of news events BIBAFull-Text 217-218
  Xiang Ao; Ping Luo; Chengkai Li; Fuzhen Zhuang; Qing He; Zhongzhi Shi
This paper studies the problem of discovering and learning sensational 2-episodes, i.e., pairs of co-occurring news events. To find all frequent episodes, we propose an efficient algorithm, MEELO, which significantly outperforms conventional methods. Given many frequent episodes, we rank them by their sensational effect. Instead of limiting ourselves to any individual subjective measure of sensational effect, we propose a learning-to-rank approach that exploits multiple features to capture the sensational effect of an episode from various aspects. An experimental study on real data verified our approach's efficiency and effectiveness.
Towards semantic faceted search BIBAFull-Text 219-220
  Marcelo Arenas; Bernardo Cuenca Grau; Evgeny Evgeny; Sarunas Marciuska; Dmitriy Zheleznyakov
In this paper we present limitations of conventional faceted search in the way data, facets, and queries are modelled. We discuss how these limitations can be addressed with Semantic Web technologies such as RDF, OWL 2, and SPARQL 1.1. We also present a system, SemFacet, that is a proof-of-concept prototype of our approach implemented on top of the Yago knowledge base, and powered by the OWL 2 RL triple store RDFox and the full-text search engine Lucene.
Metadata-driven hypertext content publishing and styling BIBAFull-Text 221-222
  Xi Bai; Armin Haller; Ewan Klein; Dave Robertson
A growing number of approaches and tools have been used to generate hypertext content with embedded metadata. However, little work has been carried out on finding a generic solution for publishing and styling Web pages with annotations derived from existing RDF data sets available in various formats. This paper proposes a metadata-driven publishing framework that assists publishers or webmasters in generating semantically-enriched content (HTML pages or snippets) by harnessing distributed RDF(a) documents or repositories with little human intervention. This framework also helps users to create and share so-called micro-themes, which are applicable to the above generated content for the purpose of page styling and are also highly reusable thanks to the adopted semantic attribute selectors.
Collective attention to social media evolves according to diffusion models BIBAFull-Text 223-224
  Christian Bauckhage; Kristian Kersting; Bashir Rastegarpanah
We investigate patterns of adoption of 175 social media services and Web businesses using data from Google Trends. For each service, we collect aggregated search frequencies from 45 countries as well as global averages. This results in more than 8,000 time series which we analyze using economic diffusion models. The models are found to provide accurate and statistically significant fits to the data and show that collective attention to social media grows and subsides in a highly regular manner. Regularities persist across regions, cultures, and topics and thus hint at general mechanisms that govern the adoption of Web-based services.
Acquiring commonsense knowledge for sentiment analysis using human computation BIBAFull-Text 225-226
  Marina Boia; Claudiu Cristian Musat; Boi Faltings
Many Artificial Intelligence tasks need commonsense knowledge. Extracting this knowledge with statistical methods would require huge amounts of data, so human computation offers a better alternative. We acquire contextual knowledge for sentiment analysis by asking workers to indicate the contexts that influence the polarities of sentiment words. The increased complexity of the task causes some workers to give superficial answers. To increase motivation, we make the task more engaging by packaging it as a game. With the knowledge compiled from only a small set of answers, we already halve the gap between machine and human performance. This proves the strong potential of human computation for acquiring commonsense knowledge.
BUbiNG: massive crawling for the masses BIBAFull-Text 227-228
  Paolo Boldi; Andrea Marino; Massimo Santini; Sebastiano Vigna
Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine tools and at the same time scales linearly with the amount of resources available. This paper aims at filling this gap.
Status and friendship: mechanisms of social network evolution BIBAFull-Text 229-230
  Christina Brandt; Jure Leskovec
We examine the evolution of five social networking sites where complex networks of social relationships developed: Twitter, Flickr, DeviantArt, Delicious, and Yahoo! Answers. We study the differences and similarities in edge creation mechanisms in these social networks. We find large differences in edge reciprocation rates and overall structure of the underlying networks. We demonstrate that two mechanisms can explain these disparities: directed triadic closure, which leads to networks that show characteristics of status-oriented behavior, and reciprocation, which leads to friendship-oriented behavior. We develop a model that demonstrates how variances in these mechanisms lead to characteristic differences in the expression of network subgraph motifs. Lastly, we show how a user's future popularity, her indegree, can be predicted based on her initial edge creation behavior.
A wiki way of programming for the web of data BIBAFull-Text 231-232
  Pavel Arapov; Michel Buffa; Amel Ben Othmane
WikiNEXT is a wiki engine that enables users to rapidly write applications directly from the browser, in particular applications that can exploit the web of data. WikiNEXT relies on semantic web formalisms and technologies (RDF/RDFa lite) to describe wiki page content and embedded metadata, and to manipulate them (for example, using the SPARQL language). WikiNEXT is a mix between a web-based IDE (Integrated Development Environment) and a semantic wiki. It embeds several editors (a WYSIWYG editor, and an HTML/JavaScript editor + a JavaScript library manager) for coding in the browser, provides an API for exploiting semantic metadata, and uses a graph-based data store and an object-oriented database for persistence on the server side. It has been specially designed for writing online programming tutorials (e.g., an HTML5 tutorial, a semantic web tutorial on how to consume linked data, etc.), or more generally for developing web applications that can be mixed with more classical wiki documents (in fact all WikiNEXT pages are web applications). The tool is online and open source; screencasts are available on YouTube (look for 'WikiNEXT').
The (un)supervised detection of overlapping communities as well as hubs and outliers via (Bayesian) NMF BIBAFull-Text 233-234
  Xiaochun Cao; Xiao Wang; Di Jin; Yixin Cao; Dongxiao He
The detection of communities in various networks has been considered by many researchers. Moreover, it is preferable for a community detection method to detect hubs and outliers as well. This becomes even more interesting and challenging under the unsupervised assumption, that is, when we do not assume prior knowledge of the number K of communities. In this poster, we define a novel model to identify overlapping communities as well as hubs and outliers. When K is given, we propose a normalized symmetric nonnegative matrix factorization algorithm to learn the parameters of the model. Otherwise, we introduce a Bayesian symmetric nonnegative matrix factorization to learn the parameters of the model, while determining K. Our experiments indicate its superior performance on various networks.
Recommendation for advertising messages on mobile devices BIBAFull-Text 235-236
  Chih-Chun Chan; Yu-Chieh Lin; Ming-Syan Chen
Mobile devices, especially smart phones, have become popular in recent years. With users spending much time on mobile devices, service providers deliver advertising messages to mobile device users in the hope of increasing their revenue. However, delivering appropriate advertising messages is challenging, since advertising strategies for TV, SMS, or websites may not apply to banner-based advertising on mobile devices. In this work, we study how to properly recommend advertising messages for mobile device users. We propose a novel approach which simultaneously considers several important factors: user profile, apps used, and clicking history. We conduct experiments on real-world mobile log data, and the results demonstrate the effectiveness of the proposed approach.
A pruning algorithm for optimal diversified search BIBAFull-Text 237-238
  Fei Chen; Yiqun Liu; Jian Li; Min Zhang; Shaoping Ma
Given a number of possible sub-intents (also called subtopics) for a certain query and their corresponding search results, diversified search aims to return a single result list that could satisfy as many users' intents as possible. Previous studies have demonstrated that finding the optimal solution for diversified search is NP-hard. Therefore, several algorithms have been proposed to obtain a locally optimal ranking with greedy approximations. In this paper, a pruned exhaustive search algorithm is proposed to decrease the complexity of the optimal search for the diversified search problem. Experimental results indicate that the proposed algorithm can decrease the computational complexity of exhaustive search without any performance loss.
Sentiment-enhanced explanation of product recommendations BIBAFull-Text 239-240
  Li Chen; Feng Wang
Because of the important role of product reviews in users' decision process, we propose a novel explanation interface that fuses the feature sentiments extracted from reviews into the explanation of recommendations. In addition, it can explain multiple items together by revealing their similarity with respect to feature sentiments as well as static specifications, so as to support users in making tradeoffs. Compared with existing work, we believe that this interface can be more effective, trustworthy, and persuasive.
Finding local experts on Twitter BIBAFull-Text 241-242
  Zhiyuan Cheng; James Caverlee; Himanshu Barthwal; Vandana Bachani
We address the problem of identifying local experts on Twitter. Specifically, we propose a local expertise framework that integrates both users' topical expertise and their local authority by leveraging over 15 million geo-tagged Twitter lists. We evaluate the proposed approach across 16 queries coupled with over 2,000 individual judgments from Amazon Mechanical Turk. Our initial experiments find significant improvement over a naive local expert finding approach, suggesting the promise of exploiting geo-tagged Twitter lists for local expert finding.
Adaptive presentation of linked data on mobile BIBAFull-Text 243-244
  Luca Costabello; Fabien Gandon
We present PRISSMA, a context-aware presentation layer for Linked Data. PRISSMA extends the Fresnel vocabulary with the notion of mobile context. It also includes an algorithm that determines whether the sensed context is compatible with some context declarations.
Recommending without short head BIBAFull-Text 245-246
  Paolo Cremonesi; Franca Garzotto; Roberto Pagano; Massimo Quadrana
We discuss a comprehensive study exploring the impact of recommender systems when recommendations are forced to omit popular items (short head) and to use niche products only (long tail). This is an interesting issue in domains, such as e-tourism, where product availability is constrained, the most popular items ("best sellers") are the first ones to be consumed, and the short head may eventually become unavailable for recommendation purposes. Our work provides evidence that the effects resulting from item consumption may increase the utility of personalized recommendations.
The "expression gap": do you like what you share? BIBAFull-Text 247-248
  Atish Das Sarma; Si Si; Elizabeth F. Churchill; Neel Sundaresan
While recommendation profiles increasingly leverage social actions such as "shares", the predictive significance of such actions is unclear. To what extent do public shares correlate with other online behaviors such as searches, views and purchases? Based on an analysis of 950,000 users' behavioral, transactional, and social sharing data on a global online commerce platform, we show that social "shares", or publicly posted expressions of interest, do not correlate with non-public behaviors such as views and purchases. A key takeaway is that there is a "gap" between public and non-public actions online, suggesting that marketers and advertisers need to be cautious in their estimation of the significance of social sharing.
RDF mapping rules refinements according to data consumers' feedback BIBAFull-Text 249-250
  Anastasia Dimou; Miel Vander Sande; Tom De Nies; Erik Mannens; Rik Van de Walle
The missing feedback loop is considered the reason for broken Data Cycles in current Linked Open Data ecosystems. Read-Write platforms have been proposed, but they are restricted to capturing modifications after the data is released as Linked Data. Triggering a new iteration, though, results in losing the data consumers' modifications, as a new version of the source data is mapped, overwriting the currently published version. We propose a prime solution that interprets the data consumers' feedback to update the mapping rules. This way, data publishers initiate a new iteration of the Data Cycle considering the data consumers' feedback when they map a new version of the published data.
How social is social tagging? BIBAFull-Text 251-252
  Stephan Doerfel; Daniel Zoller; Philipp Singer; Thomas Niebler; Andreas Hotho; Markus Strohmaier
Social tagging systems have established themselves as an important part of today's web and have attracted the interest of our research community in a variety of investigations. This has led to several assumptions about tagging, such as that tagging systems exhibit a social component. In this work we overcome the previous absence of data for testing such an assumption. We thoroughly study social interaction, leveraging for the first time live log data gathered from the real-world public social tagging system bibs. Our results indicate that sharing of resources constitutes an important and indeed social aspect of tagging.
Who am I on Twitter?: a cross-country comparison BIBAFull-Text 253-254
  Wei Dong; Minghui Qiu; Feida Zhu
Users often manage which aspects of their personal identities are manifested on social network sites (SNS). Thus, the content of personal information disclosed on users' profiles can be influenced by a number of factors, such as motivation for using SNS and privacy concerns, both of which may vary depending on where users reside. In this study, we compared the content of 2800 United States (US) and Singapore (SG) Twitter users' bios on their profile pages. We found US Twitter users were far more likely to disclose personal information that may reveal their true identity than SG users. The between-country difference remained after we took bio length and user activity level into account. The results provide important insights for future studies of users' privacy concerns in different regions of the world.
DBLP-filter: effectively search on the DBLP bibliography BIBAFull-Text 255-256
  Jiang Du; Peiquan Jin; Lizhou Zheng; Shouhong Wan; Lihua Yue
DBLP is a well-known online computer science bibliography. As nearly all important journals and conferences on computer science are tracked in DBLP, how to effectively search DBLP records has become a valuable topic for the computer science community. In this paper we present DBLP-Filter, a new DBLP search tool. The major features of DBLP-Filter are: (1) it provides new search options on concepts and literature importance; (2) it can maintain user profiles and can support user-area-aware search; (3) it provides alerts for new literature. Compared with the existing DBLP search tools, DBLP-Filter is more functional and also shows better effectiveness in terms of MAP and F-measure when tested under a set of randomly-selected queries.
Perceptron-based tagging of query boundaries for Chinese query segmentation BIBAFull-Text 257-258
  Jingfei Du; Yan Song; Chi-Ho Li
Query boundaries carry useful information for query segmentation, especially when analyzing queries in a language written without spaces between words, e.g., Chinese. This paper presents our research on Chinese query segmentation, in which an averaged perceptron models query boundaries through an L-R tagging scheme on a large number of unlabeled queries. Experimental results indicate that query boundaries are very informative and that they significantly improve supervised Chinese query segmentation when labeled training data is very limited.
RESTful open workflows for data provenance and reuse BIBAFull-Text 259-260
  Kai Eckert; Dominique Ritze; Konstantin Baierer; Christian Bizer
In this paper, we present a workflow model together with an implementation following the Linked Data principles and the principles for RESTful web services. By means of RDF-based specifications of web services, workflows, and runtime information, we establish a full provenance chain for all resources created within these workflows.
What's all the data about?: creating structured profiles of linked data on the web BIBFull-Text 261-262
  Besnik Fetahu; Stefan Dietze; Bernardo Pereira Nunes; Marco Antonio Casanova; Davide Taibi; Wolfgang Nejdl
De-anonymizing social graphs via node similarity BIBAFull-Text 263-264
  Hao Fu; Aston Zhang; Xing Xie
Recently, a number of anonymization algorithms have been developed to protect the privacy of social graph data. However, in order to satisfy a higher level of privacy requirements, it is sometimes impossible to maintain sufficient utility. Is it really easy to de-anonymize "lightly" anonymized social graphs? Here "light" anonymization algorithms stand for those algorithms that maintain higher data utility. To answer this question, we proposed a de-anonymization algorithm based on a node similarity measurement. Using the proposed algorithm, we evaluated the privacy risk of several "light" anonymization algorithms on real datasets.
Contextual insights BIBAFull-Text 265-266
  Ariel Fuxman; Patrick Pantel; Yuanhua Lv; Ashok Chandra; Pradeep Chilakamarri; Michael Gamon; David Hamilton; Bernhard Kohlmeier; Dhyanesh Narayanan; Evangelos Papalexakis; Bo Zhao
In today's productivity environment, users are constantly researching topics while consuming or authoring content in applications such as e-readers, word processors, presentation programs, or social networks. However, none of these applications sufficiently enable users to do their research directly within the application. In fact, users typically have to switch to a browser and write a query on a search engine. Switching to a search engine is distracting and hurts productivity. Furthermore, the main problem is that the search engine is not aware of important user context such as the book that they are reading or the document they are authoring. To tackle this problem, we introduce the notion of contextual insights: providing users with information that is contextually relevant to the content that they are consuming or authoring. We then present Leibniz, a system that provides a solution for the contextual insights problem.
Partout: a distributed engine for efficient RDF processing BIBAFull-Text 267-268
  Luis Galárraga; Katja Hose; Ralf Schenkel
The increasing interest in Semantic Web technologies has led not only to a rapid growth of semantic data on the Web but also to an increasing number of backend applications relying on efficient query processing. Confronted with such a trend, existing centralized state-of-the-art systems for storing RDF and processing SPARQL queries are no longer sufficient. In this paper, we introduce Partout, a distributed engine for fast RDF processing in a cluster of machines. We propose an effective approach for fragmenting RDF data sets based on a query log and allocating the fragments to hosts in a cluster of machines. Furthermore, Partout's query optimizer produces efficient query execution plans for ad-hoc SPARQL queries.
Effective and effortless features for popularity prediction in microblogging network BIBAFull-Text 269-270
  Shuai Gao; Jun Ma; Zhumin Chen
Predicting the popularity of online content is of remarkable practical value in various business and administrative applications. Existing studies mainly focus on finding the most effective features for prediction. However, some effective features, such as structural features which are extracted from the underlying user network, are hard to access. In this paper, we aim to identify features that are both effective and effortless (easy to obtain or compute). Experiments on Sina Weibo show the effectiveness and effortlessness of the temporal features, and satisfactory prediction performance can be obtained based on only the temporal features of the first 10 retweets.
A topic based document relevance ranking model BIBAFull-Text 271-272
  Yang Gao; Yue Xu; Yuefeng Li
Topic modelling has been widely used in the fields of information retrieval, text mining, machine learning, etc. In this paper, we propose a novel model, Pattern Enhanced Topic Model (PETM), which improves topic modelling by semantically representing topics with discriminative patterns. It also contributes to information filtering by using the proposed PETM to determine document relevance based on the topic distribution and maximum matched patterns proposed in this paper. Extensive experiments are conducted to evaluate the effectiveness of PETM by using the TREC data collection Reuters Corpus Volume 1. The results show that the proposed model significantly outperforms both state-of-the-art term-based models and pattern-based models.
GLASE-IRUKA: gaze feedback improves satisfaction in exploratory image search BIBAFull-Text 273-274
  Viktors Garkavijs; Rika Okamoto; Tetsuo Ishikawa; Mayumi Toshima; Noriko Kando
We propose two methods for exploratory image search systems using gaze data for continuous learning of the users' interests and relevance calculation. The first system uses the fixation time over the images selected by gaze in the search results pages, whereas the second one utilizes the fixation time over the clicked images and fixations over the non-selected images on the results page. A user model is trained and continuously updated from the gaze input throughout the whole session in both systems. We conducted an experiment with 24 users, each performing four search tasks using the proposed systems, and compared the results to a baseline system, which does not employ any gaze data. Users of the gaze feedback system viewed 22.35% more images than the users of the baseline system. A high correlation between the number of saved images and the satisfaction with the results was observed in data collected from the users of a mouse feedback system enriched by gaze data. The results show that including the gaze data in the relevance calculation in both cases increases the degree of satisfaction with the search results compared with the baseline.
A semi-supervised method for opinion target extraction BIBAFull-Text 275-276
  Tao Ge; Wenjie Li; Zhifang Sui
This paper proposes a semi-supervised self-learning method, which is based on a Naive Bayes classifier exploiting context features and PMI scores, to extract opinion targets. The experimental results indicate our bootstrapping framework is effective for this task and outperforms the state-of-the-art models on COAE2008 dataset2, especially in precision.
Localized CAPTCHA testing on users and farms BIBAFull-Text 277-278
  Ekaterina Gladkikh; Kirill Nikolaev; Mikhail Nikitin
The paper describes the experience of resisting the large-scale solving of CAPTCHAs through CAPTCHA farms and presents the results of experimenting with different types of textual CAPTCHA on farm workers and casual users. Localizing the CAPTCHA halved the absolute volume of CAPTCHA parsing, but introducing semantics into the test made it harder for casual users and did not prove promising.
Allocating tasks to workers with matching constraints: truthful mechanisms for crowdsourcing markets BIBAFull-Text 279-280
  Gagan Goel; Afshin Nikzad; Adish Singla
Designing optimal pricing policies and mechanisms for allocating tasks to workers is central to online crowdsourcing markets. In this paper, we consider the following realistic setting of online crowdsourcing markets -- there is a requester with a limited budget and a heterogeneous set of tasks each requiring certain skills; there is a pool of workers and each worker has certain expertise and interests which define the set of tasks she can and is willing to do. Under the matching constraints given by this bipartite graph between workers and tasks, we design our incentive-compatible mechanism TM-Uniform, which allocates the tasks to the workers while ensuring budget feasibility and achieving near-optimal utility for the requester. Apart from strong theoretical guarantees, we carry out experiments on a realistic case study of a Wikipedia translation project on Mechanical Turk. We note that this is the first paper to address this setting from a mechanism design perspective.
People of opposing views can share common interests BIBAFull-Text 281-282
  Eduardo Graells-Garrido; Mounia Lalmas; Daniele Quercia
In online social networks, people tend to connect with like-minded people and read agreeable information. Direct recommendation of challenging content has not worked well because users do not value diversity and avoid challenging content. In this poster, we investigate the possibility of an indirect approach by introducing intermediary topics, which are topics that are common to people having opposing views on sensitive issues, i.e., those issues that tend to divide people. Through a case study about a sensitive issue discussed on Twitter, we show that such intermediary topics exist, opening a path for future work on recommendation that promotes diversity in the content shared.
Generating ad targeting rules using sparse principal component analysis with constraints BIBAFull-Text 283-284
  Mihajlo Grbovic; Slobodan Vucetic
Determining the right audience for an advertising campaign is a well-established problem, of central importance to many Internet companies. Two distinct targeting approaches exist: the model-based approach, which leverages machine learning, and the rule-based approach, which relies on manual generation of targeting rules. Common rules include identifying users that had interactions (website visits, emails received, etc.) with the companies related to the advertiser, or search queries related to their product. We consider the problem of discovering such rules from data using Constrained Sparse PCA. The constraints are put in place to account for cases when evidence in data suggests a relation that is not appropriate for advertising. Experiments on real-world data indicate the potential of the proposed approach.
Cross market modeling for query-entity matching BIBAFull-Text 285-286
  Manish Gupta; Prashant Borole; Praful Hebbar; Rupesh Mehta; Niranjan Nayak
Given a query, the query-entity (QE) matching task involves identifying the best matching entity for the query. When modeling this task as a binary classification problem, two issues arise: (1) features in specific global markets (like de-at: German users in Austria) are quite sparse compared to other markets like en-us, and (2) training data is expensive to obtain in multiple markets and hence limited. Can we leverage some form of cross market data/features for effective query-entity matching in sparse markets? Our solution consists of three main modules: (1) Cross Market Training Data Leverage (CMTDL), (2) Cross Market Feature Leverage (CMFL), and (3) Cross Market Output Data Leverage (CMODL). Each of these parts performs "signal" sharing at different points during the classification process. Using a combination of these strategies, we show significant improvements in query-impression weighted coverage for the query-entity matching task.
Dynamic provenance for SPARQL updates using named graphs BIBAFull-Text 287-288
  Harry Halpin; James Cheney
While the (Semantic) Web currently does have a way to exhibit static provenance information in the W3C PROV standards, the Web does not have a way to describe dynamic changes to data. While some provenance models and annotation techniques originally developed with databases or workflows in mind transfer readily to RDF, RDFS and SPARQL, these techniques do not readily adapt to describing changes in dynamic RDF datasets over time. In this paper we explore how to adapt the dynamic copy-paste provenance model of Buneman et al. to RDF datasets that change over time in response to SPARQL updates, how to represent the resulting provenance records themselves as RDF using named graphs in a manner compatible with W3C PROV, and how the provenance information can be provided as a SPARQL query. The primary contribution is a semantic framework that enables the semantics of SPARQL Update to be used as the basis for a 'cut-and-paste' provenance model in a principled manner.
Characterizing user interest using heterogeneous media BIBAFull-Text 289-290
  Jonghyun Han; Hyunju Lee
It is often hard to accurately estimate interests of social media users because their messages do not have additional information, such as a category. In this paper, we propose an approach that estimates user interest from social media to provide personalized services. Our approach employs heterogeneous media to map social media onto categories. To describe the categories, we propose a hybrid method that integrates a topic model with TF-ICF for extracting both explicitly presented and implicitly latent features. Our evaluation result shows that it gives the highest performance, compared to other approaches. Thus, we expect that the proposed approach is helpful in advancing personalization of social media services.
Bing-SF-IDF+: semantics-driven news recommendation BIBAFull-Text 291-292
  Frederik Hogenboom; Michel Capelle; Marnix Moerland; Flavius Frasincar
Content-based news recommendation is traditionally performed using the cosine similarity and TF-IDF weighting scheme for terms occurring in news messages and user profiles. Semantics-driven variants such as SF-IDF additionally take into account term meaning by exploiting synsets from semantic lexicons. However, they ignore the various semantic relationships between synsets, providing only for a limited understanding of news semantics. Moreover, semantics-based weighting techniques are not able to handle -- often crucial -- named entities, which are usually not present in semantic lexicons. Hence, we extend SF-IDF by also considering the synset semantic relationships, and by employing named entity similarities using Bing page counts. Our proposed method, Bing-SF-IDF+, outperforms TF-IDF and SF-IDF in terms of F1 scores and kappa statistics.
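As an illustration of the traditional baseline named in this abstract (TF-IDF weighting with cosine similarity), the following is a minimal sketch only. It is not the authors' Bing-SF-IDF+ method: the function names `tf_idf_vectors` and `cosine`, the tokenized toy documents, and the particular TF and IDF variants (relative term frequency, unsmoothed logarithmic IDF) are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times unsmoothed log inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy data: the first entry plays the role of a user profile.
docs = [
    ["stocks", "rise", "market"],
    ["market", "falls", "stocks"],
    ["football", "match", "result"],
]
profile, finance_news, sports_news = tf_idf_vectors(docs)
sim_finance = cosine(profile, finance_news)
sim_sports = cosine(profile, sports_news)
```

In this toy setup the finance item shares weighted terms with the profile and so scores higher than the sports item, which shares none.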
Inferring social relationships from mobile sensor data BIBAFull-Text 293-294
  Hsun-Ping Hsieh; Cheng-Te Li
Because mobile sensors are now ubiquitous, it is feasible to collect the geographical activities of human beings and to derive the geo-spatial interactions between people. Although there is an underlying social network between mobile users, such social relationships are hidden and held by service providers. Acquiring the social network over mobile users would enable many applications, such as friend recommendation and energy-saving mobile DB management. In this paper, we propose to infer the social relationships using the sensor data, which contains the encounter records between individuals, without any prior knowledge of the real friendships. We propose a two-phase prediction method for the social inference. Experiments conducted on the CRAWDAD data demonstrate encouraging results, with satisfactory precision and recall.
Inferring visiting time distributions of locations from incomplete check-in data BIBAFull-Text 295-296
  Hsun-Ping Hsieh; Cheng-Te Li
Online location-based services, such as Foursquare and Facebook, provide a great resource for location recommendation. Because time is an important factor in recommending places (the pleasure of visiting a place can be diminished by arriving at the wrong time), we propose to infer the visiting time distributions of locations. We assume the check-in data used is incomplete, because in real-world scenarios it is hard or impossible to collect all the temporal information of locations, and the check-in behaviors might be abnormal. To tackle this problem, we devise a visiting time inference framework, VisTime-Miner, which considers the route-based visiting correlation of time labels to model the visiting behavior of a location. Experiments on large-scale Gowalla check-in data show promising results.
Deriving latent social impulses to determine longevous videos BIBAFull-Text 297-298
  Qingbo Hu; Guan Wang; Philip S. Yu
Online video websites receive a huge number of videos daily from users all around the world. Providing valuable video recommendations to viewers is important for video websites. Previous studies focus on analyzing the view count of a video, which measures the video's value in terms of popularity. However, the long-lasting value of an online video, namely its longevity, is hidden in the history of how the video accumulates its "popularity" over time. Generally speaking, a longevous video tends to constantly draw society's attention. With a focus on YouTube, this paper proposes a scoring mechanism to quantify the longevity of videos. We introduce the concept of latent social impulses and use them to assess a video's longevity. In order to derive latent social impulses, we view the video website as a digital signal filter and formulate the task as a convex minimization problem. The proposed longevity computation is based on the derived social impulses. Unfortunately, the information required to derive social impulses is not always public, which prevents a third party from directly evaluating the longevity of all videos. Thus, we formulate a semi-supervised learning task that uses videos whose longevity scores are known to predict the unknown ones. We develop a Gaussian Random Markov Field model with Loopy Belief Propagation to solve it. The experiments on YouTube demonstrate that the proposed method significantly improves the prediction results compared to two baseline models.
Data imputation using a trust network for recommendation BIBAFull-Text 299-300
  Won-Seok Hwang; Shaoyu Li; Sang-Wook Kim; Kichun Lee
Recommendation methods suffer from the data sparsity and cold-start user problems, often resulting in low accuracy. To address these problems, we propose a novel imputation method, which effectively densifies a rating matrix by filling unevaluated ratings with probable values. In our method, we use a trust network to estimate the unevaluated ratings accurately. We conduct experiments on the Epinions dataset and demonstrate that our method helps provide better recommendation accuracy than previous methods, especially for cold-start users.
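The idea of densifying a rating matrix with the help of a trust network can be sketched as follows. This is a deliberately simple illustration and not the authors' imputation method: it assumes that a missing rating is filled with the mean rating given by the users the target user directly trusts, and the function name `impute_with_trust` and the toy data are hypothetical.

```python
def impute_with_trust(ratings, trust):
    """Fill missing ratings with the mean rating of directly trusted users.

    ratings: {user: {item: rating}} holding the observed ratings only.
    trust:   {user: set of users that this user trusts}.
    Returns a new, denser {user: {item: rating}} mapping. Imputed values are
    computed from observed ratings only, so they do not cascade.
    """
    items = {item for user_ratings in ratings.values() for item in user_ratings}
    dense = {user: dict(user_ratings) for user, user_ratings in ratings.items()}
    for user in ratings:
        for item in items:
            if item in ratings[user]:
                continue  # keep observed ratings untouched
            votes = [ratings[t][item]
                     for t in trust.get(user, ())
                     if item in ratings.get(t, {})]
            if votes:
                dense[user][item] = sum(votes) / len(votes)
    return dense


# Hypothetical toy data: "cy" is a cold-start user with no ratings at all.
ratings = {"ann": {"a": 5, "b": 3}, "bob": {"a": 4}, "cy": {}}
trust = {"cy": {"ann", "bob"}, "bob": {"ann"}}
dense = impute_with_trust(ratings, trust)
```

Here the cold-start user `cy` receives estimates for both items from the two trusted users, which is the kind of densification that a downstream recommender could then consume.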
Short-text representation using diffusion wavelets BIBAFull-Text 301-302
  Vidit Jain; Jay Mahadeokar
Usual text document representations such as tf-idf do not work well in classification tasks for short-text documents and across diverse data domains. Optimizing different representations for different data domains is infeasible in a practical setting on the Internet. Mining such representations from the data in an unsupervised manner is desirable. In this paper, we study a representation based on the multi-scale harmonic analysis of the term-term co-occurrence graph. This representation is not only sparse, but also leads to the discovery of semantically coherent topics in data. In our experiments on user-generated short documents, e.g., newsgroup messages, user comments, and meta-data, we found this representation to outperform other representations across different choices of classifiers. Similar improvements were also observed for data sets in the Chinese and Portuguese languages.
Trust prediction using positive, implicit, and negative information BIBAFull-Text 303-304
  Min-Hee Jang; Christos Faloutsos; Sang-Wook Kim
We propose a novel method to accurately predict the trust relationships of a target user even if he/she does not have much interaction information. The proposed method considers positive, implicit, and negative information of all users in a network based on belief propagation to predict trust relationships of a target user.
Detecting suspicious following behavior in multimillion-node social networks BIBAFull-Text 305-306
  Meng Jiang; Peng Cui; Alex Beutel; Christos Faloutsos; Shiqiang Yang
In a multimillion-node network of who-follows-whom like Twitter, since a high count of followers leads to higher profits, users have the incentive to boost their in-degree. Can we spot the suspicious following behavior, which may indicate zombie followers and suspicious followees? To answer the above question, we propose CatchSync, which exploits two tell-tale signs of the suspicious behavior: (a) synchronized behavior: the zombie followers have extremely similar following behavior patterns, because, say, they are generated by a script; and (b) abnormal behavior: their behavior pattern is very different from the majority. CatchSync introduces novel measures to quantify both concepts and catches the suspicious behavior. Moreover, we show it is effective in a real-world social network.
Cognitive resource-aware web service selection in mobile computing environments BIBAFull-Text 307-308
  Angel Jimenez-Molina; In-Young Ko
The proactive and spontaneous delivery of services for mobile users on the move can lead to the depletion of users' mental resources, affecting the normal processes of their physical activities. This is due to the competition for limited mental resources between the human-computer interactions required by services and the users' physical activities. To deal with this problem we propose a service selection method based on two theories from cognitive psychology. This mechanism assesses the degree of demand for mental resources by both the physical activities and the services. Additionally, a service binding and scheduling algorithm ensures less cognitively-taxing mobile service compositions.
Learning from unstructured multimedia data BIBAFull-Text 309-310
  Janani Kalyanam; Gert R. G. Lanckriet
Information in today's world is highly heterogeneous and unstructured. Learning and inferring from such data is challenging and is an active research topic. In this paper, we present and investigate an approach to learning from heterogeneous and unstructured multimedia data. Inspired by approaches in many fields including computer vision, we investigate a histogram based approach to represent multimodal unstructured data. While existing works have predominantly focused on histogram based approaches for unimodal data, we present a methodology to represent unstructured multimodal data. We explain how to discover the prototypical features or codewords over which these histograms are built. We present experimental results on classification and retrieval tasks performed on the histogram based representation.
Hierarchical interest graph from tweets BIBAFull-Text 311-312
  Pavan Kapanipathi; Prateek Jain; Chitra Venkataramani; Amit Sheth
Industry and researchers have identified numerous ways to monetize microblogs for personalization and recommendation. A common challenge across these different works is the identification of user interests. Although techniques have been developed to address this challenge, a flexible approach that spans multiple levels of granularity in user interests has not been forthcoming. In this work, we focus on exploiting hierarchical semantics of concepts to infer richer user interests expressed as a Hierarchical Interest Graph. To create such graphs, we utilize users' tweets to first ground potential user interests to structured background knowledge such as the Wikipedia Category Graph. We then adapt spreading activation theory to assign a user interest score to each category in the hierarchy. The Hierarchical Interest Graph comprises not only users' explicitly mentioned interests determined from Twitter, but also their implicit interest categories inferred from the background knowledge source.
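The spreading-activation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' scoring function: the `decay` factor, the `parents` mapping, and the category names are hypothetical, and the hierarchy is assumed to be acyclic (the real Wikipedia Category Graph would need cycle handling).

```python
def spread_activation(direct_scores, parents, decay=0.5):
    """Propagate interest scores upward through a category hierarchy.

    direct_scores: {category: score} for interests mentioned explicitly.
    parents: {category: [parent categories]}; assumed acyclic (hypothetical input).
    decay: fraction of activation passed to each parent (an assumed parameter).
    """
    scores = dict(direct_scores)
    frontier = dict(direct_scores)
    while frontier:
        next_frontier = {}
        for category, activation in frontier.items():
            for parent in parents.get(category, []):
                passed = activation * decay
                # Accumulate activation arriving from every child.
                scores[parent] = scores.get(parent, 0.0) + passed
                next_frontier[parent] = next_frontier.get(parent, 0.0) + passed
        frontier = next_frontier
    return scores


# Hypothetical example: two explicit interests sharing an ancestor.
example_scores = spread_activation(
    {"Jazz": 1.0, "Rock": 1.0},
    {"Jazz": ["Music"], "Rock": ["Music"], "Music": ["Arts"]},
)
```

In this toy hierarchy the implicit category "Music" receives activation from both explicit interests, which is the kind of inferred interest the abstract refers to.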
Cognitive search intents hidden behind queries: a user study on query formulations BIBAFull-Text 313-314
  Makoto P. Kato; Takehiro Yamamoto; Hiroaki Ohshima; Katsumi Tanaka
This study investigated query formulations by users with Cognitive Search Intents (CSIs), which are needs for the cognitive characteristics of documents to be retrieved, e.g., comprehensibility, subjectivity, and concreteness. We proposed an example-based method of specifying search intents to observe unbiased query formulations. Our user study revealed that about half of our subjects did not input any keywords representing CSIs, even though they were conscious of the given CSIs.
Identifying spreaders of malicious behaviors in online games BIBAFull-Text 315-316
  Youngjoon Ki; Jiyoung Woo; Huy Kang Kim
Massively multiplayer online role-playing games (MMORPGs) simulate the real world and require highly complex user interaction. Many human behaviors can be observed in MMORPGs, such as social interactions, economic behaviors, and malicious behaviors. In this study, we primarily focus on malicious behavior, especially cheating using game bots. Bots can be diffused on the social network in an epidemic style. When bots diffuse on the social network, a user's influence on the diffusion process varies owing to different characteristics and network positions. We aim to identify the influentials in the game world and investigate how they differ from normal users. Identifying the influentials in the diffusion of malicious behaviors will enable the game company to act proactively and preventively towards malicious users such as game bot users.
Link prediction based on generalized cluster information BIBAFull-Text 317-318
  Jungeun Kim; Minsoo Choy; Daehoon Kim; U. Kang
Understanding which new interactions among data objects are likely to occur in the future is crucial for a deeper understanding of network dynamics and evolution. This question is largely unexplored except from a local neighborhood perspective, partly owing to the difficulty of finding the major factors that heavily affect the link prediction problem. In this paper, we propose LPCSP, a novel link prediction method which exploits generalized cluster information containing cluster relations and cluster evolution information. Experiments show that our proposed LPCSP is accurate, scalable, and useful for link prediction on real-world graphs.
Finding informative Q&As on Twitter BIBAFull-Text 319-320
  Kanghak Kim; Sunho Lee; Jeonghoon Son; Meeyoung Cha
Question & Answer (Q&A) behaviors on social media have huge potential as a rich source of information and knowledge online. However, little is known about how much diversity exists in the topics covered in such Q&As and whether unstructured social media data can be made searchable. This paper examines the feasibility of utilizing social media data for developing a Q&A service by analyzing the topic coverage in Twitter conversations. We propose a new framework to automatically extract informative Q&A content using machine learning techniques.
Macro-level information transfer across social networks BIBAFull-Text 321-322
  Minkyoung Kim; David Newth; Peter Christen
This study proposes a model-free approach to infer macro-level information flow across online social systems in terms of the strength and directionality of influence among systems.
A computational analysis of agenda setting BIBAFull-Text 323-324
  Yeooul Kim; Suin Kim; Alejandro Jaimes; Alice Oh
Agenda setting theory explains how media affects its audience. While traditional media studies have done extensive research on agenda setting, there are important limitations in those studies, including using a small set of issues, running costly surveys of public interest, and manually categorizing the articles into positive and negative frames. In this paper, we propose to tackle these limitations with a computational approach and a large dataset of online news. Overall, we demonstrate how to carry out large-scale computational research on agenda setting with online news data using machine learning.
Investigating socio-cultural behavior of users reflected in different social channels on K-pop BIBAFull-Text 325-326
  Yonghwan Kim; Dahee Lee; Jung Eun Hahm; Namgi Han; Min Song
In this paper, we investigated the socio-cultural behavior of users reflected in two different social media channels, YouTube and Twitter. We conducted a comparative analysis of the networks generated from the two channels. The relationship we set for each network is relatedness on YouTube and co-links on Twitter. The results revealed that the social media channels influenced the distinct socio-cultural behaviors of their users. Specifically, the Twitter network showed the actual consumption of content in the field of K-pop culture better than the YouTube network. This study contributes a novel approach for exploring the socio-cultural behavior of users on social media.
Semantically enhanced keyword search for smartphones BIBAFull-Text 327-328
  Jihoon Ko; Sangjin Shin; Sungkwang Eom; Minjae Song; Dong-Hoon Shin; Kyong-Ho Lee; Yongil Jang
To apply semantic search to smartphones, we propose an efficient semantic search method based on a lightweight mobile ontology. Through a prototype implementation of a semantic search engine on an Android smartphone, experimental results show that the proposed method provides more accurate search results and a better user experience compared to the conventional method.
Motives for mass interactions in online sports viewing BIBAFull-Text 329-330
  Minsam Ko; Seung-woo Choi; Joonwon Lee; Subin Yang; Uichin Lee; Aviv Segev; Junehwa Song
Recent advances in social TV services allow sports fans to watch sports games at any place and to interact in a lively manner with other fans via online chatting. Despite their popularity, however, little is known so far about the key properties of such mass interactions in online sports viewing. In this paper, we explore motives for mass interactions in online sports viewing and investigate how the motives are related to viewing and chatting behaviors, by studying Naver Sports, the largest online sports viewing service in Korea.
The market of internet sponsored links in the context of competition law: can modeling help? BIBAFull-Text 331-332
  Natalia Kudryashova
The Internet search market for keywords attracts much attention in conjunction with the legal proceedings against Google. It has been recognized that legal argumentation alone may not be sufficient to disentangle the complexity of the case. An approach that includes mathematical modeling is needed to distinguish the effects of the factors intrinsic to the market from the consequences of anticompetitive practices. This paper proposes a modeling framework based on explicit treatment of users' switching between the search platforms in the environment set by the platforms' strategic decisions, and demonstrates some of its applications.
Photo recall: using the internet to label your photos BIBAFull-Text 333-334
  Neeraj Kumar; Steven M. Seitz
We describe a system for searching your personal photos using an extremely wide range of text queries, including dates and holidays ("Halloween"), named and categorical places ("Empire State Building" or "park"), events and occasions ("Radiohead concert" or "wedding"), activities ("skiing"), object categories ("whales"), attributes ("outdoors"), and object instances ("Mona Lisa"), and any combination of these -- all with no manual labeling required. We accomplish this by correlating information in your photos -- the timestamps, GPS locations, and image pixels -- to information mined from the Internet. This includes matching dates to holidays listed on Wikipedia, GPS coordinates to places listed on Wikimapia, places and dates to find named events using Google, visual categories using classifiers either pre-trained on ImageNet or trained on-the-fly using results from Google Image Search, and object instances using interest point-based matching, again using results from Google Images. We tie all of these disparate sources of information together in a unified way, allowing for fast and accurate searches using whatever information you remember about a photo.
Learning to predict trending queries: classification-based BIBAFull-Text 335-336
  Chi-Hoon Lee; HengShuai Yao; Xu He; Su Han Chan; JieYang Chang; Farzin Maghoul
Among the many tasks driven by very large-scale web search queries, an interesting one is to predict how likely queries about a topic are to become popular (a.k.a. trending or buzzing) as news in the near future, which is known as "detecting trending queries." This task is nontrivial since the realization of buzzing trends of queries often requires sufficient statistics from users' activities. To address this challenge, we propose a novel framework that predicts whether queries will become trending in the future. In principle, our system is built on two learners. The first learns the dynamics of time series for queries. The second, our decision maker, learns a binary classifier that determines whether queries become trending. Our framework is extremely efficient to build, taking advantage of a grid architecture that allows it to deal with a large volume of data. In addition, it is flexible enough to continuously adapt as trending patterns evolve. The experimental results show that our approach achieves high accuracy (over 77.5% true positive rate) and yet detects trends much earlier (on average 29 hours in advance) than the baseline system.
Users' behavioral prediction for phishing detection BIBAFull-Text 337-338
  Lung-Hao Lee; Kuei-Ching Lee; Yen-Cheng Juan; Hsin-Hsi Chen; Yuen-Hsien Tseng
This study explores users' web browsing behaviors when confronting phishing situations for context-aware phishing detection. We extract discriminative features of each clicked URL, i.e., domain name, bag-of-words, generic Top-Level Domains, IP address, and port number, to develop a linear chain CRF model for users' behavioral prediction. Large-scale experiments show that our method achieves promising performance for predicting the phishing threats of users' next accesses. Error analysis indicates that our model results in a favorably low false positive rate. In practice, our solution is complementary to the existing anti-phishing techniques for cost-effectively blocking phishing threats from users' behavioral perspectives.
Finding k-highest betweenness centrality vertices in graphs BIBAFull-Text 339-340
  Min-Joong Lee; Chin-Wan Chung
Betweenness centrality is a measure of the relative participation of a vertex in the shortest paths of a graph. In many cases, we are interested only in the k-highest betweenness centrality vertices rather than all the vertices in a graph. In this paper, we study an efficient algorithm for finding the exact k-highest betweenness centrality vertices.
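For orientation, the problem in this abstract can be stated as a baseline sketch: compute exact betweenness for every vertex with Brandes' algorithm, then keep the k largest scores. This is the exhaustive approach that a specialized top-k method would aim to improve on, not the algorithm studied in the paper. The function names and the adjacency-list input format are assumptions made for illustration.

```python
from collections import deque
import heapq


def betweenness(adj):
    """Exact betweenness (Brandes) for an unweighted, undirected graph.

    adj: {vertex: [neighbors]} with every edge listed in both directions.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2.0 for v, score in bc.items()}


def top_k_betweenness(adj, k):
    """Return the k (vertex, score) pairs with the highest betweenness."""
    scores = betweenness(adj)
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical star graph: the hub lies on every leaf-to-leaf shortest path.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
top_one = top_k_betweenness(star, 1)
```

The baseline costs O(VE) time on unweighted graphs regardless of k, which is why an algorithm that avoids scoring every vertex exactly is of interest.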
Towards online review spam detection BIBAFull-Text 341-342
  Yuming Lin; Tao Zhu; Xiaoling Wang; Jingwei Zhang; Aoying Zhou
User reviews play a crucial role on the Web, since many decisions are made based on them. However, review spam misleads users, which is extremely obnoxious. In this poster, we explore the problem of online review spam detection. Firstly, we devise six features to find the spam based on the review content and reviewer behaviors. Secondly, we apply supervised methods and an unsupervised one for spotting the review spam as early as possible. Finally, we carry out intensive experiments on a real-world review set to verify the proposed methods.
Efficient RDF stream reasoning with graphics processing units (GPUs) BIBAFull-Text 343-344
  Chang Liu; Jacopo Urbani; Guilin Qi
In this paper, we study the problem of stream reasoning and propose a reasoning approach over large amounts of RDF data, which uses graphics processing units (GPUs) to improve performance. First, we show how the problem of stream reasoning can be reduced to a temporal reasoning problem. Then, we describe a number of algorithms to perform stream reasoning with GPUs.
Automatic keywords generation for contextual advertising BIBAFull-Text 345-346
  Pengqi Liu; Javad Azimi; Ruofei Zhang
Contextual Advertising (CA) is an important area in the industry of online advertising. Typically, CA algorithms return a set of related ads based on some keywords extracted from the content of webpages. Therefore, extracting the best set of representative keywords from a given webpage is the key to the success of CA. In this paper, we introduce a new keyword generation approach that uses some novel NLP features including POS and named-entity tagging. Unlike most of the existing keyword extraction algorithms, our proposed framework is able to generate some related keywords which do not exist in the webpage. A monetization parameter, predicted from historical search keyword performance, is also used to rank potential keywords in order to balance the RPM (Revenue Per 1000 Matches) and relevance. Experimental results over a very large real-world data set show that the proposed approach can outperform the state-of-the-art system in both relevance and monetization metrics.
Detecting trending topics using page visitation statistics BIBAFull-Text 347-348
  Sayandev Mukherjee; Ronald Sujithan; Pero Subasic
Many applications including realtime recommenders and ad-targeting systems have a need to identify trending concepts to prioritize the information presented to end-users. In this paper, we describe a novel approach that identifies trending concepts using the hourly Wikipedia page visitation statistics freely available for download. We describe a MapReduce framework that analyzes the raw hourly visitation logs and generates a ranked list of trending concepts on a daily basis. We validate this approach by extracting hourly lists of trending news articles, mapping these articles to Wikipedia concepts, and computing the similarity of the two lists according to several commonly used measures.
Unsupervised approach for shallow domain ontology construction from corpus BIBAFull-Text 349-350
  Subhabrata Mukherjee; Jitendra Ajmera; Sachindra Joshi
In this work, we propose an unsupervised approach to construct a domain-specific ontology from a corpus. It is essential for Information Retrieval systems to identify important domain concepts and the relationships between them. We identify important domain terms, of which multi-words form an important component. Our approach identifies 40% of the domain terms, compared to 22% identified by WordNet, on manually annotated smartphone data. We propose an approach to construct a shallow ontology from the discovered domain terms by identifying four domain relations, namely Synonyms ('similar-to'), Type-Of ('is-a'), Action-On ('methods') and Feature-Of ('attributes'), where we achieve F-Scores of 49.14%, 65.5%, 65% and 80% respectively.
SepaRating: an approach to reputation computation based on rating separation in e-marketplace BIBAFull-Text 351-352
  Hyun-Kyo Oh; Yoohan Noh; Sang-Wook Kim; Sunju Park
This paper proposes SepaRating, a novel mechanism that separates a buyer's rating on a transaction into two kinds of scores: seller's score and item's score. SepaRating provides the reputation of sellers correctly based on the seller's score by repetitive separations, which helps potential buyers to find more reliable sellers. We verify the effectiveness of SepaRating via a series of experiments.
Semantic annotation for dynamic web environment BIBAFull-Text 353-354
  Jeong-Hoon Park; Chin-Wan Chung
The semantic Web is a promising future Web environment. In order to realize the semantic Web, the semantic annotation should be widely available. The studies for generating the semantic annotation do not provide a solution to the 'document evolution' requirement which is to maintain the consistency between semantic annotations and Web pages. In this paper, we propose an efficient solution to the requirement, that is to separately generate the long-term annotation and the short-term annotation. The experimental results show that our approach outperforms an existing approach which is the most efficient among the automatic approaches based on static Web pages.
Learning joint representation for community question answering with tri-modal DBM BIBAFull-Text 355-356
  Baolin Peng; Wenge Rong; Yuanxin Ouyang; Chao Li; Zhang Xiong
One of the main research tasks in community question answering (CQA) is to find the most relevant questions for a given new query, thereby providing useful knowledge for the users. Traditionally used methods such as bag-of-words or latent semantic models consider queries, questions and answers in the same feature space. However, the correlations among queries, questions and answers imply that they lie in different feature spaces. In light of these issues, we propose a tri-modal deep Boltzmann machine (tri-DBM) to extract a unified representation for query, question and answer. Experiments on the Yahoo! Answers dataset reveal that using these unified representations to train a classifier that judges the semantic matching level between query and question significantly outperforms models using bag-of-words or LSA representations.
Efficient CPU-GPU work sharing for data-parallel JavaScript workloads BIBAFull-Text 357-358
  Xianglan Piao; Channoh Kim; Younghwan Oh; Hanjun Kim; Jae W. Lee
Modern web browsers are required to execute many complex, compute-intensive applications, mostly written in JavaScript. With widespread adoption of heterogeneous processors, recent JavaScript-based data-parallel programming models, such as River Trail and WebCL, support multiple types of processing elements including CPUs and GPUs. However, significant performance gains are still left on the table since the program kernel runs on only one compute device, typically selected at kernel invocation. This paper proposes a new framework for efficient work sharing between CPU and GPU for data-parallel JavaScript workloads. The work sharing scheduler partitions the input data into smaller chunks and dynamically dispatches them to both CPU and GPU for concurrent execution. For four data-parallel programs, our framework improves performance by up to 65% with a geometric mean speedup of 33% over GPU-only execution.
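The chunked, dynamic dispatch described in this abstract can be sketched as follows. This is only an illustration of the scheduling idea, not the paper's framework: two Python threads stand in for the CPU and GPU devices, and `work_share`, `chunk_size`, and the worker names are hypothetical.

```python
import queue
import threading


def work_share(data, kernel, chunk_size=4, workers=("cpu", "gpu")):
    """Split data into chunks and let each worker pull the next free chunk.

    The workers are plain threads standing in for compute devices (an
    assumption for illustration); a faster worker simply takes more chunks.
    """
    chunks = queue.Queue()
    for start in range(0, len(data), chunk_size):
        chunks.put(start)

    out = [None] * len(data)
    handled = {name: 0 for name in workers}

    def run(name):
        while True:
            try:
                start = chunks.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            end = min(start + chunk_size, len(data))
            for i in range(start, end):
                out[i] = kernel(data[i])  # chunks cover disjoint index ranges
            handled[name] += 1  # each worker updates only its own counter

    threads = [threading.Thread(target=run, args=(name,)) for name in workers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return out, handled


# Hypothetical usage: square ten numbers with two cooperating workers.
squares, chunk_counts = work_share(list(range(10)), lambda x: x * x)
```

Pulling chunks from a shared queue, rather than fixing a split up front, lets the division of work follow whichever device happens to be faster on a given input.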
User profiles based on revisitation times BIBAFull-Text 359-360
  Philipp Pushnyakov; Gleb Gusev
Our work is devoted to the Web revisitation patterns of individual users. Everybody revisits Web pages, but their reasons for doing so can differ. We analyzed the Web interaction logs of millions of users to characterize how people revisit Web content. We found that each user has their own distribution of revisitation times. This distribution follows a power law with an exponent that captures the user's specific peculiarities.
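As an illustration of how a per-user power-law exponent could be estimated from revisitation times, the sketch below uses the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)). The abstract does not say which fitting procedure the authors used, so the estimator, the function name, and the `x_min` cutoff are assumptions.

```python
import math


def power_law_exponent(samples, x_min):
    """Maximum-likelihood exponent of a continuous power law p(x) ~ x^(-alpha).

    samples: observed revisitation times (any positive unit).
    x_min: assumed lower cutoff; samples below it are ignored.
    """
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one sample strictly above x_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical data: three revisits, each e time units after the last visit.
alpha = power_law_exponent([math.e, math.e, math.e], x_min=1.0)
```

Fitting one exponent per user yields a single number that can serve as a profile feature, in the spirit of the user profiles named in the title.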
Combining geographical information of users and content of items for accurate rating prediction BIBAFull-Text 361-362
  Zhi Qiao; Peng Zhang; Jing He; Yanan Cao; Chuan Zhou; Li Guo
Recommender systems have attracted attention lately due to their wide and successful applications in online advertising. In this paper, we propose a Bayesian generative model to describe the generative process of rating, which combines geographical information of users and content of items. The generative model consists of two interacting LDA models, where one LDA model is for location-based user groups (user dimension) and the other is for the topics of the content of items (item dimension). A Gibbs sampling algorithm is proposed for parameter estimation. Experiments show that our proposed method outperforms baseline methods.
RDF-X: a language for sanitizing RDF graphs BIBAFull-Text 363-364
  Jyothsna Rachapalli; Vaibhav Khadilkar; Murat Kantarcioglu; Bhavani Thuraisingham
With the advent of the Semantic Web and the Resource Description Framework (RDF), the web is likely to witness an unprecedented wealth of knowledge, resulting from seamless integration of various data sources. Data integration is one of the key features of RDF; however, the absence of secure means for managing sensitive RDF data may prevent sharing of critical data altogether or may cause serious damage. To this end, we present a language for sanitizing RDF graphs, which comprises a set of sanitization operations that transform a graph by concealing the sensitive data. These operations are modeled into a new SPARQL query form known as SANITIZE, which can also be leveraged for fine-grained access control and for building advanced anonymization features.
Fast maximum clique algorithms for large graphs BIBAFull-Text 365-366
  Ryan A. Rossi; David F. Gleich; Assefaw H. Gebremedhin; Md. Mostofa Ali Patwary
We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. Despite clique's status as an NP-hard problem with poor approximation guarantees, our method exhibits nearly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Key to the efficiency of our algorithm are an initial heuristic procedure that finds a large clique quickly and a parallelized branch and bound strategy with aggressive pruning and ordering techniques. We use the algorithm to compute the largest temporal strong components of temporal contact networks.
Implicit feature detection for sentiment analysis BIBAFull-Text 367-368
  Kim Schouten; Flavius Frasincar
Implicit feature detection is a promising research direction that has not yet received much attention. Building on previous work, in which co-occurrences between notional words and explicit features are used to find implicit features, this research critically reviews the underlying assumptions of that work and proposes a revised algorithm that directly uses the co-occurrences between implicit features and notional words. The revision is shown to perform better than the original method, but both methods are shown to fail in a more realistic scenario.
Summarizing social image search results BIBAFull-Text 369-370
  Boon-Siew Seah; Sourav S. Bhowmick; Aixin Sun
Most existing social image search engines present search results as a ranked list of images, which cannot be consumed by users in a natural and intuitive manner. Here, we present a novel algorithm that exploits both visual features and tags of the search results to generate a high-quality image search result summary. The summary not only breaks the results into visually and semantically coherent clusters, but also maximizes the coverage of the original search results. We demonstrate the effectiveness of our method against state-of-the-art image summarization and clustering algorithms.
TOMOHA: TOpic model-based HAshtag recommendation on Twitter BIBAFull-Text 371-372
  Jieying She; Lei Chen
On Twitter, hashtags are used to summarize topics of the tweet content and to help categorize and search tweets. However, hashtags are created in a free style and are thus heterogeneous, which increases the difficulty of their usage. We propose TOMOHA, a supervised TOpic MOdel-based solution for HAshtag recommendation on Twitter. We treat hashtags as labels of topics, and develop a supervised topic model to discover the relationships among words, hashtags and topics of tweets. As a novel addition, we also incorporate user following relationships into the model. We infer the probability that a hashtag will be contained in a new tweet, and recommend the k most probable ones. We propose parallel computing and pruning techniques to speed up the model training and recommendation process. Experiments show that our method can properly and efficiently recommend hashtags.
Learning semantic representations using convolutional neural networks for web search BIBAFull-Text 373-374
  Yelong Shen; Xiaodong He; Jianfeng Gao; Li Deng; Grégoire Mesnil
This paper presents a series of new latent semantic models based on a convolutional neural network (CNN) to learn low-dimensional semantic vectors for search queries and Web documents. By using the convolution-max pooling operation, local contextual information at the word n-gram level is modeled first. Then, salient local features in a word sequence are combined to form a global feature vector. Finally, the high-level semantic information of the word sequence is extracted to form a global vector representation. The proposed models are trained on clickthrough data by maximizing the conditional likelihood of clicked documents given a query, using stochastic gradient ascent. The new models are evaluated on a Web document ranking task using a large-scale, real-world data set. Results show that our model significantly outperforms other semantic models, which were state-of-the-art in retrieval performance prior to this work.
Defending against user identity linkage attack across multiple online social networks BIBAFull-Text 375-376
  Yilin Shen; Fengjiao Wang; Hongxia Jin
We study the first countermeasure against user identity linkage attacks across multiple online social networks (OSNs). Our goal is to keep as much of a user's information public as possible while preventing their identities from being linked across different OSNs, via k-anonymity. We develop a novel greedy algorithm, incorporating an efficient way to compute the greedy function, and validate it in terms of both solution quality and robustness using real-world datasets.
Beyond modeling private actions: predicting social shares BIBAFull-Text 377-378
  Si Si; Atish Das Sarma; Elizabeth F. Churchill; Neel Sundaresan
We study the problem of predicting sharing behavior from e-commerce sites to friends on social networks via share widgets. The contextual variation in an action that is private (like rating a movie on Netflix), to one shared with friends online (like sharing an item on Facebook), to one that is completely public (like commenting on a YouTube video) introduces behavioral differences that pose interesting challenges. In this paper, we show that users' interests manifest in actions that spill across different types of channels such as sharing, browsing, and purchasing. This motivates leveraging all such signals available from the e-commerce platform. We show that carefully incorporating signals from these interactions significantly improves share prediction accuracy.
Face recognition CAPTCHA made difficult BIBAFull-Text 379-380
  Terence Sim; Hossein Nejati; James Chua
A CAPTCHA is a Turing test to distinguish human users from automated scripts to defend against internet adversarial attacks. As text-based CAPTCHAs (TBC) have become increasingly difficult to solve, image-based CAPTCHAs, and particularly face recognition CAPTCHAs (FRC), offer a chance to overcome TBC limitations. In this paper, we systematically design and implement a practical FRC, informed by psychological findings. We use gray-scale and binary images, which are computationally inexpensive to generate and deploy. Furthermore, our FRC complies with CAPTCHA design guidelines, thereby ensuring its robustness.
Searching for design examples with crowdsourcing BIBAFull-Text 381-382
  Nikita Spirin; Motahhare Eslami; Jie Ding; Pooja Jain; Brian Bailey; Karrie Karahalios
Examples are very important in design, but existing tools for design example search still do not cover many cases. For instance, long tail queries containing subtle and subjective design concepts, like "calm and quiet", "elegant", "dark background with a hint of color to make it less boring", are poorly supported. This is mainly due to the inherent complexity of the task, which so far has been tackled only algorithmically using general image search techniques. We propose a powerful new approach based on crowdsourcing, which complements existing algorithmic approaches and addresses their shortcomings. Out of many explored crowdsourcing configurations we found that (1) a design need should be represented via several query images and (2) AMT crowd workers should assess a query-specific relevance of a candidate example from a pre-built design collection. To test the utility of our approach, we compared it with Google Images in a query-by-example mode. Based on feedback from expert designers, the crowd selects more relevant design examples.
Semantic search engine with an intuitive user interface BIBAFull-Text 383-384
  Adam Styperek; Michal Ciesielczyk; Andrzej Szwabe
It is crucial to enable regular users to explore RDF-compliant databases in an effective way, regardless of their knowledge of SPARQL or the underlying ontology. Natural language querying has been proposed to address this issue. However, it has unavoidably lower accuracy compared to systems with graph-based querying interfaces, which, in turn, are usually still too difficult for regular users. This paper presents a search system with a user interface that is more user-friendly than widely known graph-based solutions.
Translation method of contextual information into textual space of advertisements BIBAFull-Text 385-386
  Yukihiro Tagami; Toru Hotta; Yusuke Tanaka; Shingo Ono; Koji Tsukamoto; Akira Tajima
A key problem in contextual advertising is how to select the ads that are relevant to the page content and/or the user information. We introduce a translation method that learns a mapping from contextual information to the textual features of ads by using past click data. This method is easy to implement, and there is no need to modify an ordinary ad retrieval system because the contextual feature vector is simply transformed into a term vector with the learned matrix. We applied our approach to a real ad serving system and compared the online performance in A/B testing.
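The transformation step mentioned in this abstract, turning a contextual feature vector into a term vector with a learned matrix, amounts to a vector-matrix product. The sketch below shows only that step with made-up numbers; how the matrix is learned from click data is not covered, and the function name and values are hypothetical.

```python
def translate_context(context_vector, matrix):
    """Map a contextual feature vector into the ad term space.

    context_vector: one weight per contextual feature.
    matrix: matrix[i][j] is the assumed learned weight from contextual
            feature i to ad term j (rows: features, columns: terms).
    """
    if len(context_vector) != len(matrix):
        raise ValueError("one matrix row is required per contextual feature")
    n_terms = len(matrix[0]) if matrix else 0
    return [
        sum(context_vector[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_terms)
    ]


# Hypothetical numbers: three contextual features mapped onto two ad terms.
term_vector = translate_context(
    [1.0, 0.0, 2.0],
    [[0.5, 0.0], [0.0, 1.0], [0.25, 0.5]],
)
```

Because the output is an ordinary term vector, it can be passed as a query to an existing term-based ad retrieval system without changing that system, which is the practical advantage the abstract points out.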
Who will trade with whom?: predicting buyer-seller interactions in online trading platforms through social networks BIBAFull-Text 387-388
  Christoph Trattner; Denis Parra; Lukas Eberhard; Xidao Wen
In this paper we present the latest results of a recently started project that aims at studying the extent to which links between buyers and sellers, i.e. trading interactions in online trading platforms, can be predicted from external knowledge sources such as online social networks. To that end, we conducted a large-scale experiment on data obtained from the virtual world Second Life. As our results reveal, online social network data bears a significant potential (28% over the baseline) to predict links between buyers and sellers in online trading platforms.
Towards awareness and control in choreographed user interface mashups BIBAFull-Text 389-390
  Alexey Tschudnowsky; Stefan Pietschmann; Matthias Niederhausen; Martin Gaedke
Recent research in the field of user interface (UI) mashups has focused on so-called choreographed compositions, where communication between components is not pre-defined by a mashup designer, but rather emerges from the components' messaging capabilities. Though the mashup development process gets simplified, such solutions bear several problems related to awareness and control of the emerging message flow. This paper presents an approach to systematically extend choreographed mashups with visualization and tailoring facilities. A first user study demonstrates that usability of the resulting solutions increases if proposed awareness and control facilities are integrated.
Ontology population from web product information BIBAFull-Text 391-392
  Damir Vandic; Lennart J. Nederstigt; Steven S. Aanen; Flavius Frasincar; Frederik Hogenboom
With the vast amount of information available on the Web, there is an increasing need to structure Web data in order to make it accessible to both users and machines. E-commerce is one of the areas in which growing data congestion on the Web has serious consequences. This paper proposes a framework that is capable of populating a product ontology using tabular product information from Web shops. By formalizing product information in this way, better product comparison or recommendation applications could be built. Our approach employs both lexical and syntactic matching for mapping properties and instantiating values. The performed evaluation shows that instantiating consumer electronics from Best Buy and Newegg.com results in an F1 score of approximately 77%.
Construction of tag ontological graphs by locally minimizing weighted average hops BIBAFull-Text 393-394
  Chetan Kumar Verma; Vijay Mahadevan; Nikhil Rasiwasia; Gaurav Aggarwal; Ravi Kant; Alejandro Jaimes; Sujit Dey
We present a data-driven approach for the construction of ontological graphs on a set of image tags obtained from an annotated image corpus. We treat each tag as a node in a graph, and starting with a preliminary graph obtained using WordNet, we propose the graph construction as a refinement of the preliminary graph using corpus statistics. Towards this, we formulate an optimization problem which is solved using a local search based approach. To evaluate the constructed ontological graphs, we propose a novel task which involves associating test images with tags while observing a partial set of associated tags.
Answering provenance-aware regular path queries on RDF graphs using an automata-based algorithm BIBAFull-Text 395-396
  Xin Wang; Jun Ling; Junhu Wang; Kewen Wang; Zhiyong Feng
This paper presents an automata-based algorithm for answering provenance-aware regular path queries (RPQs) over RDF graphs on the Semantic Web. Provenance-aware RPQs can explain why pairs of nodes in the classical semantics appear in the result of an RPQ. We implement a parallel version of the automata-based algorithm using the Pregel framework Giraph to efficiently evaluate provenance-aware RPQs on large RDF graphs. The experimental results show that our algorithms are effective and efficient in answering provenance-aware RPQs on large real-world RDF graphs.
3DOC: 3D object CAPTCHA BIBAFull-Text 397-398
  Simon S. Woo; Beomjun Kim; Woochan Jun; Jingul Kim
Current 2D CAPTCHA mechanisms can be easily defeated by automated character recognition and segmentation attacks. Recently, 3D CAPTCHA schemes have been proposed for a few websites to overcome the weaknesses of 2D CAPTCHA. However, researchers have also demonstrated offline pre-processing techniques that break 3D CAPTCHA. In this work, we propose a novel 3D object based CAPTCHA scheme that projects the CAPTCHA image over a 3D object. We develop a prototype and present a proof of concept of the 3D object based CAPTCHA scheme to protect websites against automated attacks.
Improving query suggestion through noise filtering and query length prediction BIBAFull-Text 399-400
  Liang Wu; Bin Cao; Yuanchun Zhou; Jianhui Li
Clustering-based methods are commonly used in Web search engines for query suggestion. Clustering is useful in reducing the sparseness of data. However, it also introduces noise and ignores the sequential information of query refinements in search sessions. In this paper, we propose to improve cluster based query suggestion from two perspectives: filtering out unrelated query candidates and predicting the refinement direction. We observe two major refinement behaviors. One is to simplify the original query and the other is to specify it. Both could be modeled by predicting the length (number of terms) of queries when candidates are being ranked. Two experimental results on the real query logs of a commercial search engine demonstrate the effectiveness of the proposed approaches.
Detecting in-situ identity fraud on social network services: a case study on Facebook BIBAFull-Text 401-402
  Shan-Hung Wu; Man-Ju Chou; Chun-Hsiung Tseng; Yuh-Jye Lee; Kuan-Ta Chen
In this paper, we propose to use a continuous authentication approach to detect in-situ identity fraud incidents, which occur when attackers use the same devices and IP addresses as the victims. Using Facebook as a case study, we show that it is possible to detect such incidents by analyzing SNS users' browsing behavior. Our experimental results demonstrate that the approach can achieve reasonable accuracy given a few minutes of observation time.
Multi-category item recommendation using neighborhood associations in trust networks BIBAFull-Text 403-404
  Feng Xia; Haifeng Liu; Nana Yaw Asabere; Wei Wang; Zhuo Yang
This paper proposes a novel recommendation method called RecDI. In the multi-category item recommendation domain, RecDI is designed to combine user ratings with information involving user's direct and indirect neighborhood associations. Through relevant benchmarking experiments on two real-world datasets, we show that RecDI achieves better performance than other traditional recommendation methods, which demonstrates the effectiveness of RecDI.
Optimizing the most specific concept method for efficient instance checking BIBAFull-Text 405-406
  Jia Xu; Patrick Shironoshita; Ubbo Visser; Nigel John; Mansur Kabuka
Instance checking is considered a central tool for data retrieval from description logic (DL) ontologies. In this paper, we propose a revised most specific concept (MSC) method for DL SHI, which converts instance checking into subsumption problems. This revised method can generate small concepts that are specific enough to answer a given query, and allows reasoning to explore only a subset of the ABox data to achieve efficiency. Experiments show the effectiveness of our proposed method in terms of concept size reduction and improvement in reasoning efficiency.
Tag propagation based recommendation across diverse social media BIBAFull-Text 407-408
  Deqing Yang; Yanghua Xiao; Yangqiu Song; Junjun Zhang; Kezun Zhang; Wei Wang
Many real applications demand accurate cross-domain recommendation, e.g., recommending products from an e-commerce Web site to a user of Weibo (the largest Chinese Twitter). Since many social media have rich tags on both items and users, tag-based profiling has become popular for recommendation. However, most previous recommendation approaches have low effectiveness in handling sparse data or matching tags from different social media. Addressing these problems, we first propose an optimized local tag propagation algorithm to generate tags for profiling Weibo users and then use a Chinese knowledge graph accompanied by an improved ESA (explicit semantic analysis) for semantic matching of cross-domain tags. Empirical comparisons to the state-of-the-art approaches justify the efficiency and effectiveness of our approaches.
SoRank: incorporating social information into learning to rank models for recommendation BIBAFull-Text 409-410
  Weilong Yao; Jing He; Guangyan Huang; Yanchun Zhang
Most existing learning to rank based recommendation methods only use user-item preferences to rank items, while neglecting social relations among users. In this paper, we propose a novel, effective and efficient model, SoRank, which integrates social information among users into a listwise ranking model to improve the quality of the ranked list of items. In addition, with complexity linear in the number of observed ratings, SoRank is able to scale to very large datasets. Experimental results on a publicly available dataset demonstrate the effectiveness of SoRank.
Election trolling: analyzing sentiment in tweets during Pakistan elections 2013 BIBAFull-Text 411-412
  Arjumand Younus; M. Atif Qureshi; Muhammad Saeed; Nasir Touheed; Colm O'Riordan; Gabriella Pasi
The use of Twitter as a discussion platform for political issues has led researchers to study its role in predicting election outcomes. This work studies a much neglected aspect of politics on Twitter, namely "election trolling", whereby supporters of different political parties attack each other during election campaigns. We also propose a novel strategy to detect terms that are usually not associated with sentiment but are introduced by supporters of political parties to attack the opposing party. We demonstrate a lack of political maturity, as evidenced by a high percentage of political attacks, in a developing region such as Pakistan.
Topic-STG: extending the session-based temporal graph approach for personalized tweet recommendation BIBAFull-Text 413-414
  Jianjun Yu; Yi Shen; Zhenglu Yang
Micro-blogging is experiencing fantastic success worldwide. However, during its rapid development, it has encountered the problem of information overload, which has troubled many users. In this paper, we mainly focus on the task of tweet recommendation to address this problem. We extend the session-based temporal graph (STG) approach as Topic-STG for tweet recommendation, which comprehensively considers three types of features in Twitter: the textual information, the time factor, and the users' behavior. Experiments conducted on a real dataset demonstrate the effectiveness of our approach.
Evolutionary analysis on online social networks using a social evolutionary game BIBAFull-Text 415-416
  Jianye Yu; Yuanzhuo Wang; Xiaolong Jin; Jingyuan Li; Xueqi Cheng
In this paper, we propose a social evolutionary game to investigate the evolution of social networks. Through comparison between simulation and empirical analysis on the social networks of Twitter and Sina Weibo, we validate the effectiveness of the proposed model and estimate the evolutionary phases of the two networks. We find that the users of Sina Weibo can withstand comparatively higher costs than the users of Twitter. Therefore, they can perform more positive behavior and care more about their reputation than Twitter users. Moreover, Sina Weibo takes longer than Twitter to evolve to a stable state.
App mining: finding the real value of mobile applications BIBAFull-Text 417-418
  Peng Yu; Ching-man Au Yeung
In this poster, we present a new model for estimating the actual value of mobile apps to their users. The model assumes that users are implicitly evaluating the value of the apps on their smartphones when they choose to uninstall some apps. Our proposed method thus makes use of the install and uninstall log in a mobile app store to estimate the value of the apps. Experiments using data from a popular mobile app store show that our model is better at predicting the future download trend of the apps as well as their future uninstallation rate. We believe such a model will be very useful in generating more credible and appropriate mobile app recommendations to users, or in generating features for machine learning systems in more complex prediction tasks.
Query augmentation based intent matching in retail vertical ads BIBAFull-Text 419-420
  Huasha Zhao; Vivian Zhang; Ye Chen; John Canny; Tak Yan
Search advertising shows trends of vertical extension. Vertical ads typically offer better Return on Investment (ROI) to advertisers as a result of better user engagement. However, campaigns and bids in vertical ads are not set at the keyword level. As a result, the matching between user queries and ads suffers from a low recall rate, and the match quality is heavily impacted by tail queries. In this paper, we propose a retail ads retrieval framework based on query rewriting using personal history data to improve the ads recall rate. To ensure ad quality, we also present a relevance model for matching rewritten queries with user search intent, with a particular focus on rare queries. Extensive experiments are carried out on large-scale logs collected from the Bing search engine, and results show our system achieves significant gains in ads retrieval rate without compromising ads quality. To our knowledge, this work is the first attempt to leverage user behavioral data in ad matching and apply it to the vertical ads domain.
An upper bound based greedy algorithm for mining top-k influential nodes in social networks BIBAFull-Text 421-422
  Chuan Zhou; Peng Zhang; Jing Guo; Li Guo
Influence maximization [4] is NP-hard under the Linear Threshold (LT) model, for which a line of greedy algorithms has been proposed. The simple greedy algorithm [4] guarantees an approximation ratio of 1-1/e relative to the optimal solution; the advanced greedy algorithm, e.g., the CELF algorithm [6], runs 700 times faster by exploiting the submodular property of the spread function. However, both algorithms lack efficiency due to heavy Monte-Carlo simulations when estimating the spread function. To this end, in this paper we derive an upper bound for the spread function under the LT model. Furthermore, we propose an efficient UBLF algorithm by incorporating the bound into CELF. Experimental results demonstrate that UBLF, compared with CELF, reduces the number of Monte-Carlo simulations by about 98.9% and achieves a speedup of at least 5 times when the size of the seed set is small.
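The lazy re-evaluation idea behind CELF-style greedy selection can be sketched as follows. This is an illustration under stated assumptions, not the paper's UBLF algorithm: the spread function is replaced by a deterministic set-coverage count (the hypothetical `coverage` function) instead of Monte-Carlo simulation under the LT model, and no upper bound is derived. The graph data and all names are made up.

```python
import heapq

def coverage(seed_set, reach):
    """Stand-in spread function: number of distinct nodes reached by the seeds."""
    covered = set()
    for s in seed_set:
        covered |= reach[s]
    return len(covered)

def lazy_greedy(reach, k):
    """Pick k seeds greedily, re-evaluating marginal gains lazily (CELF-style)."""
    selected = []
    current = 0
    # Max-heap via negated gains; each entry stores the round its gain was computed in.
    heap = [(-coverage([v], reach), v, 0) for v in sorted(reach)]
    heapq.heapify(heap)
    evaluations = len(heap)
    while len(selected) < k and heap:
        neg_gain, v, round_computed = heapq.heappop(heap)
        if round_computed == len(selected):
            # Gain is up to date; by submodularity no stale entry can beat it.
            selected.append(v)
            current += -neg_gain
        else:
            # Stale gain: recompute against the current seed set and push back.
            gain = coverage(selected + [v], reach) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, v, len(selected)))
    return selected, current, evaluations

# Made-up example: reach[v] is the set of nodes that v would activate.
reach = {
    "a": {"a", "b", "c", "d"},
    "b": {"b", "c"},
    "c": {"c", "e", "f"},
    "d": {"d"},
    "e": {"e", "g"},
}
seeds, spread, evaluations = lazy_greedy(reach, 2)
```

On this toy input the lazy variant evaluates the spread function 7 times, whereas a plain greedy pass would need 9 (5 in the first round, 4 in the second). The saving comes from skipping nodes whose stale gain is already below the best fresh gain.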
Maximizing the long-term integral influence in social networks under the voter model BIBAFull-Text 423-424
  Chuan Zhou; Peng Zhang; Wenyu Zang; Li Guo
We address the problem of discovering the influential nodes in social networks under the voter model, which allows multiple activations of the same node, by defining a long-term integral influence maximization problem. We analyze the problem formulation and present an exact solution to the maximization problem. We also provide a sufficient condition for the convergence of the integral influence. We experimentally compare the exact solution with other heuristic algorithms in terms of quality and efficiency.

WWW 2014 websci track

Graph structure in the web -- revisited: a trick of the heavy tail BIBAFull-Text 427-432
  Robert Meusel; Sebastiano Vigna; Oliver Lehmberg; Christian Bizer
Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we describe and analyse a large, publicly accessible crawl of the web that was gathered by the Common Crawl Foundation in 2012 and that contains over 3.5 billion web pages and 128.7 billion links. This crawl makes it possible to observe the evolution of the underlying structure of the World Wide Web within the last 10 years: we analyse and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components.
   Our analysis shows that, as evidenced by previous research, some of the features previously observed by Broder et al. are very dependent on artefacts of the crawling process, whereas others appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers, very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the "bow-tie structure" is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the web.
   More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree and sizes of strongly connected components are not power laws, contrary to what was previously reported for much smaller crawls, although they might be heavy-tailed. We also provide for the first time accurate measurement of distance-based features, using recently introduced algorithms that scale to the size of our crawl.
The semantic evolution of online communities BIBAFull-Text 433-438
  Matthew Rowe; Markus Strohmaier
Despite their semantic-rich nature, online communities have, to date, largely been analysed through examining longitudinal changes in social networks, community uptake, or simple term-usage and language adoption. As a result, the evolution of communities on a semantic level, i.e. how concepts emerge, and how these concepts relate to previously discussed concepts, has largely been ignored. In this paper we present a graph-based exploration of the semantic evolution of online communities, thereby capturing dynamics of online communities on a conceptual level. We first examine how semantic graphs (concept graphs and entity graphs) of communities evolve, and then characterise such evolution using logistic population growth models. We demonstrate the value of such models by analysing how sample communities evolve and use our results to predict churn rates in community forums.
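For reference, the standard logistic population growth model has the form below. The symbols are assumptions for illustration and not the paper's notation: N(t) is the size of a semantic graph (for example, its number of concepts) at time t, N_0 is its initial size, r is the growth rate, and K is the carrying capacity.

```latex
\frac{dN}{dt} = r N \left(1 - \frac{N}{K}\right),
\qquad
N(t) = \frac{K}{1 + \frac{K - N_0}{N_0}\, e^{-rt}}
```

Growth is roughly exponential while N is small relative to K and levels off as N approaches K.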
Inferring international and internal migration patterns from Twitter data BIBAFull-Text 439-444
  Emilio Zagheni; Venkata Rama Kiran Garimella; Ingmar Weber; Bogdan State
Data about migration flows are largely inconsistent across countries, typically outdated, and often nonexistent. Despite the importance of migration as a driver of demographic change, there is limited availability of migration statistics. Generally, researchers rely on census data to indirectly estimate flows. However, little can be inferred for specific years between censuses and for recent trends. The increasing availability of geolocated data from online sources has opened up new opportunities to track recent trends in migration patterns and to improve our understanding of the relationships between internal and international migration. In this paper, we use geolocated data for about 500,000 users of the social network website "Twitter". The data are for users in OECD countries during the period May 2011 to April 2013. We evaluated, for the subsample of users who have posted geolocated tweets regularly, the geographic movements within and between countries for independent periods of four months, respectively. Since Twitter users are not representative of the OECD population, we cannot infer migration rates at a single point in time. However, we propose a difference-in-differences approach to reduce selection bias when we infer trends in out-migration rates for single countries. Our results indicate that our approach is relevant to address two longstanding questions in the migration literature. First, our methods can be used to predict turning points in migration trends, which are particularly relevant for migration forecasting. Second, geolocated Twitter data can substantially improve our understanding of the relationships between internal and international migration. Our analysis relies uniquely on publicly available data that could be potentially available in real time and that could be used to monitor migration trends.
   The Web Science community is well-positioned to address, in future work, a number of methodological and substantive questions that we discuss in this article.
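A difference-in-differences estimate of the kind this abstract refers to can be sketched in a few lines. The abstract does not specify the comparison group or the exact estimator, so the reference group below (a pooled set of other countries), the function names, and all counts are assumptions made up for illustration.

```python
def out_migration_rate(movers, users):
    """Share of observed users who changed country between two periods."""
    return movers / users

def diff_in_diff(country_before, country_after, reference_before, reference_after):
    """Change in a country's rate minus the change in a reference group's rate."""
    return (country_after - country_before) - (reference_after - reference_before)

# Made-up counts: out_migration_rate(movers, users) for each period.
country_t1 = out_migration_rate(20, 1000)
country_t2 = out_migration_rate(35, 1000)
reference_t1 = out_migration_rate(150, 10000)
reference_t2 = out_migration_rate(200, 10000)

# Positive value: the country's rate rose faster than the reference group's.
relative_trend = diff_in_diff(country_t1, country_t2, reference_t1, reference_t2)
```

The subtraction removes any bias that affects both groups equally and stays constant over time, which is why the result can be read as a trend even though the levels themselves are not representative.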
Examining Wikipedia across linguistic and temporal borders BIBAFull-Text 445-450
  Ramine Tinati; Paul Gaskell; Thanassis Tiropanis; Olivier Phillipe; Wendy Hall
The Web has grown to be an integral part of modern society, offering novel ways for humans to communicate, interact, and share information. New collaborative platforms are forming which are providing individuals with new communities and knowledge bases and, at the same time, offering insights into human activity for researchers, policy-makers and engineers. On a global scale, the role of cultural and language barriers when studying such phenomena becomes particularly relevant and presents significant challenges: due to insufficient information, it is often hard to establish the cultural or language groups to which individuals belong, while there are technical difficulties in establishing the relevance of, and in analysing, resources in different languages. This paper presents a framework for addressing those issues by leveraging data on the use of Wikipedia. Resources available in different languages are explicitly correlated in Wikipedia along with time-stamped logs of access to its articles. This paper provides a framework to enable temporal page views in Wikipedia to be associated with specific geographic profiles. This framework is then used to examine the exchange of information between the English speaking and Chinese speaking localities, and we report initial findings on the role of language and culture in diffusion in this context.
Taking Brazil's pulse: tracking growing urban economies from online attention BIBAFull-Text 451-456
  Carmen Vaca Ruiz; Daniele Quercia; Luca Maria Aiello; Piero Fraternali
Urban resources are allocated according to socio-economic indicators, and rapid urbanization in developing countries calls for updating those indicators in a timely fashion. The prohibitive costs of census data collection make that very difficult. To avoid allocating resources based on outdated indicators, one could partly update or complement them using digital data. It has been shown that it is possible to use social media in developed countries (mainly UK and USA) for such a purpose. Here we show that this is the case for Brazil too. We analyze a random sample of a microblogging service popular in that country and accurately predict the GDPs of 45 Brazilian cities. To make these predictions, we exploit the sociological concept of glocality, which says that economically successful cities tend to be involved in interactions that are both local and global at the same time. We indeed show that a city's glocality, measured with social media data, effectively signals the city's economic well-being.
Analysis of local online review systems as digital word-of-mouth BIBAFull-Text 457-462
  Claudia López; Rosta Farzan
Using a large dataset of Yelp's online reviews for local businesses, we investigate how Word-of-Mouth research can inform the design of local online review systems and how these systems' data can extend our understanding of digital WOM in a local context. In this paper, we analyze how visual cues currently present in Yelp map to WOM concepts. We also show that these concepts are highly related to the perceived usefulness of the local reviews, which is aligned with prior WOM literature. Additionally, we found that local expertise, measured at the level of the neighborhood, strongly correlates with the perceived usefulness of reviews. Our findings augment the understanding of local online WOM and have design implications for local review systems.
Long time no see: the probability of reusing tags as a function of frequency and recency BIBAFull-Text 463-468
  Dominik Kowald; Paul Seitlinger; Christoph Trattner; Tobias Ley
In this paper, we introduce a tag recommendation algorithm that mimics the way humans draw on items in their long-term memory. This approach uses the frequency and recency of previous tag assignments to estimate the probability of reusing a particular tag. Using three real-world folksonomies gathered from bookmarks in BibSonomy, CiteULike and Flickr, we show how incorporating a time-dependent component outperforms conventional "most popular tags" approaches and another existing and very effective but less theory-driven, time-dependent recommendation mechanism. By combining our approach with a simple resource-specific frequency analysis, our algorithm outperforms other well-established algorithms, such as FolkRank, Pairwise Interaction Tensor Factorization and Collaborative Filtering. We conclude that our approach provides an accurate and computationally efficient model of a user's temporal tagging behavior. We demonstrate how effective principles of information retrieval can be designed and implemented if human memory processes are taken into account.
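One way to combine frequency and recency into a single score is a power-law decay over past uses, as in the base-level activation equation from the ACT-R theory of human memory. The abstract does not state the formula, so the sketch below is an assumption: the decay exponent `d=0.5`, the function names, and the sample history are all hypothetical.

```python
import math

def activation(timestamps, now, d=0.5):
    """Power-law-decayed usage score: ln(sum over past uses of (now - t)^(-d)).

    Every timestamp must be strictly earlier than `now`.
    """
    return math.log(sum((now - t) ** (-d) for t in timestamps))

def rank_tags(history, now, d=0.5):
    """Rank tags by activation, highest first. `history` maps tag -> list of use times."""
    scores = {tag: activation(times, now, d) for tag, times in history.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up tagging history for one user (times in arbitrary units).
history = {
    "python": [1, 2, 3],   # used often, but long ago
    "web": [96],           # used once, very recently
    "rdf": [50],           # used once, a while ago
}
ranking = rank_tags(history, now=100)
```

In this toy history the single recent use of "web" outweighs three old uses of "python", which a pure "most popular tags" count would rank first.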
User churn in focused question answering sites: characterizations and prediction BIBAFull-Text 469-474
  Jagat Sastry Pudipeddi; Leman Akoglu; Hanghang Tong
Given a user on a Q&A site, how can we tell whether s/he is engaged with the site or is rather likely to leave? What are the most evidential factors that relate to users churning? Question and Answer (Q&A) sites form excellent repositories of collective knowledge. To make these sites self-sustainable and long-lasting, it is crucial to ensure that new users as well as the site veterans who provide most of the answers keep engaged with the site. As such, quantifying the engagement of users and preventing churn in Q&A sites are vital to improve the lifespan of these sites. We study a large data collection from stackoverflow.com to identify significant factors that correlate with newcomer user churn in the early stage and those that relate to veterans leaving in the later stage. We consider the problem under two settings: given (i) the first k posts, or (ii) first T days of activity of a user, we aim to identify evidential features to automatically classify users so as to spot those who are about to leave. We find that in both cases, the time gap between subsequent posts is the most significant indicator of diminishing interest of users, besides other indicative factors like answering speed, reputation of those who answer their questions, and number of answers received by the user.
The web observatory extension: facilitating web science collaboration through semantic markup BIBAFull-Text 475-480
  Dominic Difranzo; John S. Erickson; Marie Joan Kristine T. Gloria; Joanne S. Luciano; Deborah L. McGuinness; James Hendler
The multi-disciplinary nature of Web Science and the large size and diversity of data collected and studied by its practitioners have inspired a new type of Web resource known as the Web Observatory. Web observatories are platforms that enable researchers to collect, analyze and share data about the Web and to share tools for Web research. At the Boston Web Observatory Workshop 2013, a semantic model for describing Web Observatories was drafted and an extension to the schema.org microdata vocabulary collection was proposed. This paper details our implementation of the proposed extension, and how we have applied it to the Web Observatory Portal created by the Tetherless World Constellation at Rensselaer Polytechnic Institute (TWC RPI). We recognize this effort to be the "first step" in the construction, evaluation and validation of the Web observatory model and not the final recommendation. Our hope is that this extension recommendation and our initial implementation spark additional discussion within the Web Science community on whether such a direction enables Web Observatory curators to better expose and explain their individual Web Observatories to others, thereby enabling better collaboration between researchers across the Web Science community.
Modeling patient engagement in peer-to-peer healthcare BIBAFull-Text 481-486
  Elizabeth Sillence; Claire Hardy; Peter R. Harris; Pam Briggs
Patients now turn to other patients online for health information and advice in a phenomenon known as peer-to-peer healthcare. This paper describes a model of patients' peer-to-peer engagement, based upon qualitative studies of three patient or carer groups searching for online information and advice from their health peers. We describe a three-phase process through which patients engage with peer experience (PEx). In phase I (gating) patients determine the suitability and trustworthiness of the material they encounter; in phase II (engagement) they search out information, support and/or advice from others with similar or relevant experience; and in phase III (evaluation) they make judgments about the costs and benefits of engaging with particular websites in the longer term. This model provides a useful framework for understanding web based interactions in different patient groups.
A study of the online profile of enterprise users in professional social networks BIBAFull-Text 487-492
  Alessandro Bozzon; Hariton Efstathiades; Geert-Jan Houben; Robert-Jan Sips
Understanding the impact of corporate information publicly distributed on the Web is becoming more and more crucial. In this paper we report the result of a study that involved 130 IBM employees: we explored the correctness and extent of organisational information that can be observed from the online profiles of a company's employees. Our work contributes new insights to the study of social networks by showing that, even by considering a small fraction of the available online data, it is possible to discover accurate information about an organisation, its structure, and the factors that characterise the social reach of its employees.
Information network or social network?: the structure of the Twitter follow graph BIBAFull-Text 493-498
  Seth A. Myers; Aneesh Sharma; Pankaj Gupta; Jimmy Lin
In this paper, we provide a characterization of the topological features of the Twitter follow graph, analyzing properties such as degree distributions, connected components, shortest path lengths, clustering coefficients, and degree assortativity. For each of these properties, we compare and contrast with available data from other social networks. These analyses provide a set of authoritative statistics that the community can reference. In addition, we use these data to investigate an often-posed question: Is Twitter a social network or an information network? The "follow" relationship in Twitter is primarily about information consumption, yet many follows are built on social ties. Not surprisingly, we find that the Twitter follow graph exhibits structural characteristics of both an information network and a social network. Going beyond descriptive characterizations, we hypothesize that from an individual user's perspective, Twitter starts off more like an information network, but evolves to behave more like a social network. We provide preliminary evidence that may serve as a formal model of how a hybrid network like Twitter evolves.
Mining triadic closure patterns in social networks BIBAFull-Text 499-504
  Hong Huang; Jie Tang; Sen Wu; Lu Liu; Xiaoming Fu
A closed triad is a group of three people who are connected with each other. It is the most basic unit for studying group phenomena in social networks. In this paper, we study how closed triads are formed in dynamic networks. More specifically, given three persons, what are the fundamental factors that trigger the formation of triadic closure? There are various factors that may influence the formation of a relationship between persons. Can we design a unified model to predict the formation of triadic closure? Employing a large microblogging network as the source in our study, we formally define the problem and conduct a systematic investigation. The study uncovers how user demographics and network topology influence the process of triadic closure. We also present a probabilistic graphical model to predict whether three persons will form a closed triad in dynamic networks. The experimental results on the microblogging data demonstrate the efficiency of our proposed model for the prediction of triadic closure formation.
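The definition of a closed triad given in this abstract can be made concrete with a small sketch that lists closed triads and the open triads that are candidates for closure. It treats the network as undirected for simplicity, whereas a microblogging follow network is directed; the function names and the sample edges are hypothetical, and the paper's probabilistic graphical model is not reproduced here.

```python
from itertools import combinations

def adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors

def closed_triads(edges):
    """Return all groups of three mutually connected nodes, as sorted 3-tuples."""
    neighbors = adjacency(edges)
    triads = set()
    for node, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in neighbors[a]:
                triads.add(tuple(sorted((node, a, b))))
    return triads

def open_triads(edges):
    """Return (center, a, b) where a and b both link to center but not to each other."""
    neighbors = adjacency(edges)
    result = set()
    for center, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b not in neighbors[a]:
                result.add((center, a, b))
    return result

# Made-up friendship edges.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
closed = closed_triads(edges)
candidates = open_triads(edges)
```

Each open triad is a candidate for triadic closure: adding the missing edge between `a` and `b` would turn it into a closed triad, which is the event a prediction model would try to anticipate.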
Big smog meets web science: smog disaster analysis based on social media and device data on the web BIBAFull-Text 505-510
  Jiaoyan Chen; Huajun Chen; Guozhou Zheng; Jeff Z. Pan; Honghan Wu; Ningyu Zhang
Nowadays, people are increasingly concerned about smog disaster and the caused health hazard. However, the current methods for big smog analysis are usually based on the traditional lagging data sources or merely adopt physical environment observations, which limit the methods' accuracy and usability. The discipline of Web Science, the research fields of which include web of people and web of devices, provides real time web data as well as novel web data analysis approaches. In this paper, both social web data and device web data are proposed for smog disaster analysis. Firstly, we utilize social web data to define and calculate Individual Public Health Indexes (IPHIs) for smog caused health hazard quantification. Secondly, we integrate social web data and device web data to build standard health hazard rating reference and train smog-health models for health hazard prediction. Finally, we apply the rating reference and models to online and location-sensitive smog disaster monitoring, which can better guide people's behaviour and government's strategy design for disaster mitigation.
Indexing and analyzing Wikipedia's current events portal, the daily news summaries by the crowd BIBAFull-Text 511-516
  Giang Binh Tran; Mohammad Alrifai
Wikipedia's Current Events Portal (WCEP) is a special part of Wikipedia that focuses on daily summaries of news events. The WikiTimes project provides structured access to WCEP by extracting and indexing all its daily news events. In this paper we study this part of Wikipedia and take a closer look into its content and the community behind it. First, we provide descriptive analysis of the collected news events. Second, we compare between the news summaries created by the WCEP crowd and the ones created by professional journalists on the same topics. Finally, we analyze the revision logs of news events over the past 7 years in order to characterize the WCEP crowd and their activities. The results show that WCEP has reached a stable state in terms of the volume of contributions as well as the size of its crowd, which makes it an important source of news summaries for the public and the research community.
Evolution of Reddit: from the front page of the internet to a self-referential community? BIBAFull-Text 517-522
  Philipp Singer; Fabian Flöck; Clemens Meinhart; Elias Zeitfogel; Markus Strohmaier
In the past few years, Reddit -- a community-driven platform for submitting, commenting and rating links and text posts -- has grown exponentially, from a small community of users into one of the largest online communities on the Web. To the best of our knowledge, this work represents the most comprehensive longitudinal study of Reddit's evolution to date, studying both (i) how user submissions have evolved over time and (ii) how the community's allocation of attention and its perception of submissions have changed over 5 years based on an analysis of almost 60 million submissions. Our work reveals an ever-increasing diversification of topics accompanied by a simultaneous concentration towards a few selected domains both in terms of posted submissions as well as perception and attention. By and large, our investigations suggest that Reddit has transformed itself from a dedicated gateway to the Web to an increasingly self-referential community that focuses on and reinforces its own user-generated image- and textual content over external sources.
Songrium: a music browsing assistance service with interactive visualization and exploration of a web of music BIBAFull-Text 523-528
  Masahiro Hamasaki; Masataka Goto; Tomoyasu Nakano
This paper describes a music browsing assistance service, Songrium (http://songrium.jp), which increases user enjoyment when listening to songs and allows visualization and exploration of a "Web of Music". We define a Web of Music in this paper to be a network of "web-native music", which we define in turn to be music that is published, shared, and remixed (has derivative works created) entirely on the web. Songrium was developed as an attempt to realize a Web of Music, by showing relations between both original songs and derivative works and offering an enriched listening experience. Songrium has analyzed over 600,000 music video clips on the most popular Japanese video-sharing service, Niconico, which contains original songs of web-native music and their derivative works such as covers and dance arrangements. Analysis of over 100,000 original songs reveals that over 500,000 derivative works were generated and have contributed to enrich the Web of Music.
Exploring the user-generated content (UGC) uploading behavior on YouTube BIBAFull-Text 529-534
  Jaimie Yejean Park; Jiyeon Jang; Alejandro Jaimes; Chin-Wan Chung; Sung-Hyon Myaeng
YouTube is the world's largest video sharing platform where both professional and non-professional users participate in creating, uploading, and viewing content. In this work, we analyze content in the music category created by the non-professionals, which we refer to as user-generated content (UGC). Non-professional users frequently upload content (UGC) that are parodies, remakes, or covers of the music videos uploaded by professionals, namely the official record labels. Along with the success of official music videos on YouTube, we find the increased participation of users in creating the UGC related to the music videos. In this study, we characterize the UGC uploading behavior in terms of what, where, and when. Furthermore, we measure the relationship between the popularity of the original content and creation of the related UGC. We find that the UGC uploading behavior is different depending on the types of the UGC and across different genres of music videos. We also find that UGC sharing is a highly global activity; popular UGC is created from all over the world despite the fact that the popular music videos originate from a very limited number of locations. Our findings imply that utilizing the information on re-created UGC is important in order to understand and to predict the popularity of the original content.
Hippocampus: answering memory queries using transactive search BIBAFull-Text 535-540
  Michele Catasta; Alberto Tonon; Djellel Eddine Difallah; Gianluca Demartini; Karl Aberer; Philippe Cudre-Mauroux
Memory queries denote queries where the user is trying to recall from his/her past personal experiences. Neither Web search nor structured queries can effectively answer this type of queries, even when supported by Human Computation solutions. In this paper, we propose a new approach to answer memory queries that we call Transactive Search: The user-requested memory is reconstructed from a group of people by exchanging pieces of personal memories in order to reassemble the overall memory, which is stored in a distributed fashion among members of the group. We experimentally compare our proposed approach against a set of advanced search techniques including the use of Machine Learning methods over the Web of Data, online Social Networks, and Human Computation techniques. Experimental results show that Transactive Search significantly outperforms the effectiveness of existing search approaches for memory queries.
Reachable subwebs for traversal-based query execution BIBAFull-Text 541-546
  Olaf Hartig; M. Tamer Özsu
Traversal-based approaches to execute queries over data on the Web have recently been studied. These approaches make use of up-to-date data from initially unknown data sources and, thus, enable applications to tap the full potential of the Web. While existing work focuses primarily on implementation techniques, a principled analysis of subwebs that are reachable by such approaches is missing. Such an analysis may help to gain new insight into the problem of optimizing the response time of traversal-based query engines. Furthermore, a better understanding of characteristics of such subwebs may also inform approaches to benchmark these engines. This paper provides such an analysis. In particular, we identify typical graph-based properties of query-specific reachable subwebs and quantify their diversity. Furthermore, we investigate whether vertex scoring methods (e.g., PageRank) are able to predict query-relevance of data sources when applied to such subwebs.
Bots vs. wikipedians, anons vs. logged-ins BIBAFull-Text 547-548
  Thomas Steiner
Wikipedia is a global crowdsourced encyclopedia that at time of writing is available in 287 languages. Wikidata is a likewise global crowdsourced knowledge base that provides shared facts to be used by Wikipedias. In the context of this research, we have developed an application and an underlying Application Programming Interface (API) capable of monitoring realtime edit activity of all language versions of Wikipedia and Wikidata. This application allows us to easily analyze edits in order to answer questions such as "Bots vs. Wikipedians, who edits more?", "Which is the most anonymously edited Wikipedia?", or "Who are the bots and what do they edit?". To the best of our knowledge, this is the first time such an analysis could be done in realtime for Wikidata and for really all Wikipedias -- large and small. Our application is available publicly online at the URL http://wikipedia-edits.herokuapp.com/, and its code has been open-sourced under the Apache 2.0 license.
Quantising contribution effort in online communities BIBAFull-Text 549-550
  Grégoire Burel; Yulan He
We describe the Joint Effort-Topic (JET) model and the Author Joint Effort-Topic (aJET) model that estimate the effort required for users to contribute on different topics. We propose to learn word-level effort taking into account term preference over time and use it to set the priors of our models. Since there is no gold standard which can be easily built, we evaluate them by measuring their abilities to validate expected behaviours such as correlations between user contributions and the associated effort.
Spotting misbehaviors in location-based social networks using tensors BIBAFull-Text 551-552
  Evangelos Papalexakis; Konstantinos Pelechrinis; Christos Faloutsos
The proliferation of mobile devices that are capable of estimating their position has led to the emergence of a new class of social networks, namely location-based social networks (LBSNs for short). The main interaction between users in an LBSN is location sharing. While the latter can be realized through continuous tracking of a user's whereabouts from the service provider, the majority of LBSNs allow users to voluntarily share their location, through check-ins. LBSNs provide incentives to users to perform check-ins. However, these incentives can also lead to people faking their location, thus generating false information. In this work, we propose the use of tensor decomposition for spotting anomalies in the check-in behavior of users. To the best of our knowledge, this is the first attempt to model this problem using tensor analysis.
Spatial and temporal patterns of online food preferences BIBAFull-Text 553-554
  Claudia Wagner; Philipp Singer; Markus Strohmaier
Since food is one of the central elements of all human beings, a high interest exists in exploring temporal and spatial food and dietary patterns of humans. Predominantly, data for such investigations stem from consumer panels which continuously capture food consumption patterns from individuals and households. In this work we leverage data from a large online recipe platform which is frequently used in the German speaking regions in Europe and explore (i) the association between geographic proximity and shared food preferences and (ii) to what extent temporal information helps to predict the food preferences of users. Our results reveal that online food preferences of geographically closer regions are more similar than those of distant ones and show that specific types of ingredients are more popular on specific days of the week. The observed patterns can successfully be mapped to known real-world patterns which suggests that existing methods for the investigation of dietary and food patterns (e.g., consumer panels) may benefit from incorporating the vast amount of data generated by users browsing recipes on the Web.
When is it biased?: assessing the representativeness of Twitter's streaming API BIBAFull-Text 555-556
  Fred Morstatter; Jürgen Pfeffer; Huan Liu
Twitter shares a free 1% sample of its tweets through the "Streaming API". Recently, research has pointed to evidence of bias in this source. The methodologies proposed in previous work rely on the restrictive and expensive Firehose to find the bias in the Streaming API data. We tackle the problem of finding sample bias without costly and restrictive Firehose data. We propose a solution that focuses on using an open data source to find bias in the Streaming API.
Haters gonna hate: job-related offenses in Twitter BIBAFull-Text 557-558
  Ricardo Kawase; Patrick Siehndel; Eelco Herder
In this paper, we aim at finding out which users are likely to publicly demonstrate frustration towards their jobs on the microblogging platform Twitter -- we will call these users haters. We show that the profiles of haters have specific characteristics in terms of vocabulary and connections. The implications of these findings may be used for the development of an early alert system that can help users to think twice before they post potentially self-harming content.
Characterizing high-impact features for content retention in social web applications BIBAFull-Text 559-560
  Kaweh Djafari Naini; Ricardo Kawase; Nattiya Kanhabua; Claudia Niederee
One of the core challenges of automatically creating Social Web summaries is to decide which posts to remember, i.e., to consider for summary inclusion and which to forget. Keeping everything would overwhelm the user and would also neglect the often intentionally ephemeral nature of Social Web posts. In this paper, we analyze high-impact features that characterize memorable posts as a first step for this selection process. Our work is based on a user evaluation for discovering human expectations towards content retention.
Topic-based place semantics discovered from microblogging text messages BIBAFull-Text 561-562
  Eunyoung Kim; Hwon Ihm; Sung-Hyon Myaeng
Location-based social network services (LBSNS) such as Foursquare are getting the highlight with the extensive spread of GPS-enabled mobile devices, and a large body of research has been conducted to devise methods for understanding and clustering places. However, in previous studies, the predefined set of semantic categories of places plays a critical role in both discovery and evaluation of the results, despite its limited ability to represent the dynamics of the places. We explore beyond the predefined semantic categories of the places and discover topic-based place semantics through the use of Latent Dirichlet Allocation, by extracting topics from the text which people post on site. We also show the proposed method allows for understanding the temporal dynamics of the place semantics. The finding of this study is intended for, but not limited to, context aware services and place recommendation systems.
Has much potential but biased: exploring the scholarly landscape in Twitter BIBAFull-Text 563-564
  Haewoon Kwak; Jong Gun Lee
We explore how research papers are shared in Twitter to understand its potential and the limitations of the current practice that measures or predicts the scientific impact of research papers from the web. We track 54 second-level domains offering the top 100 journals listed in Google Scholar and collect 403,165 tweets sharing 75,677 unique research papers by 142,743 users over the course of 135 days. Our findings show the great potential of Twitter as a platform for paper sharing, but at the same time, indicate the limitations of measuring scientific impact through the lens of social media mainly due to the highly skewed and limited attention to a small number of top journals.
(Re)integrating the web: beyond 'socio-technical' BIBFull-Text 565-566
  Ramine Tinati; Leslie Carr; Susan Halford; Catherine Pope
Crowd vs. experts: nichesourcing for knowledge intensive tasks in cultural heritage BIBAFull-Text 567-568
  Jasper Oosterman; Alessandro Bozzon; Geert-Jan Houben; Archana Nottamkandath; Chris Dijkshoorn; Lora Aroyo; Mieke H. R. Leyssen; Myriam C. Traub
The results of our exploratory study provide new insights to crowdsourcing knowledge intensive tasks. We designed and performed an annotation task on a print collection of the Rijksmuseum Amsterdam, involving experts and crowd workers in the domain-specific description of depicted flowers. We created a testbed to collect annotations from flower experts and crowd workers and analyzed these in regard to user agreement. The findings show promising results, demonstrating how, for given categories, nichesourcing can provide useful annotations by connecting crowdsourcing to domain expertise.
Learning to rank for joy BIBAFull-Text 569-570
  Claudia Orellana-Rodriguez; Wolfgang Nejdl; Ernesto Diaz-Aviles; Ismail Sengor Altingovde
User-generated content is a growing source of valuable information and its analysis can lead to a better understanding of users' needs and trends. In this paper, we leverage user feedback about YouTube videos for the task of affective video ranking. To this end, we follow a learning to rank approach, which allows us to compare the performance of different sets of features when the ranking task goes beyond mere relevance and requires an affective understanding of the videos. Our results show that, while basic video features, such as title and tags, lead to effective rankings in an affective-less setup, they do not perform as well when dealing with an affective ranking task.
Number frequency on the web BIBAFull-Text 571-572
  Willem Robert van Hage; Thomas Ploeger; Jesper Hoeksema
In this article we investigate the properties of the frequency distribution of numbers on the Web. We work with a part of the Common Crawl dataset comprising 3.8 billion Web documents and a recent dump of the English language Wikipedia. We show that, like words, numbers on the Web follow a Power law distribution, and obey Benford's law of first-digits. We show and explain regularities in the distribution, and compare the regularities in Common Crawl to those in Wikipedia. The comparison stresses which patterns in the frequency distributions follow from human thought.
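As an illustration of the first-digit law mentioned above, the following minimal sketch computes the Benford expectation P(d) = log10(1 + 1/d) and the observed leading-digit shares of a list of positive integers. The function names and the sample data are assumptions chosen for illustration; this is not the authors' pipeline over Common Crawl.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def observed_first_digits(numbers):
    """Observed share of each leading digit among the positive integers given."""
    counts = Counter(int(str(n)[0]) for n in numbers if n > 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one positive integer")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical sample: powers of two, a sequence often used to illustrate the law.
sample = [2 ** k for k in range(1, 201)]
expected = benford_expected()
observed = observed_first_digits(sample)

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

The nine expected shares sum to one because the logarithms telescope to log10(10); digit 1 is expected to lead about 30.1% of the time.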
WebAlive: a new paradigm for bringing things to life on the web BIBAFull-Text 573-574
  Rahul Parundekar
In this poster, we propose WebAlive, a new paradigm that brings any Thing in the realm of reality to 'life' on the Web by creating an entangled virtual existence for it. For example, consider a physical object like the lamp on your desk. As a human, you are aware of its existence. You can see it, touch it, feel it, interact with it (turn it on, off, etc.), investigate its attributes (color, material, etc.). You can even alter or trash it. Now, imagine that the lamp had a virtual counterpart on the Web such that the two of them were entangled with each other. A software agent on the Web can now become aware of its existence, it can perform actions on it, learn about its attributes and even (with the right capability) alter it just how it was accessible to you. Referring to the broadest definition of the word Thing, we can extend this entanglement concept beyond physical things like devices, objects, people, etc. to all intelligible things like entities, processes, concepts, ideas, etc. By doing so, we can open a new realm of possibilities where we could have digital access to the world around us. This poster provides an introduction to the paradigm and presents an approach to realize it using Web standards.
Predicting ideological friends and foes in Twitter conflicts BIBAFull-Text 575-576
  Zhe Liu; Ingmar Weber
The rise in popularity of Twitter in recent years has in parallel led to an increase in online controversies. To monitor and control such conflicts early on, we design and evaluate a language-agnostic classifier to tell pairs of ideological friends from foes. We build the classifier using features from four different aspects: user-based, interaction-based, relationship-based and conflict-based. By experimenting with three large data sets containing diverse conflicts, we demonstrate the effectiveness of language-agnostic classification of ideological relation, achieving satisfactory results across all three data sets. Such a classifier potentially enables studies of diverse conflicts on Twitter on a large scale.

WWW 2014 developers' track

Testsuite and harness for browser based editing BIBAFull-Text 579-582
  Dave Raggett
The Web is still awkward when it comes to online editing. It is time to fix that, and make it easier for developers to create smarter browser based editing tools. This short paper presents work on a test framework for a cross browser open source library for browser based editing. The aim is to encourage a proliferation of different kinds of browser based editing for a wide range of purposes. The library steps around the considerable variations in the details of browser support for designMode and contentEditable, which have acted as a brake on realizing the full potential for browser based editing.
CrossLanguageSpotter: a library for detecting relations in polyglot frameworks BIBAFull-Text 583-586
  Federico Tomassetti; Giuseppe Rizzo; Raphael Troncy
Nowadays, most of the web frameworks are developed using different programming languages, both for server and client side programmes. The typical scenario includes a general purpose language (e.g. Ruby, Python, Java) used together with different specialized languages: HTML, CSS, Javascript and SQL. All the artifacts are connected via different types of relations, most of which depend on the adopted framework. These cross-language relations are normally not captured by tools which require the developer to learn and to remember those associations in order to understand and maintain the application. This paper describes a library for detecting cross-language relations in polyglot frameworks. The library has been developed to be modular and to be easily integrated in existing IDEs. The library is publicly available at http://github.com/CrossLanguageProject/crosslanguagespotter
A voice-controlled web browser to navigate hierarchical hidden menus of web pages in a smart-tv environment BIBAFull-Text 587-590
  Sungjae Han; Geunseong Jung; Minsoo Ryu; Byung-Uk Choi; Jaehyuk Cha
This paper proposes a new voice web browser that can be operated in smart TV environments. Previous voice web browsers had the limitation of being run under limited conditions; for example, a list of the specific contents of a page was outputted by voice, or the user entered a search term by voice. In our method proposed in this paper, all the hierarchical menu areas on a web page are recognized and controlled with voice keywords so that page navigation according to a menu can be conveniently done in a voice supported web browser. Although many studies have been conducted on web page menu recognition, most of them provide insufficient information to recognize the hierarchical menu structure. In other words, most web pages in recent browsers showed submenus only as a result of a specific user interaction, since these previous studies had no way of recognizing or controlling the submenus. Therefore, in the web browser proposed in this study, a hierarchical menu structure, which is inserted dynamically via user interaction, is recognized and selected by voice, thus making it possible to maneuver on the web page. Furthermore, the core code of the browser is implemented in JavaScript, so it can be flexibly used not only for a web browser on Smart TVs, but also as functional extensions of existing web browsers in a PC environment.
ROSeAnn: taming online semantic annotators BIBAFull-Text 591-594
  Stefano Ortona; Luying Chen; Giorgio Orsi
Named entity extractors are a popular means for enriching documents with semantic annotations. Both the overlap and the increasing diversity in the capabilities and in the vocabularies of the annotators motivate the need for managing and integrating semantic annotations in a coherent and uniform fashion. ROSeAnn is a framework for the management and the reconciliation of semantic annotations. It provides end-users and programmers with a unified view over the results of multiple online and standalone annotators, linking them to an integrated ontology of their vocabularies, and supporting a variety of document formats such as: plain text, live Web pages, and PDF documents. Although ROSeAnn provides two predefined algorithms for conflict resolution -- one supervised, appropriate when representative training data is available, and one unsupervised -- it also allows application developers to define their own integration techniques, as well as extending the pool of annotators as new ones become available.
Relevant change detection: a framework for the precise extraction of modified and novel web-based content as a filtering technique for analysis engines BIBAFull-Text 595-598
  Kevin Borgolte; Christopher Kruegel; Giovanni Vigna
Tracking the evolution of websites has become fundamental to the understanding of today's Internet. The automatic reasoning of how and why websites change has become essential to developers and businesses alike, in particular because the manual reasoning has become impractical due to the sheer number of modifications that websites undergo during their operational lifetime, including but not limited to rotating advertisements, personalized content, insertion of new content, or removal of old content. Prior work in the area of change detection, such as XyDiff, X-Diff or AT&T's internet difference engine, focused mainly on "diffing" XML-encoded literary documents or XML-encoded databases. Only some previous work investigated the differences that must be taken into account to accurately extract the difference between HTML documents for which the markup language does not necessarily describe the content but is used to describe how the content is displayed instead. Additionally, prior work identifies all changes to a website, even those that might not be relevant to the overall analysis goal; in turn, these changes unnecessarily burden the analysis engine with additional workload.
   In this paper, we introduce a novel analysis framework, the Delta framework, that works by (i) extracting the modifications between two versions of the same website using a fuzzy tree difference algorithm, and (ii) using a machine-learning algorithm to derive a model of relevant website changes that can be used to cluster similar modifications to reduce the overall workload imposed on an analysis engine. Based on this model for example, the tracked content changes can be used to identify ongoing or even inactive web-based malware campaigns, or to automatically learn semantic translations of sentences or paragraphs by analyzing websites that are available in multiple languages.
   In prior work, we showed the effectiveness of the Delta framework by applying it to the detection and automatic identification of web-based malware campaigns on a data set of over 26 million pairs of websites that were crawled over a time span of four months. During this time, the system based on our framework successfully identified previously unknown web-based malware campaigns, such as a targeted campaign infecting installations of the Discuz!X Internet forum software.
Developing web of data applications from the browser BIBAFull-Text 599-602
  Pavel Arapov; Michel Buffa; Amel Ben Othmane
WikiNEXT is a wiki engine 100% written in JavaScript that relies on recent APIs and frameworks. It has been designed to author web applications directly in a web browser, which can exploit the web of data. It combines the functionalities of a semantic wiki with those of a Web-based IDE (Integrated Development Environment) in order to develop web applications in addition to writing classic documents. It gives developers a rich internal API (Application Programming Interface) and provides several functionalities to exploit the web of data. Our approach uses templates, a special type of wiki pages that represent the semantic data model. Templates generate wiki pages with semantic annotations that are stored as quadruplets in a triple store engine. To query this semantic data, we provide a SPARQL endpoint. Screencasts are available on YouTube (look for WikiNEXT).

WWW 2014 industry track

When machines dominate humans: the challenges of mining and consuming machine-generated web mail BIBAFull-Text 605-606
  Yoelle Maarek
In spite of personal communications moving more and more towards social and mobile, especially with younger generations, email traffic continues to grow. This growth is mostly attributed to (non-spam) machine-generated email, which, against common perception, is often extremely valuable. Indeed, together with monthly newsletters that can easily be ignored, inboxes contain flight itineraries, booking confirmations, receipts or invoices that are critical to many users. In this talk, I will discuss the new nature of consumer email, which is dominated by machine-generated messages of highly heterogeneous forms and value. I will show how the change has not been fully recognized yet by most email clients (as an example, why should there still be a reply option associated with a message coming from a "do-not-reply@" address?). I will introduce some approaches for large-scale mail mining specifically tailored to machine-generated email. I will conclude by discussing possible applications and research directions.
Analyzing behavioral data for improving search experience BIBAFull-Text 607-608
  Pavel Serdyukov
Yandex is one of the largest internet companies in Europe, operating Russia's most popular search engine, generating 62% of all search traffic in Russia, which means processing about 220 million queries from about 22 million users daily. Clearly, the amount and the variety of user behavioral data which we can monitor at search engines is rapidly increasing. Still, we do not always recognize its potential to help us solve the most challenging search problems and do not immediately know the ways to deal with it most effectively both for search quality evaluation and for its improvement. My talk will focus on various practical challenges arising from the need to "grok" search engine users and do something useful with the data they most generously, though almost unconsciously share with us. I will also present some answers to that by overviewing our latest research on user model based retrieval quality evaluation, implicit feedback mining and personalization. I will also summarize the experience we gained from organizing three data mining challenges at the series of workshops on using search click data (WSCD) organized in the scope of WSDM 2012 -- 2014 conferences. These challenges provided a unique opportunity to consolidate and scrutinize the work from search engines' industrial labs on analyzing behavioral data. Each year we publicly shared a fully anonymized dataset extracted from Yandex query logs and asked participants to predict editorial relevance labels of documents using search logs (in 2011), detect search engine switchings in search sessions (in 2012) and personalize web search using the long-term (user history based) and short-term (session-based) user context (in 2013).

Modeling social media: mining big data in social media and the web (MSM 2014)

The pursuit of urban happiness BIBAFull-Text 611-612
  Daniele Quercia
Cities are attracting considerable research interest. The agenda behind smart cities is popular among computer scientists and engineers: new monitoring technologies promise to allocate urban resources (e.g., electricity, clean water, car traffic) more efficiently and, as such, make our cities 'smarter'. This talk offers a rare counterpoint to that dominant efficiency-driven narrative. It is about recent research on the relationship between happiness and cities [1]: which urban elements make people happy? To help answer that question, I built a web game with collaborators at the University of Cambridge in which users are shown ten pairs of urban scenes of London and, for each pair, a user needs to choose which one they consider to be most beautiful, quiet, and happy. Based on user votes, we are able to rank all urban scenes according to these three attributes. We recently analyzed the scenes with ratings using image processing tools [2]. We discovered that the amount of greenery in any given scene is associated with all the three attributes and that cars and fortress-like buildings are associated with sadness (we equated sadness to our measurement for the low end of our 'spectrum' of happiness). In contrast, public gardens and Victorian and red brick houses are associated with happiness.
   Our results (including those about distinctive and memorable areas [3]) all point in the same direction: urban elements that hinder social interactions are undesirable, while elements that increase interactions are the ones that should be integrated by urban planners to retrofit our cities for happiness.
   Now, as a computer scientist, you might wonder: can these findings be used to build better online tools? The answer is a definite 'Yes'! Existing mapping technologies, for example, return shortest directions. To complement them, we are designing new tools that return directions that are not only short but also tend to make urban walkers happy [4]. Another application comes from the mobile world. In mobile settings, geo-referenced content becomes increasingly important, and content about a neighborhood inherently depends on the way the neighborhood is perceived by people: whether it is, for instance, distinctive and beautiful or not. We are designing an application that identifies memorable city pictures by predicting which neighborhoods tend to be beautiful and which tend to make people happy [5].
YouTube monetization: creating user-centric experiences using large scale data BIBAFull-Text 613-614
  Ramesh R. Sarukkai
Over the last 4 years, YouTube has grown from a viral video sharing site to a platform that fuels a win-win ecosystem for video content creators, advertisers and users. A key driving force behind this successful transformation is building out products/platforms that focus and optimize for the user. In this talk, we will talk about user-centric efforts such as launch of user-controlled skippable ads (TrueView Instream), and Dynamic Ad Loads (machine learning system that balances realtime user impact with advertising policies). Leveraging very large amounts of real-time activity data is paramount to successfully building and deploying such user-centric models. We conclude the talk with challenges and opportunities in this important area of real-time user analysis and large data modeling.
Identifying social roles in reddit using network structure BIBAFull-Text 615-620
  Cody Buntain; Jennifer Golbeck
As social networks and the user-generated content that populates them continue to grow in prevalence, size, and influence, understanding how users interact and produce this content becomes increasingly important. Insight into these community dynamics could prove valuable for measuring content trust, providing role-based group recommendations, or evaluating group stability and growth. To this end, we explore user posting behavior on reddit, a large social networking site comprised of many sub-communities in which a user may participate simultaneously. We demonstrate that the well-known "answer-person" role is present in the reddit community, provide an exposition on an automated method for identifying this role based solely on user interactions (foregoing expensive content analysis), and show that users rarely exhibit significant participation in more than one community.
Mining cross-domain rating datasets from structured data on Twitter BIBAFull-Text 621-624
  Simon Dooms; Toon De Pessemier; Luc Martens
While rating data is essential for all recommender systems research, there are only a few public rating datasets available, most of them years old and limited to the movie domain. With this work, we aim to end the lack of rating data by illustrating how vast amounts of ratings can be unambiguously collected from Twitter. We validate our approach by mining ratings from four major online websites focusing on movies, books, music and video clips. In a short mining period of 2 weeks, close to 3 million ratings were collected. Since some users turned up in more than one dataset, we believe this work to be amongst the first to provide a true cross-domain rating dataset.
Predicting crowd behavior with big public data BIBAFull-Text 625-630
  Nathan Kallus
With public information becoming widely accessible and shared on today's web, greater insights are possible into crowd actions by citizens and non-state actors such as large protests and cyber activism. We present efforts to predict the occurrence, specific timeframe, and location of such actions before they occur based on public data collected from over 300,000 open content web sources in 7 languages, from all over the world, ranging from mainstream news to government publications to blogs and social media. Using natural language processing, event information is extracted from content such as type of event, what entities are involved and in what role, sentiment and tone, and the occurrence time range of the event discussed. Statements made on Twitter about a future date from the time of posting prove particularly indicative. We consider in particular the case of the 2013 Egyptian coup d'état. The study validates and quantifies the common intuition that data on social media (beyond mainstream news sources) are able to predict major events.
On the evolution of social groups during coffee breaks BIBAFull-Text 631-636
  Martin Atzmueller; Andreas Ernst; Friedrich Krebs; Christoph Scholz; Gerd Stumme
This paper focuses on the analysis of group evolution events in networks of face-to-face proximity. First, we analyze statistical properties of group evolution, e.g., individual activity and typical group sizes. Furthermore, we define a set of specific group evolution events. We analyze these using real-world data collected at the LWA 2010 conference using the Conferator system, and discuss patterns according to different phases of the conference.
On the predictability of recurring links in networks of face-to-face proximity BIBAFull-Text 637-642
  Christoph Scholz; Martin Atzmueller; Gerd Stumme
This paper focuses on the predictability of recurring links: These links are generated repeatedly in a network for different forms of social ties, e.g. by face-to-face interactions in offline social networks. In particular, we analyse the predictability of recurring links in networks of face-to-face proximity using several path-based measures, and compare these to network-proximity measures based on the nodes' neighbourhood. Furthermore, we show that the current tie strength is a good predictor for this link prediction task. In addition we show that the removal of weak ties improves the predictability for most of the considered network proximity measures. For our analysis we utilize three real-world datasets collected at different scientific conferences using the Conferator (http://www.conferator.org) system.
Inferring Twitter user locations with 10 km accuracy BIBAFull-Text 643-648
  KyoungMin Ryoo; Sue Moon
Geographic locations of users form an important axis in public polls and localized advertising, but are not available by default. The number of users who make their locations public or use GPS tagging is relatively small, compared to the huge number of users in online social networking services and social media platforms. In this work we propose a new framework to infer a user's main location of activities in Twitter using their textual contents. Our approach is based on a probabilistic generative model that filters local words, employs data binning for scalability, and applies a map projection technique for performance. For Korean Twitter users, we report that 60% of users are identified within 10 km of their locations, a significant improvement over existing approaches.
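The "within 10 km" figure above is a distance-thresholded accuracy. A minimal sketch of how such a metric could be computed is given below; it is not the authors' evaluation code. The haversine great-circle distance, the function names (haversine_km, fraction_within) and the sample coordinates are assumptions made for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fraction_within(pairs, threshold_km=10.0):
    """Share of (predicted, actual) location pairs that lie within threshold_km of each other."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for (plat, plon), (alat, alon) in pairs
        if haversine_km(plat, plon, alat, alon) <= threshold_km
    )
    return hits / len(pairs)


# Illustrative coordinates only (approximate city centres), not data from the paper.
seoul = (37.5665, 126.9780)
busan = (35.1796, 129.0756)
sample = [(seoul, seoul), (busan, seoul)]  # one exact prediction, one far-off prediction
score = fraction_within(sample)
```

With this sample, one of the two predictions falls inside the 10 km radius, so the score is one half.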

Public health in the digital age: social media, crowdsourcing and participatory systems (2nd PHDA 2014)

On the ground validation of online diagnosis with Twitter and medical records BIBAFull-Text 651-656
  Todd Bodnar; Victoria C. Barclay; Nilam Ram; Conrad S. Tucker; Marcel Salathé
Social media has been considered as a data source for tracking disease. However, most analyses are based on models that prioritize strong correlation with population-level disease rates over determining whether or not specific individual users are actually sick. Taking a different approach, we develop a novel system for social-media based disease detection at the individual level using a sample of professionally diagnosed individuals. Specifically, we develop a system for making an accurate influenza diagnosis based on an individual's publicly available Twitter data. We find that about half (17/35 = 48.57%) of the users in our sample that were sick explicitly discuss their disease on Twitter. By developing a meta classifier that combines text analysis, anomaly detection, and social network analysis, we are able to diagnose an individual with greater than 99% accuracy even if she does not discuss her health.
Integration and visualization public health dashboard: the medi+board pilot project BIBAFull-Text 657-662
  Patty Kostkova; Stephan Garbin; Justin Moser; Wendy Pan
Traditional public health surveillance systems would benefit from integration with knowledge created by new situation-aware realtime signals from social media, online searches, mobile/sensor networks and citizens' participatory surveillance systems. However, the challenge of threat validation, cross-verification and information integration for risk assessment has so far been largely untackled. In this paper, we propose a new system, medi+board, monitoring epidemic intelligence sources and traditional case-based surveillance to better automate early warning, cross-validation of signals for outbreak detection and visualization of results on an interactive dashboard. This enables public health professionals to see all essential information at a glance. Modular and configurable to any 'event' defined by public health experts, medi+board scans multiple data sources, detects changing patterns and uses a configurable analysis module for signal detection to identify a threat. These threats can be validated by an analysis module and correlated with other sources to assess the reliability of the event, expressed as a reliability coefficient, a real number between zero and one. Events are reported and visualized on the medi+board dashboard which integrates all information sources and can be navigated by a timescale widget. Simulation with three datasets from the swine flu 2009 pandemic (HPA surveillance, Google news, Twitter) demonstrates the potential of medi+board to automate data processing and visualization to assist public health experts in decision making on control and response measures.
Participatory disease detection through digital volunteerism: how the DoctorMe application aims to capture data for faster disease detection in Thailand BIBAFull-Text 663-666
  Patipat Susumpow; Patcharaporn Pansuwan; Nathalie Sajda; Adam W. Crawley
This paper reports the work in progress of incorporating a participatory disease detection mechanism into the existing web- and mobile device application DoctorMe in Thailand. As Southeast Asia has a high likelihood of hosting potential outbreaks of epidemics it is crucial to enable citizens to collectively contribute to improved public health through crowdsourced data, which is currently lacking. This paper focuses foremost on the localised approach, utilizing elements such as gamification, digital volunteerism and personalised health recommendations for participating users. DoctorMe's participatory disease detection approach aims to tap into the accelerating technological landscape in Thailand and to improve personal health and provide valuable data for institutional analysis that may prevent or decrease the impact of infectious disease outbreaks.
System for surveillance and investigation of disease outbreaks BIBAFull-Text 667-668
  Deleer Barazanji; Pär Bjelkmar
Information technology contributes greatly to improving people's health. Through our interaction with different communication channels such as social media, telephone calls or purchasing over-the-counter medicines, we emit signals and leave trails of information related to our health, information that can be used to understand our health situation. Some of these communication channels are more structured, filtered and suited for evaluation; as a result, they require less demanding filtration and analysis than others. One such channel is the Swedish National Telephone Health Service 1177, where professional healthcare personnel assist and give advice to callers. The main aim of this work is to detect point source outbreaks. For this purpose a project called Event-based Surveillance System (ESS) was initiated to develop a system for surveillance and detection using the aforementioned information source. The system is currently running and is used to notify local authorities whenever a deviation in the telephone traffic pattern is recorded.
One health informatics BIBAFull-Text 669-670
  Hans C. Ossebaard
Zoonoses are a class of infectious diseases causing growing concern among health authorities worldwide. Human and economic costs of zoonoses are substantial, especially in low-resource countries. New zoonoses emerge as a consequence of ecological, demographic, cultural, social and behavioral factors. Meanwhile, global antimicrobial resistance increases. This public health threat demands a new approach to which the concept of "One Health" is emblematical. It emphasizes the interconnectedness of human, animal and environmental health. To protect and improve public health it is imperative that transdisciplinary collaboration and communication takes place between the human and the veterinary domain. This strategy is now widely endorsed by international, regional and national health policy and academic bodies. Nonetheless the contributions of both the social sciences and the new data sciences need more appreciation. Evidence is available that the methods and concepts they provide can budge "One Health".
Volunteer-powered automatic classification of social media messages for public health in AIDR BIBAFull-Text 671-672
  Muhammad Imran; Carlos Castillo
Microblogging platforms such as Twitter have become a valuable resource for disease surveillance and monitoring. Automatic classification can be used to detect disease-related messages and to sort them into meaningful categories. In this paper, we show how the AIDR (Artificial Intelligence for Disaster Response) platform can be used to harvest and perform analysis of tweets in real-time using supervised machine learning techniques. AIDR is a volunteer-powered online social media content classification platform that automatically learns from a set of human-annotated examples to classify tweets into user-defined categories. In addition, it automatically increases classification accuracy as new examples become available. AIDR can be operated through a web interface without the need to deal with the complexity of the machine learning methods used.
Understanding Twitter influence in the health domain: a social-psychological contribution BIBAFull-Text 673-678
  Andrew R. McNeill; Pam Briggs
Twitter can be a powerful tool for the dissemination and discussion of public health information but how can we best describe its influence? In this paper we draw on social-psychological concepts such as social norms, social representations, emotions and rhetoric to explain how influence works both in terms of the spread of information and also its personal impact. Using tweets drawn from a range of health issues, we show that social psychological theory can be used in the qualitative analysis of Twitter data to further our understanding of how health behaviours can be affected by social media discourse.

Simplifying complex networks for practitioners 2014 workshop

Measuring and maximizing group closeness centrality over disk-resident graphs BIBAFull-Text 689-694
  Junzhou Zhao; John C. S. Lui; Don Towsley; Xiaohong Guan
As an important metric in graphs, group closeness centrality measures how close a group of vertices is to all other vertices in a graph, and it is used in numerous graph applications such as measuring the dominance and influence of a node group over the graph. However, when a large-scale graph contains hundreds of millions of nodes/edges which cannot reside entirely in computer's main memory, measuring and maximizing group closeness become challenging tasks. In this paper, we present a systematic solution for efficiently calculating and maximizing the group closeness for disk-resident graphs. Our solution first leverages a probabilistic counting method to efficiently estimate the group closeness with high accuracy, rather than exhaustively computing it in an exact fashion. In addition, we design an I/O-efficient greedy algorithm to find a node group that maximizes group closeness. Our proposed algorithm significantly reduces the number of random accesses to disk, thereby dramatically improving computational efficiency. Experiments on real-world big graphs demonstrate the efficacy of our approach.
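To make the metric in the abstract above concrete, here is a minimal in-memory sketch. It is not the paper's method: the paper estimates group closeness with probabilistic counting and an I/O-efficient algorithm for disk-resident graphs, whereas this sketch computes it exactly with a multi-source breadth-first search on a small adjacency dictionary. The normalization used (number of outside vertices divided by the sum of their distances to the group), the zero score for unreachable vertices, and the function names are assumptions of this sketch.

```python
from collections import deque


def group_distances(adj, group):
    """Multi-source BFS: hop distance from the nearest group member to each reachable vertex."""
    dist = {v: 0 for v in group}
    queue = deque(group)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def group_closeness(adj, group):
    """Exact group closeness: (#vertices outside the group) / (sum of their distances to the group)."""
    group = set(group)
    dist = group_distances(adj, group)
    outside = [v for v in adj if v not in group]
    if not outside:
        return 0.0
    if any(v not in dist for v in outside):
        return 0.0  # convention assumed here: a group that cannot reach every vertex scores zero
    return len(outside) / sum(dist[v] for v in outside)


def greedy_group(adj, k):
    """Naive greedy maximization: repeatedly add the vertex that yields the highest closeness."""
    group = []
    for _ in range(k):
        candidates = [v for v in adj if v not in group]
        best = max(candidates, key=lambda v: group_closeness(adj, group + [v]))
        group.append(best)
    return group


# Path graph 0 - 1 - 2 - 3 - 4 (illustrative data only).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
centre_score = group_closeness(path, [2])  # distances 1, 1, 2, 2 -> 4 / 6
end_score = group_closeness(path, [0])     # distances 1, 2, 3, 4 -> 4 / 10
```

Each greedy step here reruns a full BFS per candidate, which is only workable for small graphs; that cost is the kind of bottleneck the abstract's estimation approach is meant to avoid.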
Efficient network generation under general preferential attachment BIBAFull-Text 695-700
  James Atwood; Bruno Ribeiro; Don Towsley
Preferential attachment (PA) models of network structure are widely used due to their explanatory power and conceptual simplicity. PA models are able to account for the scale-free degree distributions observed in many real-world large networks through the remarkably simple mechanism of sequentially introducing nodes that attach preferentially to high-degree nodes. The ability to efficiently generate instances from PA models is a key asset in understanding both the models themselves and the real networks that they represent. Surprisingly, little attention has been paid to the problem of efficient instance generation. In this paper, we show that the complexity of generating network instances from a PA model depends on the preference function of the model, provide efficient data structures that work under any preference function, and present empirical results from an implementation based on these data structures. We demonstrate that, by indexing growing networks with a simple augmented heap, we can implement a network generator which scales many orders of magnitude beyond existing capabilities (10^6 -- 10^8 nodes). We show the utility of an efficient and general PA network generator by investigating the consequences of varying the preference functions of an existing model. We also provide "quicknet", a freely-available open-source implementation of the methods described in this work.
Finding influential neighbors to maximize information diffusion in Twitter BIBAFull-Text 701-706
  Hyoungshick Kim; Konstantin Beznosov; Eiko Yoneki
The problem of spreading information is a topic of considerable recent interest, but the traditional influence maximization problem is inadequate for a typical viral marketer who cannot access the entire network topology. To fix this flawed assumption that the marketer can control any arbitrary k nodes in a network, we have developed a decentralized version of the influence maximization problem by influencing k neighbors rather than arbitrary users in the entire network. We present several reasonable neighbor selection schemes and evaluate their performance with a real dataset collected from Twitter. Unlike previous studies using network topology alone or synthetic parameters, we use real propagation rate for each node calculated from the Twitter messages during the 2010 UK election campaign. Our experimental results show that information can be efficiently propagated in online social networks using neighbors with a high propagation rate rather than those with a high number of neighbors.
Kindred domains: detecting and clustering botnet domains using DNS traffic BIBAFull-Text 707-712
  Matthew Thomas; Aziz Mohaisen
In this paper we focus on detecting and clustering distinct groupings of domain names that are queried by numerous sets of infected machines. We propose to analyze domain name system (DNS) traffic, such as Non-Existent Domain (NXDomain) queries, at several premier Top Level Domain (TLD) authoritative name servers to identify strongly connected cliques of malware related domains. We illustrate typical malware DNS lookup patterns when observed on a global scale and utilize this insight to engineer a system capable of detecting and accurately clustering malware domains to a particular variant or malware family without the need for obtaining a malware sample. Finally, the experimental results of our system will provide a unique perspective on the current state of globally distributed malware, particularly the ones that use DNS.
Network analysis of university courses BIBAFull-Text 713-718
  Ahmad Slim; Jarred Kozlick; Gregory L. Heileman; Jeff Wigdahl; Chaouki T. Abdallah
Crucial courses have a high impact on students' progress at universities and ultimately on graduation rates. Detecting such courses should therefore be a major focus of decision makers at universities. Based on complex network analysis and graph theory, this paper proposes a new framework to not only detect such courses, but also quantify their cruciality. The experimental results conducted using data from the University of New Mexico (UNM) show that the distribution of course cruciality follows a power law distribution. The results also show that the ten most crucial courses at UNM are all in mathematics. Applications of the proposed framework are extended to study the complexity of curricula within colleges, which leads to a consideration of the creation of optimal curricula. Optimal curricula along with the earned letter grades of the courses are further exploited to analyze the student progress. This work is important as it presents a robust framework to ensure the ease of flow of students through curricula with the goal of improving a university's graduation rate.
Classifying latent infection states in complex networks BIBAFull-Text 719-722
  Yeon-sup Lim; Bruno Ribeiro; Don Towsley
In this work, we develop techniques to identify the latent infected nodes in the presence of missing infection time-and-state data. Based on the likely epidemic paths predicted by the simple susceptible-infected epidemic model, we propose a measure (Infection Betweenness Centrality) for uncovering unknown infection states. Our experimental results using machine learning algorithms show that Infection Betweenness Centrality is the most effective feature for identifying latent infected nodes.
Temporal capacity graphs for time-varying mobile networks BIBAFull-Text 723-726
  Xiangming Zhu; Yong Li; Depeng Jin; Pan Hui
With the rapid emergence of applications in mobile networks, understanding and characterizing their properties becomes extremely important. In this paper, from the fundamental model of time-varying graphs, we introduce Temporal Capacity Graphs (TCG), which characterizes the maximum amount of the data that can be transmitted between any two nodes within any time, and consequently reveals the transmission capacity of the whole network. By applying TCG to several realistic mobile networks, we analyze their unique properties. Moreover, using TCG, we reveal the fundamental relationships and tradeoffs between the mobile network settings and system performance.
Complex network comparison using random walks BIBAFull-Text 727-730
  Shan Lu; Jieqi Kang; Weibo Gong; Don Towsley
In this paper, we propose a network comparison method based on the mathematical theory of diffusion over manifolds using random walks over graphs. We show that our method not only distinguishes between graphs with different degree distributions, but also different graphs with the same degree distributions. We compare the undirected power law graphs generated by the Barabasi-Albert model and directed power law graphs generated by Krapivsky's model to the random graphs generated by the Erdos-Renyi model. We also compare power law graphs generated by four different generative models with the same degree distribution.
Mal-netminer: malware classification based on social network analysis of call graph BIBAFull-Text 731-734
  Jae-wook Jang; Jiyoung Woo; Jaesung Yun; Huy Kang Kim
In this work, we aim to classify malware using automatic classifiers by employing graph metrics commonly used in social network analysis. First, we make a malicious system call dictionary that consists of system calls found in malware. To analyze the general structural information of malware and measure the influence of system calls found in malware, we adopt social network analysis. Thus, we use social network metrics such as the degree distribution, degree centrality, and average distance, which are implicitly equivalent to distinct behavioral characteristics. Our experiments demonstrate that the proposed system performs well in classifying malware families within each malware class with accuracy greater than 98%. By exploiting the social network properties of system calls found in malware, our proposed method can not only classify the malware with fewer features than previous methods adopting graph features but also enables us to build a quick and simple detection system against malware.
Designing a high-performance mobile cloud web browser BIBAFull-Text 735-736
  Ryan H. Choi; Youngil Choi
A mobile cloud web browser is a web browser that enables mobile devices with constrained resources to support complex web pages by performing most of the resource-demanding operations on a cloud web server. In this paper, we present a design of a mobile cloud web browser with an efficient data structure.
Andro-profiler: anti-malware system based on behavior profiling of mobile malware BIBFull-Text 737-738
  Jae-wook Jang; Jaesung Yun; Jiyoung Woo; Huy Kang Kim
Propagation phenomena in large social networks BIBAFull-Text 739-740
  Meeyoung Cha
Social media and blogging services have become extremely popular. Every day hundreds of millions of users share conversations on random thoughts, emotional expressions, political news, and social issues. Users interact by following each other's updates and passing along interesting pieces of information to their friends. Information therefore can diffuse widely and quickly through social links. Information propagation in networks like Twitter is unique in that traditional media sources and word-of-mouth propagation coexist. The availability of digitally-logged propagation events in social media helps us better understand how user influence, tie strength, repeated exposures, conventions, and various other factors come into play in the way people generate and consume information in modern society. In this talk, I will present several findings on how bad news [9], rumors [8], prominent events [11], conventions [6, 7], tags [1, 4], behaviors [12], and moods [10] propagate in social media based on a large amount of data collected from networks like Twitter, Flickr, Facebook, and Blogosphere. I will talk about the different roles of user types [2] and content types [5] in propagations as well as ways to measure their influence [3]. Among various findings, I will demonstrate that the indegree of a user, a well-known measure of popularity, alone can reveal little about the influence.

2014 social news on the web workshop

Challenges of computational verification in social multimedia BIBAFull-Text 743-748
  Christina Boididou; Symeon Papadopoulos; Yiannis Kompatsiaris; Steve Schifferes; Nic Newman
Fake or misleading multimedia content and its distribution through social networks such as Twitter constitutes an increasingly important and challenging problem, especially in the context of emergencies and critical situations. In this paper, the aim is to explore the challenges involved in applying a computational verification framework to automatically classify tweets with unreliable media content as fake or real. We created a data corpus of tweets around big events focusing on the ones linking to images (fake or real) of which the reliability could be verified by independent online sources. Extracting content and user features for each tweet, we explored the fake prediction accuracy performance using each set of features separately and in combination. We considered three approaches for evaluating the performance of the classifier, ranging from the use of standard cross-validation, to independent groups of tweets and to cross-event training. The obtained results included an accuracy of 81% for tweet features and 75% for user ones in the case of cross-validation. When using different events for training and testing, the accuracy is much lower (up to 58%) demonstrating that the generalization of the predictor is a very challenging issue.
Alethiometer: a framework for assessing trustworthiness and content validity in social media BIBAFull-Text 749-752
  Eva Jaho; Efstratios Tzoannos; Aris Papadopoulos; Nikos Sarris
There are both positive and negative aspects in the use of social media in news and information dissemination. To deal with the negative aspects, such as the spread of rumours and fake news, the flow of information should implicitly be filtered and marked according to specific criteria such as credibility, trustworthiness, reputation, popularity, influence, and authenticity. This paper proposes an approach that can enhance trustworthiness and content validity in the presence of information overload. We introduce Alethiometer, a framework for assessing truthfulness in social media that can be used by professional and general news users alike. We present different measures that delve into the detailed analysis of the content, the contributors of the content and the underlying context. We further propose an approach for deriving a single metric that considers, in a unified manner, the quality of a contributor and of the content provided by that contributor. Finally, we present some preliminary statistical results from the examination of a set of 10 million Twitter users, which provide useful insights on the characteristics of social media data.
Trends of news diffusion in social media based on crowd phenomena BIBAFull-Text 753-758
  Minkyoung Kim; David Newth; Peter Christen
Information spreads across social media, interconnecting heterogeneous social networks and producing diffusion patterns that vary across topics of information. Studying such cross-population diffusion in various contexts helps us understand trends of information diffusion in a more accurate and consistent way. In this study, we focus on real-world news diffusion across online social systems such as mainstream news (News), social networking sites (SNS), and blogs (Blog), and we analyze behavioral patterns of the systems in terms of activity, reactivity, and heterogeneity. We found that News is the most active, SNS is the most reactive, and Blog is the most persistent, which governs time-evolving heterogeneity of these systems. Finally, we interpret the discovered crowd phenomena from various angles using our previous model-free and model-driven approaches, showing that the strength and directionality of influence reflect the behavioral patterns of the systems in news diffusion.
Describing and contextualizing events in TV news show BIBAFull-Text 759-764
  José Luis Redondo Garcia; Laurens De Vocht; Raphael Troncy; Erik Mannens; Rik Van de Walle
Describing multimedia content in general and TV programs in particular is a hard problem. Relying on subtitles to extract named entities that can be used to index fragments of a program is a common method. However, this approach is limited to what is being said in a program and written in a subtitle, therefore lacking a broader context. Furthermore, this type of index is restricted to a flat list of entities. In this paper, we combine the power of non-structured documents with structured data coming from DBpedia to generate much richer, context-aware metadata for a TV program. We demonstrate that we can harvest a rich context by expanding an initial set of named entities detected in a TV fragment. We evaluate our approach on a TV news show.
News from the crowd: grassroots and collaborative journalism in the digital age BIBAFull-Text 765-768
  Jochen Spangenberg; Nicolaus Heise
Information content provided by members of the general public (mostly online) is playing an ever-increasing role in the detection, production and distribution of news. This paper investigates the concepts of (1) grassroots journalism and (2) collaborative journalism. It looks into similarities and differences and shows how both phenomena have been influenced by the emergence of the Internet and digital technologies. Then, the consequences for journalism in general will be analysed. Ultimately, strategies to meet the new challenges are suggested in order to maintain the quality and reliability of news coverage in the future.

Social recommender systems (SRS2014) workshop

Exploring social activeness and dynamic interest in community-based recommender system BIBAFull-Text 771-776
  Bin Yin; Yujiu Yang; Wenhuang Liu
Community-based recommender systems have attracted much research attention. Forming communities allows us to reduce data sparsity and focus on discovering the latent characteristics of communities instead of individuals. Previous work focused on how to detect the community using various algorithms. However, it failed to consider users' social attributes, such as social activeness and dynamic interest, which correlate strongly with users' preferences and choices. Intuitively, people have different social activeness in a social network. Ratings from users with high activeness are more likely to be trustworthy. The temporal dynamics of interest are also significant to a user's preference. In this paper, we propose a novel community-based framework. We first employ a PLSA-based model incorporating social activeness and dynamic interest to discover communities. Then a state-of-the-art matrix factorization method is applied to each of the communities. Experimental results on two real-world datasets validate the effectiveness of our method for improving recommendation performance.
Mining user trails in critiquing based recommenders BIBAFull-Text 777-780
  Skanda Raj Vasudevan; Sutanu Chakraborti
Critiquing based recommenders are very commonly used to help users navigate through the product space to find the required product by tweaking/critiquing one or more features. By critiquing a product, the user gives informative feedback (i.e., which feature needs to be modified) about why they rejected one product and preferred another. As a user interacts with such a system, trails are left behind. We propose ways of leveraging these trails to induce preference models of items, which can be used to estimate the relative utilities of products and, in turn, to rank the recommendations presented to the user. The idea is to effectively complement knowledge of explicit user interactions in traditional social recommenders with knowledge implicitly obtained from trails.
Folksonomy based socially-aware recommendation of scholarly papers for conference participants BIBAFull-Text 781-786
  Feng Xia; Nana Yaw Asabere; Haifeng Liu; Nakema Deonauth; Fengqi Li
Due to the significant proliferation of scholarly papers in both conferences and journals, recommending relevant papers to researchers for academic learning has become a substantial problem. Conferences, in comparison to journals, have an aspect of social learning, which allows personal familiarization through various interactions among researchers. In this paper, we improve the social awareness of participants of smart conferences by proposing an innovative folksonomy-based paper recommendation algorithm, namely, Socially-Aware Recommendation of Scholarly Papers (SARSP). Our proposed algorithm recommends scholarly papers, issued by Active Participants (APs), to other Group Profile participants at the same smart conference based on similarity of their research interests. Furthermore, through computation of social ties, SARSP generates effective recommendations of scholarly papers to participants who have strong social ties with an AP. Through a relevant real-world dataset, we evaluate our proposed algorithm. Our experimental results verify that SARSP has encouraging improvements over other existing methods.
Online dating recommendations: matching markets and learning preferences BIBAFull-Text 787-792
  Kun Tu; Bruno Ribeiro; David Jensen; Don Towsley; Benyuan Liu; Hua Jiang; Xiaodong Wang
Recommendation systems for online dating have recently attracted much attention from the research community. In this paper we propose a two-sided matching framework for online dating recommendations and design a Latent Dirichlet Allocation (LDA) model to learn the user preferences from the observed user messaging behavior and user profile features. Experimental results using data from a large online dating website show that two-sided matching improves the rate of successful matches by as much as 45%. Finally, using simulated matching, we show that the LDA model can correctly capture user preferences.
A new correlation-based information diffusion prediction BIBAFull-Text 793-798
  Jong-Ryul Lee; Chin-Wan Chung
For predicting the diffusion process of information, we introduce and analyze a new correlation between the information adoptions of users sharing a friend in online social networks. Based on the correlation, we propose a probabilistic model to estimate the probability of a user's adoption using the naive Bayes classifier. Next, we build a recommendation method using the probabilistic model. Finally, we demonstrate the effectiveness of the proposed method with data from Flickr and MovieLens, which are well-known web services. For all cases in the experiments, the proposed method is more accurate than the comparison methods.
Are influential writers more objective?: an analysis of emotionality in review comments BIBAFull-Text 799-804
  Lionel Martin; Valentina Sintsova; Pearl Pu
People increasingly rely on other consumers' opinions to make online purchase decisions. Amazon alone provides access to millions of reviews, risking information overload for the average user. Recent research has thus aimed at understanding and identifying reviews that are considered helpful. Most such work analyzed the structure and connectivity of social networks to identify influential users. We believe that insight about influence can be gained from analyzing the affective content of the text as well as affect intensity. We employ text mining to extract the emotionality of 68,049 hotel reviews in order to investigate how those influencers behave, especially their choice of words. We analyze whether texts with words and phrases indicative of a writer's emotions, moods, and attitudes are more likely to trigger a genuine interest compared to more neutral texts. Our initial hypothesis was that influential writers are more likely to refrain from expressing their sentiments in order to achieve greater perceived objectivity. But contrary to this initial assumption, our study shows that they use more affective words, both in terms of emotion variety and intensity. This work describes the first step towards building a helpfulness prediction algorithm using emotion lexicons.
Tensor-based item recommendation using probabilistic ranking in social tagging systems BIBAFull-Text 805-810
  Noor Ifada; Richi Nayak
A common problem with the use of tensor modeling in generating quality recommendations for large datasets is scalability. In this paper, we propose the Tensor-based Recommendation using Probabilistic Ranking method, which generates the reconstructed tensor using block-striped parallel matrix multiplication and then probabilistically calculates the preferences of users to rank the recommended items. Empirical analysis on two real-world datasets shows that the proposed method is scalable for large tensor datasets and is able to outperform the benchmarking methods in terms of accuracy.
Random walks in recommender systems: exact computation and simulations BIBAFull-Text 811-816
  Colin Cooper; Sang Hyuk Lee; Tomasz Radzik; Yiannis Siantos
A recommender system uses information about known associations between users and items to compute for a given user an ordered recommendation list of items which this user might be interested in acquiring. We consider ordering rules based on various parameters of random walks on the graph representing associations between users and items. We experimentally compare the quality of recommendations and the required computational resources of two approaches: (i) calculate the exact values of the relevant random walk parameters using matrix algebra; (ii) estimate these values by simulating random walks. In our experiments we include methods proposed by Fouss et al. and Gori and Pucci, method P3, which is based on the distribution of the random walk after three steps, and method P3a, which generalises P3. We show that the simple method P3 can outperform previous methods and method P3a can offer further improvements. We show that the time- and memory-efficiency of direct simulation of random walks allows application of these methods to large datasets. We use in our experiments the three MovieLens datasets.
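As a rough illustration of the three-step idea behind P3 described in the abstract above, the following minimal sketch scores items by the probability that a simple random walk on the user-item graph, started at a given user, sits on that item after three steps. The function name `p3_scores`, the dictionary-of-sets input, and the uniform transition probabilities are assumptions made for this example; they are not taken from the paper, and the P3a generalisation is not covered.

```python
from collections import defaultdict


def p3_scores(ratings, user):
    """Rank items unseen by `user` using a 3-step random walk (illustrative).

    `ratings` maps each user to the set of items associated with that user.
    Each step moves to a uniformly chosen neighbour in the bipartite graph.
    """
    # Build the item -> users side of the bipartite graph.
    item_users = defaultdict(set)
    for u, items in ratings.items():
        for i in items:
            item_users[i].add(u)

    scores = defaultdict(float)
    own_items = ratings[user]
    for i in own_items:                          # step 1: user -> item
        p1 = 1.0 / len(own_items)
        for v in item_users[i]:                  # step 2: item -> user
            p2 = p1 / len(item_users[i])
            for j in ratings[v]:                 # step 3: user -> item
                scores[j] += p2 / len(ratings[v])

    # Keep only items the user is not already associated with,
    # ordered by decreasing probability (ties broken by item name).
    unseen = [(j, s) for j, s in scores.items() if j not in own_items]
    return sorted(unseen, key=lambda js: (-js[1], js[0]))
```

This exact enumeration corresponds to approach (i) in the abstract on a tiny graph; approach (ii) would instead estimate the same probabilities by sampling many three-step walks.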
Towards a scalable social recommender engine for online marketplaces: the case of apache solr BIBAFull-Text 817-822
  Emanuel Lacic; Dominik Kowald; Denis Parra; Martin Kahr; Christoph Trattner
Recent research has unveiled the importance of online social networks for improving the quality of recommenders in several domains, which has encouraged the research community to investigate ways to better exploit social information for recommendations. However, there is a lack of work that offers details of frameworks that allow an easy integration of social data with traditional recommendation algorithms in order to yield a straightforward and scalable implementation of new and existing systems. Furthermore, it is rare to find details of performance evaluations of recommender systems such as hardware and software specifications or benchmarking results of server load tests. In this paper we intend to bridge this gap by presenting the details of a social recommender engine for online marketplaces built upon the well-known search engine Apache Solr. We describe our architecture and also share implementation details to facilitate the re-use of our approach by people implementing recommender systems. In addition, we evaluate our framework from two perspectives: (a) recommendation algorithms and data sources, and (b) system performance under server stress tests. Using a dataset from the SecondLife virtual world that has both trading and social interactions, we contribute to research in social recommenders by showing how certain social features can improve recommendations in online marketplaces. On the platform implementation side, our evaluation results can serve as a baseline for people searching for performance references in terms of scalability, model training and testing trade-offs, real-time server performance and the impact of model updates in a production system.

Temporal web analytics workshop (TempWeb'14)

Multiple media analysis and visualization for understanding social activities BIBAFull-Text 825-826
  Masashi Toyoda
The Web has involved diverse media services, such as blogs, photo/video/link sharing, social networks, and microblogs. These Web media react to and affect real-world events, while the mass media still have a big influence on social activities.
   The Web and mass media now affect each other. Our use of media has evolved dynamically in the last decade, and this affects our societal behavior. For instance, the first photo of a plane crash landing during the "Miracle on the Hudson" on January 15, 2009 appeared and spread on Twitter and was then used in TV news. During the "Chelyabinsk Meteor" incident on February 15, 2013, many people posted videos of the incident on YouTube, and mass media then reused them in TV programs.
   Large-scale collection, analysis, and visualization of those multiple media are strongly required for sociology, linguistics, risk management, and marketing research. We are building a huge-scale Japanese web archive, and various analytics engines with a large-scale display wall. Our archive consists of 30 billion web pages crawled for 14 years, 1 billion blog posts for 7 years, and 15 billion tweets for 3 years.
   In this talk, I present several analysis and visualization systems based on network analysis, natural language processing, image processing, and 3-dimensional visualization.
Analyzing temporal characteristics of check-in data BIBAFull-Text 827-832
  Sushma Bannur; Omar Alonso
There is a surge in the use of location activity in social media, in particular to broadcast the change of physical whereabouts. We are interested in analyzing the temporal characteristics of check-in data from the user's perspective and also at the aggregate level for detecting patterns. In this paper we conduct a large study using check-in data from Facebook to analyze different temporal characteristics in four venue categories (restaurants, movies, shopping, and get-away). We present the results of this study and outline application areas where the conjunction of location- and temporal-aware data can help in new search scenarios.
TempoWordNet for sentence time tagging BIBAFull-Text 833-838
  Gaël Harry Dias; Mohammed Hasanuzzaman; Stéphane Ferrari; Yann Mathet
In this paper, we propose to build a temporal ontology, which may contribute to the success of time-related applications. Temporal classifiers are learned from a set of time-sensitive synsets and then applied to the whole WordNet to give rise to TempoWordNet. So, each synset is augmented with its intrinsic temporal value. To evaluate TempoWordNet, we use a semantic vector space representation for sentence temporal classification, which shows that improvements may be achieved with the time-augmented knowledge base against a bag-of-ngrams representation.
Extracting and aggregating temporal events from text BIBAFull-Text 839-844
  Lars Döhling; Ulf Leser
Finding reliable information about a given event from large and dynamic text collections is a topic of great interest. For instance, rescue teams and insurance companies are interested in concise facts about damages after disasters, which can be found in web blogs, newspaper articles, social networks etc. However, finding, extracting, and condensing specific facts is a highly complex undertaking: It requires identifying appropriate textual sources, recognizing relevant facts within the sources, and aggregating extracted facts into a condensed answer despite inconsistencies, uncertainty, and changes over time. In this paper, we present a three-step framework providing techniques and solutions for each of these problems. We tested the feasibility of extracting time-associated event facts using our framework in a comprehensive case study: gathering data on particular earthquakes from web data sources. Our results show that it is, under certain circumstances, possible to automatically obtain reliable and timely data on natural disasters from the web.
NTCIR temporalia: a test collection for temporal information access research BIBAFull-Text 845-850
  Hideo Joho; Adam Jatowt; Roi Blanco
Time is one of the key constructs of information quality. Following an upsurge of research in temporal aspects of information search, it has become clear that the community needs a standardized evaluation benchmark for fostering research in Temporal Information Access. This paper introduces Temporalia (Temporal Information Access), a new pilot task run at NTCIR-11 to create re-usable datasets for those who are interested in temporal aspects of search technologies, and discusses its task design in detail.
Infrastructure for supporting exploration and discovery in web archives BIBAFull-Text 851-856
  Jimmy Lin; Milad Gholami; Jinfeng Rao
Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. Our system provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing. Relying on HBase for storage infrastructure simplifies the development of scalable and responsive applications. We describe a service that provides temporal browsing and an interactive visualization based on topic models that allows users to explore archived content.
Wikipedia as a time machine BIBAFull-Text 857-862
  Stewart Whiting; Joemon Jose; Omar Alonso
Wikipedia encyclopaedia projects, which consist of vast collections of user-edited articles covering a wide range of topics, are among the most popular websites on the Internet. With so many users working collaboratively, mainstream events are often very quickly reflected by both authors editing content and users reading articles. With temporal signals such as changing article content, page viewing activity and the link graph readily available, Wikipedia has gained attention in recent years as a source of temporal event information. This paper serves as an overview of the characteristics and past work which support Wikipedia (English, in this case) for time-aware information retrieval research. Furthermore, we discuss the main content and meta-data temporal signals available along with illustrative analysis. We briefly discuss the source and nature of each signal, and any issues that may complicate extraction and use. To encourage further temporal research based on Wikipedia, we have released all the distilled datasets referred to in this paper.
The 4th temporal web analytics workshop (TempWeb'14) BIBAFull-Text 863-864
  Marc Spaniol; Julien Masanès; Ricardo Baeza-Yates
In this paper we give an overview of the 4th Temporal Web Analytics Workshop (TempWeb). The goal of TempWeb is to provide a venue for researchers of all domains (IE/IR, Web mining, etc.) where the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop's ambition is to help shape a community of interest on the research challenges and possibilities resulting from the introduction of the time dimension in web analysis. Having a dedicated workshop will help, we believe, to take a rich and cross-domain approach to this new research challenge with a strong focus on the temporal dimension. For the fourth time, TempWeb has been organized in conjunction with the International World Wide Web (WWW) conference, being held on April 8, 2014 in Seoul, Korea.

Theory and practice of social machines 2014 workshop

Personal APIs as an enabler for designing and implementing people as social machines BIBAFull-Text 867-872
  Vanilson Buregio; Leandro Nascimento; Nelson Rosa; Silvio Meira
In this paper, we extend the initial classification scheme for Social Machines (SM) by including Personal APIs as a new SM-related topic of research inquiry. Personal APIs basically refer to the use of Open Application Programming Interfaces (Open APIs) to programmatically access information about a person (e.g., personal basic info, health-related statistics, busy data) and/or trigger his/her human capabilities in a standardized way. Here, we provide an overview of some existing Personal APIs and show how this approach can be used to enable the design and implementation of people as individual SMs on the Web. A proof-of-concept system that demonstrates these ideas is also outlined in this paper.
A new architecture description language for social machines BIBAFull-Text 873-874
  Leandro Marques do Nascimento; Vanilson A. A. Burégio; Vinicius C. Garcia; Silvio R. L. Meira
The term "Social Machine" (SM) has been commonly used as a synonym for what is known as the programmable web or web 3.0. Some definitions of a Social Machine have already been provided and they basically support the notion of relationships between distributed entities. The type of relationship molds which services would be provided or required by each machine, and under certain complex constraints. In order to deal with the complexity of this emerging web, we present a language that can describe networks of Social Machines, named SMADL -- the Social Machine Architecture Description Language. In a few words, SMADL is a relationship-driven language which can be used to describe the interactions between any number of machines in a multitude of ways, as a means to represent real machines interacting in the real web, such as Twitter running on top of Amazon AWS or mash-ups built upon Google Maps, and obviously, as a means to represent interactions with other social machines too.
LSCitter: building social machines by augmenting existing social networks with interaction models BIBAFull-Text 875-880
  Dave Murray-Rust; Dave Robertson
We present LSCitter, an implemented framework for supporting human interaction on social networks with formal models of interaction, designed as a generic tool for creating social machines on existing infrastructure. Interaction models can be used to choreograph distributed systems, providing points of coordination and communication between multiple interacting actors. While existing social networks specify how interactions happen -- who messages go to and when, the effects of carrying out actions -- these are typically implicit, opaque and non user-editable. Treating interaction models as first class objects allows the creation of electronic institutions, on which users can then choose the kinds of interaction they wish to engage in, with protocols which are explicit, visible and modifiable. However, there is typically a cost to users to engage with these institutions. In this paper we introduce the notion of "shadow institutions", where actions on existing social networks are mapped onto formal interaction protocols, allowing participants access to computational intelligence in a seamless, zero-cost manner to carry out computation and store information.
Community structure for efficient information flow in 'ToS;DR', a social machine for parsing legalese BIBAFull-Text 881-884
  Reuben Binns; David Matthews
This paper presents a case study of 'Terms-of-Service; Didn't Read', a social machine to curate, parse, and rate website terms and privacy policies. We examine the relationships between its human contributors and machine counterparts to determine community structure and information flow.
The Berners-Lee hypothesis: power laws and group structure in flickr BIBAFull-Text 885-890
  Harry Halpin; Andrea Capocci
An intriguing hypothesis, first suggested by Tim Berners-Lee, is that the structure of online groups should conform to a power law distribution. We believe this is likely a consequence of the Dunbar Number, which is a supposed limit to the number of persistent social contacts a user can have in a group. As preliminary results, we show that the number of contacts of a typical Flickr user, the number of groups a user belongs to, and the size of Flickr groups all follow power law distributions. Furthermore, we find some unexpected differences in the internal structure of public and private Flickr groups. For future research, we further operationalize the Berners-Lee hypothesis to suppose that users with a group membership distribution that follows a power law will produce more content for social Web systems.
Community-based crowdsourcing BIBAFull-Text 891-896
  Marco Brambilla; Stefano Ceri; Andrea Mauri; Riccardo Volonterio
This paper is focused on community-based crowdsourcing applications, i.e. the ability to spawn crowdsourcing tasks upon multiple communities of performers, thus leveraging the peculiar characteristics and capabilities of the community members. We show that dynamic adaptation of crowdsourcing campaigns to community behaviour is particularly relevant. We demonstrate that this approach can be very effective for obtaining answers from communities, with very different size, precision, delay and cost, by exploiting the social networking relations and the features of the crowdsourcing task. We show the approach at work within the CrowdSearcher platform, which allows configuring and dynamically adapting crowdsourcing campaigns tailored to different communities. We report on an experiment demonstrating the effectiveness of the approach.
Constructed identity and social machines: a case study in creative media production BIBAFull-Text 897-902
  Amy Guy; Ewan Klein
Current discussions of social machines rightly emphasise a human's role as a crucial part of a system rather than a user of a system. The human 'parts' are typically considered in terms of their aggregate outcomes and collective behaviours, but human participants are rarely all equal, even within a small system. We argue that due to the complex nature of online identity, understanding participants in a more granular way is crucial for social machine observation and design. We present the results of a study of the personas portrayed by participants in a social machine that produces creative media content, and discover that inconsistent or misleading representations of individuals do not necessarily undermine the system in which they are participating. We describe a preliminary framework for making sense of human participants in social machines, and the ongoing work that develops this further.
Government as a social machine in an ecosystem BIBAFull-Text 903-904
  Thanassis Tiropanis; Anni Rowland-Campbell; Wendy Hall
The Web is becoming increasingly pervasive throughout all aspects of human activity. As citizens and organisations adopt Web technologies, so governments are beginning to respond by themselves utilising the electronic space. Much of this has been reactive, and there is very little understanding of the impact that Web technologies are having on government systems and processes, let alone a proactive approach to designing systems that can ensure a positive and beneficial societal impact. The ecosystem which encompasses governments, citizens and communities is both evolving and adaptive, and the only way to examine and understand the development of Web-enabled government, and its possible implications, is to consider government itself as a "social machine" within a social machine ecosystem. In this light, there are significant opportunities and challenges for government that this paper identifies.
Introducing the omega-machine BIBAFull-Text 905-908
  Lei Zhang; Thanassis Tiropanis; Wendy Hall; Sung-Hyon Myaeng
In this paper, we propose the Ω-machine model for social machines. By introducing a cluster of "oracles" to a traditional Turing machine, the Ω-machine is capable of describing the interaction between human participants and mechanical machines. We also give two examples of social machines, collective intelligence and rumor spreading, and demonstrate how the general Ω-machine model could be used to simulate their computations.
Working out the plot: the role of stories in social machines BIBAFull-Text 909-914
  Ségolène M. Tarte; David De Roure; Pip Willcox
Although Social Machines do not yet have a formalized definition, some efforts have been made to characterize them from a "machinery" point of view. In this paper, we present a methodology by which we attempt to reveal the sociality of Social Machines; to do so, we adopt the analogy of stories. By assimilating a Social Machine to a story, we can identify the stories within and about that machine and how this storytelling perspective might reveal the sociality of Social Machines. After illustrating this storytelling approach with a few examples, we then propose three axes of inquiry to evaluate the health of a social machine: (1) assessment of the sociality of a Social Machine through evaluation of its storytelling potential and realization; (2) assessment of the sustainability of a Social Machine through evaluation of its reactivity and interactivity; and (3) assessment of emergence through evaluation of the collaboration between authors and of the distributed/mixed nature of authority.
7 billion home telescopes: observing social machines through personal data stores BIBAFull-Text 915-920
  Max Van Kleek; Daniel Alexander Smith; Ramine Tinati; Kieron O'Hara; Wendy Hall; Nigel R. Shadbolt
Web Observatories aim to develop techniques and methods to allow researchers to interrogate and answer questions about society through the multitudes of digital traces people now create. In this paper, we propose that a possible path towards surmounting the inevitable obstacle of personal privacy towards such a goal is to keep data with individuals, under their own control, while enabling them to participate in Web Observatory-style analyses in situ. We discuss the kinds of applications such a global, distributed, linked network of Personal Web Observatories might have, a few of the many challenges that must be resolved towards realising such an architecture in practice, and finally, our work towards a fundamental reference building block of such a network.

Vertical search relevance 2014 workshop

Seed selection for domain-specific search BIBAFull-Text 923-928
  Pattisapu Nikhil Priyatam; Ajay Dubey; Krish Perumal; Sai Praneeth; Dharmesh Kakadia; Vasudeva Varma
The last two decades have witnessed an exponential rise in web content from a plethora of domains, which has necessitated the use of domain-specific search engines. Diversity of crawled content is one of the crucial aspects of a domain-specific search engine. To a large extent, diversity is governed by the initial set of seed URLs. Most of the existing approaches rely on manual effort for seed selection. In this work we automate this process using URLs posted on Twitter. We propose an algorithm to get a set of diverse seed URLs from a Twitter URL graph. We compare the performance of our approach against the baseline zero similarity seed selection method and find that our approach beats the baseline by a significant margin.

Web APIs and RESTful design 2014 workshop

Pragmatic hypermedia: creating a generic, self-inflating API client for production use BIBAFull-Text 931-936
  Pete Gamache
Hypermedia API design is a method of creating APIs using hyperlinks to represent and publish an API's functionality. Hypermedia-based APIs bring theoretical advantages over many other designs, including the possibility of self-updating, generic API client software. Such hypermedia API clients have only lately come to exist, and the existing hypermedia client space did not compare favorably to custom API client libraries, requiring somewhat tedious manual access to HTTP resources. Nonetheless, the limitations in creating a compelling hypermedia client were few.
   This paper describes the design and implementation of HyperResource, a fully generic, production-ready Ruby client library for hypermedia APIs. The project leverages the inherent practicality of hypermedia design, demonstrates its immediate usefulness in creating self-generating API clients, enumerates several abstractions and strategies that help in creating hypermedia APIs and clients, and promotes hypermedia API design as the easiest option available to an API programmer.
REST to JavaScript for better client-side development BIBAFull-Text 937-942
  Hyunghun Cho; Sukyoung Ryu
In today's Web-centric era, embedded systems mash up various web services via RESTful web services. RESTful web services use REST APIs that describe actions as resource state transfers via standard HTTP methods such as GET, PUT, POST, and DELETE. While RESTful web services are lightweight and executable on any platform that supports HTTP methods, writing programs composed of only such primitive methods is not a familiar concept to developers. Therefore, no single design strategy for (fully) RESTful APIs works for arbitrary domains, and current REST APIs are system dependent, incomplete, and likely to change. To help server-side development of REST APIs, several domain-specific languages such as WADL, WSDL 2.0, and RDF provide automatic tools to generate REST APIs. However, client-side developers, who often do not know the web services domain and do not understand RESTful web services, suffer from the lack of any development help. In this paper, we present a new approach to build JavaScript APIs that are more accessible to client-side developers than REST APIs. We show a case study of our approach that uses JavaScript APIs and their wrapper implementation instead of REST APIs, and we describe the efficiency in the client-side development.
Atomic distributed transactions: a RESTful design BIBAFull-Text 943-948
  Guy Pardon; Cesare Pautasso
The REST architectural style supports the reliable interaction of clients with a single server. However, no guarantees can be made for more complex interactions that require atomically transferring state among resources distributed across multiple servers. In this paper we describe a lightweight design for transactional composition of RESTful services. The approach -- based on the Try-Cancel/Confirm (TCC) pattern -- does not require any extension to the HTTP protocol. The design assumes that resources are designed to comply with the TCC pattern and ensures that the resources involved in the transaction are not aware of it. It delegates the responsibility of achieving the atomicity of the transaction to a coordinator which exposes a RESTful API.
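The Try-Cancel/Confirm idea named in this abstract can be sketched in a few lines. This is only an in-memory illustration under stated assumptions, not the authors' design: the `Resource` class, its method names, and the `run_transaction` coordinator are hypothetical, and plain method calls stand in for the HTTP requests a real coordinator would issue.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical TCC-compliant resource, e.g. a seat inventory."""
    name: str
    capacity: int
    reserved: int = 0   # tentative reservations (tried, not yet confirmed)
    confirmed: int = 0  # final reservations

    def try_reserve(self) -> bool:
        """Try: make a tentative reservation; fail if nothing is left."""
        if self.reserved + self.confirmed >= self.capacity:
            return False
        self.reserved += 1
        return True

    def confirm(self) -> None:
        """Confirm: turn one tentative reservation into a final one."""
        self.reserved -= 1
        self.confirmed += 1

    def cancel(self) -> None:
        """Cancel: release one tentative reservation."""
        self.reserved -= 1


def run_transaction(resources: list[Resource]) -> bool:
    """Coordinator: confirm everything only if every Try succeeded."""
    tried: list[Resource] = []
    for resource in resources:
        if resource.try_reserve():
            tried.append(resource)
        else:
            # One Try failed: cancel every earlier tentative reservation.
            for earlier in tried:
                earlier.cancel()
            return False
    for resource in tried:
        resource.confirm()
    return True
```

In this sketch the resources know nothing about each other or about the transaction as a whole; only the coordinator decides between Confirm and Cancel, which mirrors the division of responsibility the abstract describes.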
Seven challenges for RESTful transaction models BIBAFull-Text 949-952
  Nandana Mihindukulasooriya; Miguel Esteban-Gutiérrez; Raúl García-Castro
The REpresentational State Transfer (REST) architectural style describes the design principles that made the World Wide Web scalable, and the same principles can be applied in an enterprise context to achieve loosely coupled and scalable application integration. In recent years, RESTful services have been gaining traction in the industry and are commonly used as a simpler alternative to SOAP Web Services. However, one of the main drawbacks of RESTful services is the lack of standard mechanisms to support advanced quality-of-service requirements that are common to enterprises. Transaction processing is one of the essential features of enterprise information systems, and several transaction models have been proposed in the past years to fill the gap of transaction processing in RESTful services. The goal of this paper is to analyze the state-of-the-art RESTful transaction models and identify the current challenges.
Publish data as time consistent web API with provenance BIBAFull-Text 953-958
  Miel Vander Sande; Pieter Colpaert; Tom De Nies; Erik Mannens; Rik Van de Walle
Many organisations publish their data through a Web API. This stimulates use by Web applications, enabling reuse and enrichments. Recently, resource-oriented APIs have been increasing in popularity because of their scalability. However, for organisations subject to data archiving, creating such an API raises certain issues. Often, datasets are stored in different files and different formats. Therefore, tracking revisions is a challenging task and the API has to be custom built. Moreover, standard APIs only provide access to the current state of a resource. This creates time-based inconsistencies when they are combined. In this paper, we introduce an end-to-end solution for publishing a dataset as a time-based versioned REST API, with minimal input of the publisher. Furthermore, it publishes the provenance of each created resource. We propose a technology stack composed of prior work, which versions datasets, generates provenance, creates an API and adds Memento Datetime negotiation.
The W3C web cryptography API: motivation and overview BIBAFull-Text 959-964
  Harry Halpin
The W3C Web Cryptography API is the standard API for accessing cryptographic primitives in JavaScript-based environments. We describe the motivations behind the creation of the W3C Web Cryptography API and give a high-level overview with motivating use-cases while addressing objections.
A RESTful API for controlling dynamic streaming topologies BIBAFull-Text 965-970
  Masiar Babazadeh; Cesare Pautasso
Streaming applications have become more and more dynamic and heterogeneous thanks to new technologies which enable platforms like microcontrollers and Web browsers to be able to host part of a streaming topology. A dynamic heterogeneous streaming application should support load balancing and fault tolerance while being capable of adapting and rearranging topologies to user needs at runtime. In this paper we present a REST API to control dynamic heterogeneous streaming applications. By means of resources, their uniform interface and hypermedia we show how it is possible to monitor, change and adapt the deployment configuration of a streaming topology at runtime.
The COMPOSE API for the internet of things BIBAFull-Text 971-976
  Juan Luis Pérez; Álvaro Villalba; David Carrera; Iker Larizgoitia; Vlad Trifa
The COMPOSE project aims to provide an open Marketplace for the Internet of Things as well as the necessary platform to support it. A necessary component of COMPOSE is an API that allows things, COMPOSE users and the platform to communicate. The COMPOSE API allows for things to push data to the platform, the platform to initiate asynchronous actions on the things, and COMPOSE users to retrieve and process data from the things. In this paper we present the design and implementation of the COMPOSE API, as well as a detailed description of the key requirements that the API must satisfy. The API documentation and the source code for the platform are available online.
An approach for composing RESTful linked services on the web BIBAFull-Text 977-982
  Mahdi Bennara; Michaël Mrissa; Youssef Amghar
In this paper, we present an approach to compose linked services on the Web based on the principles of linked data and REST. Our contribution is a unified method for discovering both the interaction possibilities a service offers and the available semantic links to other services. Our composition engine is implemented as a generic client that allows exploring a service API and interacting with other services to answer the user's goal. We rely on a typical scenario in order to illustrate the benefits of our composition approach. We implemented a prototype to demonstrate the applicability of our proposal, and we experiment with it and discuss the results obtained.

Web intelligence and communities workshop (WI&C 2014)

Exploring intelligence of web communities BIBAFull-Text 985-990
  Rajendra Akerkar; Pierre Maret; Laurent Vercouter
Web Intelligence is a multidisciplinary area dealing with utilizing data and services over the Web to create new data and services using Information and Communication Technologies (ICT) and intelligent techniques. The link to Networking and Web Communities (WCs) is apparent: the Web is a set of nodes, providing and consuming data and services; the permanent or temporary ties and exchanges between these nodes build the virtual communities; and the ICT and intelligent techniques influence the modeling and the processes, and automate (or semi-automate) communication and cooperation. In this paper, we will explore one aspect of (Web) intelligence pertinent to the Web Communities. The "intelligent" features may emerge in a Web community from interactions and knowledge-transmissions between the community members. We will also introduce the WI&C'14 workshop's goal and structure.
Towards web intelligence through the crowdsourcing of semantics BIBAFull-Text 991-992
  Sören Auer; Dimitris Kontokostas
A key success factor for the Web as a whole was and is its participatory nature. We discuss strategies for engaging human-intelligence to make the Web more semantic.
Population dynamics in open source communities: an ecological approach applied to github BIBAFull-Text 993-998
  Pablo Loyola; In-Young Ko
Open Source Software (OSS) has gained considerable popularity during the last few years. It is increasingly used by public and private institutions, and even companies release portions of their code to obtain feedback from the community of voluntary developers. As OSS is based on the voluntary contributions of developers, the number of participants represents one of the key elements that impact the quality of the software. In order to understand how the population of contributors evolves over time, we propose a methodology that adapts Lotka-Volterra-based biological models used for describing host-parasite interactions. Experiments based on data from the GitHub collaborative platform showed that the proposed approach performs effectively in terms of providing an estimation of the population of developers for each project over time.
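For orientation, the classic two-species Lotka-Volterra system on which such models are based can be written as follows. The abstract does not give the adapted form used in the paper, so this is only the textbook formulation; the symbols and any mapping of the two populations onto projects and developers are assumptions made here for illustration.

```latex
\begin{aligned}
\frac{dx}{dt} &= \alpha x - \beta x y, \\
\frac{dy}{dt} &= \delta x y - \gamma y,
\end{aligned}
```

Here $x(t)$ and $y(t)$ are the sizes of the two interacting populations (host and parasite in the biological setting), $\alpha$ is the growth rate of $x$ in the absence of $y$, $\gamma$ is the decline rate of $y$ in the absence of $x$, and $\beta$ and $\delta$ weight the effect of the interaction term $xy$ on each population.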
History-guided conversational recommendation BIBAFull-Text 999-1004
  Yasser Salem; Jun Hong; Weiru Liu
Product recommendation is an important aspect of many e-commerce systems. It provides an effective way to help users navigate complex product spaces. In this paper, we focus on critiquing-based recommenders. We present a new critiquing-based approach, History-Guided Recommendation (HGR), which is capable of using the recommendation pairs (item and critique), or the critiques alone, made so far in the current recommendation session to predict the most likely product recommendations and therefore short-cut the sometimes protracted recommendation sessions in standard critiquing approaches. The HGR approach shows a significant improvement in the interactions between the user and the recommender. It also enables successfully accepted recommendations to be made much earlier in the session.
Strengthening collaborative data analysis and decision making in web communities BIBAFull-Text 1005-1010
  Nikos Karacapilidis; Spyros Christodoulou; Manolis Tzagarakis; Georgia Tsiliki; Costas Pappis
Generally speaking, modern research becomes increasingly interdisciplinary and collaborative in nature. Researchers need to collaborate and make decisions by meaningfully assembling, mining and analyzing available large-scale volumes of complex multi-faceted data residing in different sources. At the same time, they need to efficiently and effectively exploit services available over the Web. Arguing that dealing with data-intensive and cognitively complex settings is not a technical problem alone, this paper presents a novel collaboration support platform for Web communities. The proposed solution adopts a hybrid approach that builds on the synergy between machine and human intelligence to facilitate the underlying sense-making and decision making processes. User experience shows that the platform enables stakeholders to make better, more informed and quicker decisions. The functionalities of the proposed platform are described through a real-world case from a biomedical research community.
A semantic web of know-how: linked data for community-centric tasks BIBAFull-Text 1011-1016
  Paolo Pareti; Ewan Klein; Adam Barker
This paper proposes a novel framework for representing community 'know-how' on the Semantic Web. Procedural knowledge generated by web communities typically takes the form of natural language instructions or videos and is largely unstructured. The absence of semantic structure impedes the deployment of many useful applications, in particular the ability to discover and integrate know-how automatically. We discuss the characteristics of community know-how and argue that existing knowledge representation frameworks fail to represent it adequately. We present a novel framework for representing the semantic structure of community know-how and demonstrate the feasibility of our approach by providing a concrete implementation which includes a method for automatically acquiring procedural knowledge for real-world tasks.
How placing limitations on the size of personal networks changes the structural properties of complex networks BIBAFull-Text 1017-1022
  Somayeh Koohborfardhaghighi; Jörn Altmann
People-to-people interactions in the real world and in virtual environments (e.g., Facebook) can be represented through complex networks. Changes in the structural properties of these complex networks are caused by a variety of dynamic processes. While accepting the fact that variability in individual patterns of behavior (i.e., establishment of random or FOAF-type potential links) in social environments might lead to an increase or decrease in the structural properties of a complex network, in this paper, we focus on another factor that may contribute to such changes, namely the size of personal networks. Any personal network comes with the cost of maintaining individual connections. Despite the fact that technology has shrunk our world, there is also a limit to how many close friends one can keep and count on. It is a relatively small number. In this paper, we develop a multi-agent based model to capture, compare, and explain the structural changes within a growing social network (e.g., expanding the social relations beyond one's social circles). We aim to show that, in addition to various dynamic processes of human interactions, limitations on the size of personal networks can also lead to changes in the structural properties of networks (i.e., the average shortest-path length). Our simulation results show that the famous small world theory of interconnectivity holds true, and that the world can be shrunk even further, if people manage to utilize all their existing connections to reach other parties. In addition, it can clearly be observed that the network's average path length has a significantly smaller value if the size of personal networks is set to larger values in our network growth model. Therefore, limitations on the size of personal networks in network growth models lead to an increase in the network's average path length.

Web observatory workshop (WOW2014)

The design of a live social observatory system BIBAFull-Text 1025-1030
  Huanbo Luan; Juanzi Li; Maosong Sun; Tat-Seng Chua
With the emergence of social networks and their potential impact on society, many research groups and organizations are collecting huge amounts of social media data from various sites to serve different applications. These systems offer insights on different facets of society at different moments of time. Collectively they are known as social observatory systems. This paper describes the architecture and implementation of a live social observatory system named 'NExT-Live'. It aims to analyze the live online social media data streams to mine social senses, phenomena, influences and geographical trends dynamically. It incorporates an efficient and robust set of crawlers to continually crawl online social interactions on various social network sites. The data crawled are stored and processed in a distributed Hadoop architecture. It then performs the analysis on these social media streams jointly to generate analytics at different levels. In particular, it generates high-level analytics about the sense of different target entities, including People, Locations, Topics and Organizations. NExT-Live offers a live observatory platform that enables people to know the happenings of the place in order to lead a better life.
Observing the web by understanding the past: archival internet research BIBAFull-Text 1031-1036
  Matthew S. Weber
This paper discusses the challenges and opportunities for using archival Internet data in order to observe a host of social science phenomena. Specifically, this paper introduces HistoryTracker, a new tool for accessing and extracting archived data from the Internet Archive, the largest repository of archived Web data in existence. The HistoryTracker tool serves to create a Web observatory that allows scholars to study the history of the Web. HistoryTracker takes advantage of Hadoop processing capacity, and allows researchers to extract large swaths of archived data into a link list format that can be easily transferred to a number of other analytical tools. A brief illustration demonstrating the use of HistoryTracker is presented. Finally, a number of continuing research challenges are discussed, and future research opportunities are outlined.
Fluctuation and burst response in social media BIBAFull-Text 1037-1042
  Mizuki Oka; Yasuhiro Hashimoto; Takashi Ikegami
A salient dynamic property of social media is bursting behavior. In this paper, we study bursting behavior in relation to the structure of fluctuation, known as fluctuation-response relation, to reveal the origin of bursts. More specifically, we study the temporal relation between a preceding baseline fluctuation and the successive burst response using a frequency time series of 3,000 keywords on Twitter. We find three types of keyword time series in terms of the fluctuation-response relation. For the first type of keyword, the baseline fluctuation has a positive correlation with the burst size; as the preceding fluctuation increases, the burst size increases. These bursts are caused endogenously as a result of word-of-mouth interactions in a social network; the keyword is sensitive only to the internal context of the system. For the second type, there is a critical threshold in the fluctuation value up to which a positive correlation is observed. Beyond this value, the size of the bursts becomes independent of the fluctuation size. Our analysis shows that this critical threshold emerges because the bursts in the time series are both endogenous and exogenous. This type of keyword is sensitive to internal and external stimuli. The third type consists mainly of exogenous bursts. This type of keyword is mostly sensitive only to external stimuli. These results are useful for characterizing how excitable a keyword is on Twitter and could be used, for example, for marketing purposes.
Humour reactions in crisis: a proximal analysis of Chinese posts on Sina Weibo in reaction to the Salt Panic of March 2011 BIBAFull-Text 1043-1048
  Gareth Paul Beeston; Manuel Leon Urrutia; Caroline Halcrow; Xianni Xiao; Lu Liu; Jinchuan Wang; Jinho Jay Kim; Kunwoo Park
This paper presents an analysis of humour use in Sina Weibo in reaction to the Chinese salt panic, which occurred as a result of the Fukushima disaster in March 2011. Basing the investigation on the humour Proximal Distancing Theory (PDT), and utilising a dataset from Sina Weibo in 2011, an examination of humour reactions is performed to identify the proximal spread of humorous Weibo posts in relation to the consequent salt panic in China. As a result of this method, we present a novel methodology for understanding humour reactions in social media, and provide recommendations on how such a method could be applied to a variety of other social media, crises, cultural and spatial settings.
Zooniverse: observing the world's largest citizen science platform BIBAFull-Text 1049-1054
  Robert Simpson; Kevin R. Page; David De Roure
This paper introduces the Zooniverse citizen science project and software framework, outlining its structure from an observatory perspective: both as an observable web-based system in itself, and as an example of a platform iteratively developed according to real-world deployment and used at scale. We include details of the technical architecture of Zooniverse, including the mechanisms for data gathering across the Zooniverse operation, access, and analysis. We consider the lessons that can be drawn from the experience of designing and running Zooniverse, and how this might inform development of other web observatories.
Visualising data in web observatories: a proposal for visual analytics development & evaluation BIBAFull-Text 1055-1060
  Paul Booth; Wendy Hall; Nicholas Gibbins; Spyros Galanis
Web Observatories use innovative analytic processes to gather insights from observed data and use the Web as a platform for publishing interactive data visualisations. Recordable events associated with interactivity on the Web provide an opportunity to openly evaluate the utility of these artefacts, assessing fitness for purpose and observing their use. The three principles presented in this paper propose a community evaluation approach to innovation in visual analytics and visualisation for Web Observatories through code sharing, the capturing of semantically enriched interaction data and by openly stating the intended goals of all visualisation work. The potential of this approach is exemplified by a set of front-end tools suitable for adoption by the majority of Web Observatories as a means of visualising data on the Web as part of the shared, open, and community-driven developmental process. The paper outlines the method for capturing user interaction data as a series of semantic events, which can be used to identify improvements in both the structure and functionality of visualisations. Such refinements in user behaviour are proposed as part of a new methodology that introduces Economics as an evaluation tool for visual analytics.
Legal and ethical considerations: step 1b in building a health web observatory BIBAFull-Text 1061-1066
  Marie Joan Kristine T. Gloria; John S. Erickson; Joanne S. Luciano; Dominic DiFranzo; Deborah L. McGuinness
This paper explores the impact of health information technologies, including the Web, on society and advocates for the development of a Health Web Observatory (HWO) to collect, store and analyze new sources of health information. The paper begins with a high-level literature review from across domains to demonstrate the need for a multi-disciplinary pursuit when building web observatories. As researchers in the social sciences and legal domains have highlighted, data carries assumptions of power, identity, governance, etc., which should not be overlooked. The paper then recommends example legal and ethical questions to consider when building any health web observatory. The goal is to insert social and regulatory concerns much earlier into the WO methodology.
Towards a taxonomy for web observatories BIBAFull-Text 1067-1072
  Ian C. Brown; Wendy Hall; Lisa Harris
In this paper, we propose an initial structure to support a taxonomy for Web Observatories (WO). The work is based on a small sample of cases drawn from the work of the Web Science Trust and the Web Science Institute and reflects aspects of academic, business and government Observatories. Whilst this is early work it is hoped, by drawing broad brushstrokes at the edges of different types of Observatory, that future work based on a more systematic review will refine this model and hence refine our understanding of the nature of Observatories. We also seek here to enhance a faceted classification scheme (which is thought to be weak in the area of visualisation) through the use of simplified concept maps.

Web-based education technologies workshop (WebET 2014)

Addictive links: engaging students through adaptive navigation support and open social student modeling BIBAFull-Text 1075-1076
  Peter Brusilovsky
Empirical studies of adaptive annotation in the educational context have demonstrated that it can help students to acquire knowledge faster, improve learning outcomes, reduce navigational overhead, and encourage non-sequential navigation. Over the last 8 years we have explored a lesser known effect of adaptive annotation -- its ability to significantly increase student engagement in working with non-mandatory educational content. In the presence of adaptive link annotation, students tend to access significantly more learning content; they stay with it longer, return to it more often and explore a wider variety of learning resources. This talk will present an overview of our exploration of the addictive links effect in many course-long studies, which we ran in several domains (C, SQL and Java programming), for several types of learning content (quizzes, problems, interactive examples). The first part of the talk will review our exploration of a more traditional knowledge-based personalization approach and the second part will focus on more recent studies of social navigation and open social student modeling.
Building engagement for MOOC students: introducing support for time management on online learning platforms BIBAFull-Text 1077-1082
  Ilona Nawrot; Antoine Doucet
The main objectives of massive open online courses (MOOCs) are to foster knowledge through free high quality learning materials procurement; to create new knowledge through diverse users' interactions with the providing platform; and to empower research on learning. However, MOOC providers are also businesses (either for-profit or not-for-profit). They are still in the early stages of their development, but sooner or later, in order to secure their existence and assure their long-term growth, they will have to adopt a business model and monetize the services they provide. Nevertheless, despite their popularity, MOOCs are characterized by a very high drop-out rate (about 90%), which may turn out to be a problem regardless of the adopted business model. Hence, MOOC providers can either assume the scale benefits to be sufficiently high to ignore the problem of low MOOC completion rate or tackle this problem. In this paper we explore the problem of the high drop-out rate in massive open online courses. First, by conducting an online survey, we identify its main cause, namely bad time organization. Secondly, we provide suggestions to reduce the rate. Specifically, we argue that MOOC platforms should not only provide their users with high quality educational materials and interaction facilities, but should also support and assist the users in their quest for knowledge. Thus, MOOC platforms should provide tools helping them optimize their time usage and subsequently develop metacognitive skills indispensable in proper time management of learning processes.
A web-based degree program in open source education: a case study BIBAFull-Text 1083-1086
  Kumar Saini Sanjeev; Anand S. Senthil; D. Arivudainambi; C. N. Krishnan
In this paper, we describe the details of an interactive online web-based degree program in the area of Computer Science with specialization in Free/Open Source Software (FOSS) that has been successfully running for two years in a leading technological university in India. The subjects taught as well as the tools and platforms used in delivering the course are exclusively FOSS and put together by the university team, as described here. We also describe the details of the program, its goals and purpose, the manner of its implementation, the lessons we have learned and the challenges being faced in going forward.
Tutoring from the desktop: facilitating learning through Google+ hangouts BIBAFull-Text 1087-1092
  Neel Guha
Many studies have demonstrated the effectiveness of tutoring as a teaching strategy. Though much attention has recently been focussed on using the web to extend the reach of the university classroom to high achieving students, comparatively less attention has been paid to the potential of the web to bring personalized tutoring to at-risk students. In this paper, we describe Tutoring From the Desktop, a program in which high school students in California use Google Plus Hangouts to tutor students in Kohlapur, India. We show how a simple structured program can be used to overcome the barriers of time-zones, accents and much more.
Crowdcrawling approach for community based plagiarism detection service BIBAFull-Text 1093-1096
  Sergey Butakov
In the era of an exponentially growing web and exploding online education, the problem of digital plagiarism has become one of the most burning ones in many areas. Efficient internet plagiarism detection tools should have a capacity similar to that of conventional web search engines. This requirement makes commercial plagiarism detection services expensive and therefore less accessible to smaller education institutions. This work-in-progress paper proposes the concept of crowdcrawling as a tool to distribute the most laborious part of the web search among community servers, thus providing scalability and sustainability to community-driven plagiarism detection. It outlines roles for community members depending on the resources they are willing to contribute to the service.
Language technologies for enhancement of teaching and learning in writing BIBAFull-Text 1097-1102
  Tak Pang Lau; Shuai Wang; Yuanyuan Man; Chi Fai Yuen; Irwin King
Writing is a vital issue for education as well as a fundamental skill in teaching and learning. With the development of information technologies, more and more professional writing tools emerge. As each of them mostly concentrates on addressing a specific issue, people need a one-stop platform that integrates multiple functions. In addition, given the concept of an e-learning ecosystem for future education, a comprehensive platform will be more promising. Therefore, we introduce VeriGuide Platform, which provides a professional writing toolbox to promote the enhancement of teaching and learning in writing. It contains six vertical components, which could be split into two groups. The first group, Editing Assistance, helps students write papers and points out grammar and spelling errors. The second group, Text Analysis, offers document analysis results, which enable students to achieve further writing improvement with explanatory feedback. Furthermore, we could do education data analytics to enhance the efficiency of teaching and learning. Specifically, the Editing Assistance contains well-organized writing and formatting, and grammar and spelling checking, while readability assessment, similarity detection, citation analysis, and sentiment analysis are included in the Text Analysis.
Collective copyright: enabling the natural evolution of content creation in the web era BIBAFull-Text 1103-1108
  Emanuele Lunadei; Christian Valdivia Torres; Erik Cambria
The way people create and share content has radically changed in the past few years thanks to the advent of social networks, web communities, blogs, wikis, and other online collaborative media. Such online social data are continuously growing in a way that makes it difficult to efficiently aggregate them, since they are the expression of a multitude of single content creators that most of the time show only a small percentage of originality. The act of 'sharing' is still tied to a pre-Internet fashion that sees it as a step following (and never preceding) content creation, as enforced by the rules of publishing and copyright. In the Internet era, the pieces of the puzzle of a valuable work might be scattered throughout the whole Web. In order to hinder the obsolete create-then-share trend that is killing creativity and usefulness of the Web, we propose a novel concept of copyright, which allows content to be shared while being created, so that it can gain increasing value as it becomes part of an increasingly richer puzzle.

WebQuality 2014 workshop

Identifying fraudulently promoted online videos BIBAFull-Text 1111-1116
  Vlad Bulakh; Christopher W. Dunn; Minaxi Gupta
Fraudulent product promotion online, including online videos, is on the rise. In order to understand and defend against this ill, we engage in the fraudulent video economy for a popular video sharing website, YouTube, and collect a sample of over 3,300 fraudulently promoted videos and 500 bot profiles that promote them. We then characterize fraudulent videos and profiles and train supervised machine learning classifiers that can successfully differentiate fraudulent videos and profiles from legitimate ones.
Incredible: is (almost) all web content trustworthy? analysis of psychological factors related to website credibility evaluation BIBAFull-Text 1117-1122
  Maria Rafalak; Katarzyna Abramczuk; Adam Wierzbicki
This paper describes the results of a study conducted in February 2013 on Amazon Mechanical Turk aimed at identifying various determinants of credibility evaluations. 2046 adult participants evaluated credibility of websites with diversified trustworthiness reference index. We concentrated on psychological factors that lead to the characteristic positive bias observed in many working social feedback systems on the Internet. We have used the International Personality Item Pool (IPIP) and measured the following traits: trust, conformity, risk taking, need for cognition and intellect. Results suggest that trustworthiness and risk taking are factors clearly differentiating people with respect to their tendency to overestimate, underestimate, or judge accordingly websites' credibility. Intuitively, people characterized by high general trust tend to be more generous in their credibility evaluations. On the other hand, people who are more willing to take risks tend to be more critical of the Internet content. The latter indicates that high credibility evaluations are being treated as a default option, and lower ratings require special conditions. Other, more detailed psychological patterns related to websites' credibility evaluations are described in the full paper.
Quality evaluation of social tags according to web resource types BIBAFull-Text 1123-1128
  Lei Li; Chengzhi Zhang
In the social tagging system, users annotate different web resources according to their need of future information organization and retrieval, and users also annotate resources with different types of tags, such as objective tag, subjective tag, self-organized tag and so on. Because every web resource has its own characteristics, the tag types of each web resource are different. According to the web resource, the quality of each tag type is different. We should depend on resource types to evaluate the quality of tag types, in order to provide efficient tag recommendation service and design better user tagging interfaces. In this paper, we firstly selected five web resources, namely the blog, book, image, music and video, to explore the tag types when annotating different resources. Then we chose specific resource and tags to explore the quality of each tag type according to these five web resources and study the relationship between tag type and quality. The conclusion is that the quality of tag types for different web resources is different.
Learning conflict resolution strategies for cross-language Wikipedia data fusion BIBAFull-Text 1129-1134
  Volha Bryl; Christian Bizer
In order to efficiently use the ever growing amounts of structured data on the web, methods and tools for quality-aware data integration should be devised. In this paper we propose an approach to automatically learn the conflict resolution strategies, which is a crucial step in large-scale data integration. The approach is implemented as an extension of the Sieve data quality assessment and fusion framework. We apply and evaluate our approach on the use case of fusing data from 10 language editions of DBpedia, a large-scale structured knowledge base extracted from Wikipedia. We also propose a method for extracting rich provenance metadata for each DBpedia fact, which is later used in data fusion.
Predicting webpage credibility using linguistic features BIBAFull-Text 1135-1140
  Aleksander Wawer; Radoslaw Nielek; Adam Wierzbicki
The article focuses on predicting trustworthiness from textual content of webpages. The recent work of Olteanu et al. proposes a number of features (linguistic and social) to apply machine learning methods to recognize trust levels. We demonstrate that this approach can be substantially improved in two ways: by applying machine learning methods to vectors computed using psychosocial and psycholinguistic features, and in a high-dimensional bag-of-words paradigm of word occurrences. Following Olteanu et al., we test the methods in two classification settings, as a 2-class and 3-class scenario, and in a regression setting. In the 3-class scenario, the features compiled by Olteanu et al. achieve weighted precision of 0.63, while the methods proposed in our paper raise it to 0.66 and 0.70. We also examine coefficients of the models in order to discover words associated with low and high trust.

Big graph mining 2014 workshop

Active learning with partially featured data BIBAFull-Text 1143-1148
  Seungwhan Moon; Calvin McCarter; Yu-Hsin Kuo
In this paper, we propose a new active learning algorithm in which the learner chooses the samples to be queried from the unlabeled data points whose attributes are only partially observed. In addition, we propose a cost-driven decision framework where the learner chooses to query either the labels or the missing attributes. This problem statement addresses a common constraint when building large datasets and applying active learning techniques on them, where some of the attributes (including the labels) are significantly harder or more costly to acquire per data point. We take a novel approach to this problem, first by building an imputation model that maps from the partially featured data to the fully featured dimension, and then performing active learning on the projected input space combined with the estimated confidence of inference. We discuss that our approach is flexible and can work with graph mining tasks as well as conventional semi-supervised learning problems. The results suggest that the proposed algorithm facilitates more cost-efficient annotation than the baselines.
From graphs to tables: the design of scalable systems for graph analytics BIBAFull-Text 1149-1150
  Joseph E. Gonzalez
From social networks to language modeling, the growing scale and importance of graph data has driven the development of new graph-parallel systems. In this talk, I will review the graph-parallel abstraction and describe how it can be used to express important machine learning and graph analytics algorithms like PageRank and Latent factor models. I will present how systems like GraphLab and Pregel exploit restrictions in the graph-parallel abstraction along with advances in distributed graph representation to efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. Unfortunately, the same restrictions that enable graph-parallel systems to achieve substantial performance gains also limit their ability to express many of the important stages in a typical graph-analytics pipeline. As a consequence, existing approaches to graph-analytics typically compose multiple systems through brittle and costly file interfaces. To fill the need for a holistic approach to graph-analytics we introduce GraphX, which unifies graph-parallel and data-parallel computation under a single API and system. I will show how a simple set of data-parallel operators can be used to express graph-parallel computation and how, by applying a collection of query optimizations derived from our work on graph-parallel systems, we can execute entire graph-analytics pipelines efficiently in a more general data-parallel distributed fault-tolerant system achieving performance comparable to specialized state-of-the-art systems.
Detecting community structure for undirected big graphs based on random walks BIBAFull-Text 1151-1156
  Xiaoming Liu; Yadong Zhou; Chengchen Hu; Xiaohong Guan; Junyuan Leng
Community detection is a common problem in various types of big graphs. It is meaningful to understand the functions and dynamics of networks. The challenges of detecting community for big graphs include high computational cost, no prior information, etc. In this work, we analyze the process of random walking in graphs, and find out that the weight of an edge obtained by processing the vertices visited by the walker could be an indicator to measure the closeness of vertex connection. Based on this idea, we propose a community detection algorithm for undirected big graphs which consists of three steps: random walking using a single walker, weight calculating for edges and community detecting. Our algorithm runs in O(n²) time without prior information. Experimental results show that our algorithm is capable of effectively detecting the community structure and the overlapping parts of real-world graphs, and handling the challenges of community detection in big graph era.
A fast approximation for influence maximization in large social networks BIBAFull-Text 1157-1162
  Jong-Ryul Lee; Chin-Wan Chung
This paper presents a new efficient approximation algorithm for influence maximization, a problem that was introduced to maximize the benefit of viral marketing. For efficiency, we devise two ways of exploiting the 2-hop influence spread, which is the influence spread on nodes within 2 hops of nodes in a seed set. Firstly, we propose a new greedy method for the influence maximization problem using the 2-hop influence spread. Secondly, to speed up the new greedy method, we devise an effective way of removing unnecessary nodes for influence maximization based on optimal seed's local influence heuristics. In our experiments, we evaluate our method with real-life datasets, and compare it with recent existing methods. From experimental results, the proposed method is at least an order of magnitude faster than the existing methods in all cases while achieving similar accuracy.
Processing scientific mesh queries in graph databases BIBAFull-Text 1163-1168
  Alireza Rezaei Mahdiraji; Peter Baumann
In this work-in-progress paper, we model scientific meshes as a multi-graph in the Neo4j graph database using the graph property model. We conduct experiments to measure the performance of the graph database solution in processing mesh queries and compare it with the GrAL mesh library and the PostgreSQL database on synthetic and real mesh datasets. The experiments show that the databases outperform the mesh library. However, each of the databases performs better on a specific query type, i.e., the graph database shows the best performance on global path-intensive queries and the relational database on local and field queries. Based on the experiments, we propose a mediator architecture for processing mesh queries by using the three database systems.

2014 big scholarly data: towards the web of scholars workshop

A visual workflow to explore the web of data for scholars BIBAFull-Text 1171-1176
  Anastasia Dimou; Laurens De Vocht; Mathias Van Compernolle; Erik Mannens; Peter Mechant; Rik Van de Walle
As the Web evolves into an integrated and interlinked knowledge space thanks to the growing amount of published Linked Open Data, the need to find solutions that enable scholars to discover, explore and analyse the underlying research data emerges. Scholars, typically non-expert technology users, lack in-depth understanding of the underlying semantic technology, which limits their ability to interpret and query the data. We present a visual workflow to connect scholars and scientific resources on the Web of Data. We allow scholars to move from exploratory analysis in academic social networks to exposing relations between these resources. We allow them to reveal experts in a particular field and discover relations in and beyond their research communities. This paper aims to evaluate the potential of such a visual workflow to be used by non-expert users to interact with the semantically enriched data and familiarize themselves with the underlying dataset.
Modeling collaboration in academia: a game theoretic approach BIBAFull-Text 1177-1182
  Qiang Ma; S. Muthukrishnan; Brian Thompson; Graham Cormode
In this work, we aim to understand the mechanisms driving academic collaboration. We begin by building a model for how researchers split their effort between multiple papers, and how collaboration affects the number of citations a paper receives, supported by observations from a large real-world publication and citation dataset, which we call the h-Reinvestment model. Using tools from the field of Game Theory, we study researchers' collaborative behavior over time under this model, with the premise that each researcher wants to maximize his or her academic success. We find analytically that there is a strong incentive to collaborate rather than work in isolation, and that studying collaborative behavior through a game-theoretic lens is a promising approach to help us better understand the nature and dynamics of academic collaboration.
Can web presence predict academic performance?: the case of Eötvös university BIBAFull-Text 1183-1188
  László Gulyás; Zsolt Jurányi; Sándor Soós; George Kampis
This paper reports the preliminary results of a project that aims at incorporating the analysis of the web presence (content) of research institutions into the scientometric analysis of these research institutions. The problem is to understand and predict the dynamics of academic activity and resource allocation using web presence. The present paper approaches this problem in two parts. First we develop a crawler and an archive of the web contents obtained from academic institutions, and present an early analysis of the records. Second, we use (currently off-line) records to analyze the dynamics of resource allocation. Combination of the two parts is an ambition of ongoing work. The motivation in this study is twofold. First, we strongly believe that independent archiving, indexing and searching of (past) web content is an important task, even with regard to academic web presence. We are particularly interested in studying the dynamics of the "online scientific discourse", based on the assumption that the changing traces of web presence are an important factor that documents the intensity of activity. Second, we maintain that the trend-analysis of scientific activity represents a hitherto unused potential. We illustrate this by a pilot where, using 'offline' longitudinal datasets, we study whether past (i.e. cumulative) success can predict current (and future) activity in academia. Or, in short: do institutions invest and publish in areas where they have been successful? The answer to this question is, we believe, important to understanding and predicting research policies and their changes.
Trust and hybrid reasoning for ontological knowledge bases BIBAFull-Text 1189-1194
  Hui Shi; Kurt Maly; Steven Zeil
Projects such as Libra and Cimple have built systems to capture knowledge in a research community and to respond to semantic queries. However, they lack the support for a knowledge base that can evolve over time while responding to queries requiring reasoning. We consider a semantic web that covers linked data about science research that are being harvested from the Web and are supplemented and edited by community members. We use ontologies to incorporate semantics to detect conflicts and resolve inconsistencies, and to infer new relations or proof statements with a reasoning engine. We consider a semantic web subject to changes in the knowledge base, the underlying ontology or the rule set that governs the reasoning. In this paper we explore the idea of trust where each change to the knowledge base is analyzed as to what subset of the knowledge base can still be trusted. We present algorithms that adapt the reasoner such that, when proving a goal, it does a simple retrieval when it encounters trusted items and backward chaining over untrusted items. We provide an evaluation of our proposed modifications that shows that our algorithm is conservative and that it provides significant gains in performance for certain queries.
Query complex graph patterns: tools and applications BIBAFull-Text 1195-1196
  Hanghang Tong
In his world-renowned book, Nobel laureate Herbert Simon pointed out that it is more the complexity of the environment, than the complexity of the individual persons, that determines the complex behavior of humans. The emergence of online social network sites and web 2.0 applications provides a new connected environment/context, where people generate, share and search massive human knowledge; and interact and collaborate with each other to collectively perform some complex tasks. In this talk, we focus on how to make sense of the collaboration data in the context of graphs/networks. To be specific, we will introduce a suite of tools for querying complex patterns from such graphs. Exemplar questions we aim to answer include (a) what makes a team more successful than others, and how to find the best replacement if one of its team members becomes unavailable? (b) how to find a group of authors from databases, data mining and bioinformatics who collaborate with each other in a star-shape? (c) given a set of querying authors of interest, how to find somebody who initiates the research field these querying authors belong to, and how to summarize and visualize the querying authors? (d) how to incorporate users' preference into these complex queries? We will also introduce the computational challenges behind these querying tools and how to remedy them.
An article level metric in the context of research community BIBAFull-Text 1197-1202
  Yu Liu; Zhen Huang; Jing Fang; Yizhou Yan
With the rapid increase of research papers, article-level metrics are of growing importance for helping researchers select papers. Classical metrics have a significant drawback of using just a single factor, which limits the effectiveness of assessing papers in different periods after publication. Moreover, with the development of web 2.0, some new factors are introduced to assess papers. So, a novel article level metric in the context of research community (ALM_RC) is proposed. It integrates the impact of different factors comprehensively, because different factors have different time features and can complement each other in different periods after publication. In addition, as a research community is based on certain research directions, it is a relatively stable environment with related journals and scholars contributing their efforts to the development of this research field. So in the context of research community, it is consistent, practical and reasonable to calculate the impact of the journals and scholars under relatively fair criteria. Experimental results show the novel metric is effective and robust in assessing papers.
Aligning web collaboration tools with research data for scholars BIBAFull-Text 1203-1208
  Laurens De Vocht; Selver Softic; Erik Mannens; Martin Ebner; Rik Van de Walle
Resources for research are not always easy to explore, and rarely come with strong support for identifying, linking and selecting those that can be of interest to scholars. In this work we introduce a model that uses state-of-the-art semantic technologies to interlink structured research data and data from Web collaboration tools, social media and Linked Open Data. We use this model to build a platform that connects scholars, using their profiles as a starting point to explore novel and relevant content for their research. Scholars can easily adapt to evolving trends by synchronizing new social media accounts or collaboration tools and integrating them with new datasets. We evaluate our approach by a scenario of personalized exploration of research repositories where we analyze real world scholar profiles and compare them to a reference profile.
ACRec: a co-authorship based random walk model for academic collaboration recommendation BIBAFull-Text 1209-1214
  Jing Li; Feng Xia; Wei Wang; Zhen Chen; Nana Yaw Asabere; Huizhen Jiang
Recent academic procedures have depicted that work involving scientific research tends to be more prolific through collaboration and cooperation among researchers and research groups. On the other hand, discovering new collaborators who are smart enough to conduct joint-research work is accompanied by both difficulties and opportunities. One notable difficulty as well as opportunity is the big scholarly data. In this paper, we satisfy the demand of collaboration recommendation through co-authorship in an academic network. We propose a random walk model using three academic metrics as basics for recommending new collaborations. Each metric is studied through mutual paper co-authoring information and serves to compute the link importance such that a random walker is more likely to visit the valuable nodes. Our experiments on the DBLP dataset show that our approach can improve the precision, recall rate and coverage rate of recommendation, compared with other state-of-the-art approaches.
Relatedness measures between conferences in computer science: a preliminary study based on DBLP BIBAFull-Text 1215-1220
  Suhendry Effendy; Irvan Jahja; Roland H. C. Yap
A large percentage of the research in computer science is published in conferences and workshops. We propose three methods which compute a "relatedness score" for conferences relative to a pivot conference, usually a top rated conference. We experiment with the DBLP bibliography to show that our relatedness ranking can be used to help understand the basis of conference reputation ratings, determine what conferences are related to an area and the classification of conferences into areas.
Indicators and functionalities of exploitation of academic staff CV using semantic web technologies BIBAFull-Text 1221-1226
  Isaac Lera; Carlos Guerrero; Carlos Juiz
We have transformed five years of curriculum data of our academic staff from relational databases to a semantic model. Thanks to semantic queries, capabilities of NoSQL models, inference reasoners and data mining techniques we obtain knowledge that improves the personal management of curriculum data, the quality and efficiency of exploitation tasks, and the transparency, dissemination and collaboration with citizens. The huge catalogue of CV data remains an underutilized resource. Private companies such as publishers have robust services based only on publications but academic institutions have the option of integrating other databases related to their staff to obtain more indicators. We analyse the transformation of data, highlighting the mapping process of authors, and we present two ways of exploitation using semantic queries and complex networks. Thus, institutions, researchers and citizens will have a quality data catalogue for diverse studies.
People like us: mining scholarly data for comparable researchers BIBAFull-Text 1227-1232
  Graham Cormode; S. Muthukrishnan; Jinyun Yan
There are many situations in which one needs to find comparable others for a given researcher. Examples include finding peer reviewers, program committees for conferences, and comparable individuals asked for in recommendation letters for tenure evaluation. The task is often done on an ad hoc and informal basis. In this paper, we address an interesting problem that has not been adequately studied so far: mining cumulated large scale scholarly data to find comparable researchers.
   We propose a standard to quantify the quality of an individual's research output, through the quality of publishing venues. We represent a researcher as a sequence of her publication records, and develop methods to compute the distance between two researchers through sequence matching. Multiple variations of distances are considered to target different scenarios. We define comparable relation by the distance, conduct experiments on a large corpus, and demonstrate the effectiveness of our methods through examples. At the end of the paper, we identify several promising directions for further work.

Connecting online & offline life workshop (COOL 2014)

Road traffic prediction by incorporating online information BIBAFull-Text 1235-1240
  Tian Zhou; Lixin Gao; Daiheng Ni
Road traffic conditions are typically affected by events such as extreme weather or sport games. With the advance of the Web, events and weather conditions can be readily retrieved in real-time. In this paper, we propose a traffic condition prediction system incorporating both online and offline information. An RFID-based system has been deployed for monitoring road traffic. By incorporating data from both the road traffic monitoring system and online information, we propose a hierarchical Bayesian network to predict road traffic condition. Using historical data, we establish a hierarchical Bayesian network to characterize the relationships among events and road traffic conditions. To evaluate the model, we use the traffic data collected in Western Massachusetts as well as online information about events and weather. Our proposed prediction achieves an accuracy of 93% overall.
Evolutionary habits on the web BIBAFull-Text 1241-1242
  Daniele Quercia
For the last few years, my colleagues and I have been exploring the complex relationship between our offline and online worlds. This talk will show that, as online platforms become mature, the social behavior we have evolved over thousands of years is reflected in our actions on the web as well. It turns out that, in the context of social influence, finding the "(special) many" (of those who are able to spot trends early on) is more important than trying to find the "special few" [10]; that people with different personality traits take on different roles on both Twitter and Facebook [5,6]; that language, with its vocabulary and prescribed ways of communicating, is a symbolic resource that can be used on its own to influence others [4]; and that a Facebook relationship is more likely to break if it is not embedded in the same social circle, if it is between two people whose ages differ, and if one of the two is neurotic or introvert [3]. Interestingly, we also found that a relationship with a common female friend is more robust than that with a common male friend. More recently, we have also explored the relationship between offline and online worlds in the urban context. We have considered hypotheses put forward in the 1970s urban sociology literature [1,2] and, for the first time, we have been able to test them at scale. We have done so by building two crowdsourcing web games: one crowdsources Londoners' mental images of the city [8], and the other crowdsources the discovery of the urban elements that make people happy [7]. We have found that, as opposed to well-to-do areas, those suffering from social problems are rarely present in residents' mental maps of the city, and they tend to be characterized more by cars and fortress-like buildings than by greenery. This talk will conclude by showing how combining both web games with Flickr offers interesting applications for discovering emotionally-pleasant routes [9] and for ranking city pictures [11].
Can social media help us reason about mental health? BIBAFull-Text 1243-1244
  Munmun De Choudhury
Millions of people each year suffer from depression, which makes mental illness one of the most serious and widespread health challenges in our society today. There is therefore a need for effective policies, interventions, and prevention strategies that enable early detection and diagnosis of mental health concerns in populations. This talk reports some findings on the potential of leveraging social media postings as a new type of lens in understanding mental illness in individuals and populations. Information gleaned from social media bears potential to complement traditional survey techniques in its ability to provide finer grained measurements of behavior over time while radically expanding population sample sizes. The talk highlights how this research direction may be useful in developing tools for identifying the onset of depressive disorders, for use by healthcare agencies; or on behalf of individuals, enabling those suffering from mental illness to be more proactive about their mental health.
Understanding toxic behavior in online games BIBAFull-Text 1245-1246
  Haewoon Kwak
With the remarkable advances from isolated console games to massively multi-player online role-playing games, the online gaming world provides yet another place where people interact with each other. Online games have attracted attention from researchers, because i) the purpose of actions is relatively clear, and ii) actions are quantifiable. A wide range of predefined actions for supporting social interaction (e.g., friendship, communication, trade, enmity, aggression, and punishment) reflects either positive or negative connotations among game players, and is unobtrusively recorded by the game servers. These rich electronic footprints have become invaluable assets for the research of social dynamics.
   In particular, exploring negative behavior in online games is a key research direction because it directly influences gaming experience and user satisfaction. Even a few negative players can impact many others because of the design of multi-player games. For this reason these players are called toxic. The definition of toxic play is not cut and dry. Even if someone follows the game rules, he could be considered toxic. For example, killing one player repetitively is often deemed toxic behavior, although it does not break game rules at all. The vagueness of toxicity makes it hard to understand, detect, and prevent it.
   League of Legends (LoL), created by Riot Games with 70 million users as of 2012, offers a new way to understand toxic behavior. Riot Games develops a crowdsourcing framework, the Tribunal, to judge whether reported toxic behavior should be punished or not. Volunteered players review user reports and vote for either pardon or punishment. As of March 2013, 105 million votes had been collected in North America and Europe.
   We explore toxic playing and reaction based on large-scale data from the Tribunal [1]. We collect and investigate over 10 million user reports on 1.46 million toxic players and corresponding crowdsourced decisions made in the Tribunal. We crawl data from three different regions, North America, Western Europe, and Korea, to take regional differences of user behavior into account. To obtain a comprehensive view of toxic playing and reaction based on huge data collection, we answer the following research questions in a bottom-up approach: how individuals react to toxic players, how teams interact with toxic players, how general toxic or non-toxic players behave across the match, and how crowds make a decision on toxic players. We find large-scale empirical support for some theories that are notoriously difficult to test in the wild, namely the bystander effect, ingroup favoritism, the black sheep effect, cohesion-performance relationships, and attribution theory. We also discover that regional differences affect the likelihood of being reported and the proportion of being punished of toxic players in the Tribunal.
   We then propose a supervised learning approach for predicting crowdsourced decisions on toxic behavior with large-scale labeled data collections[2]. Using the same sparse information available to the reviewers, we trained classifiers to detect the presence, and severity of toxicity. We built several models oriented around in-game performance, reports by victims of toxic behavior, and linguistic features of chat messages. We found that training with high agreement decisions resulted in more accuracy on low agreement decisions and that our classifier was adept in detecting clear cut innocence. Finally, we showed that our classifier is relatively robust across cultural regions; our classifier built from a North American dataset performed adequately on a European dataset.
   Ultimately, our work can be used as a foundation for the further study of toxic behavior.
Do you know the speaker?: an online experiment with authority messages on event websites BIBAFull-Text 1247-1252
  Kwan Hui Lim; Binyan Jiang; Ee-Peng Lim; Palakorn Achananuparp
With the widespread adoption of the Web, many companies and organizations have established websites that provide information and support online transactions (e.g., buying products or viewing content). Unfortunately, users have limited attention to spare for interacting with online sites. Hence, it is of utmost importance to design sites that attract user attention and effectively guide users to the product or content items they like. Thus, we propose a novel and scalable experimentation approach to evaluate the effectiveness of online site designs. Our case study focuses on the effects of an authority message on visitors' browsing behavior on workshop and seminar online announcement sites. An authority message emphasizes a particular prominent speaker and his/her achievements. By dividing users into control and treatment groups and carefully tracking their online activities, we observe that the authority message influences the way users interact with page elements on the website and increases their interest in the authority speakers.
A behavior observation tool (BOT) for mobile device network connection logs BIBAFull-Text 1253-1258
  Ting Wang
With the advances in sensor, satellite and mobile communication technologies in recent decades, locational data has become widely available. Much work has been done to find useful information in these data, and various approaches have been proposed. In this work, we aim to use one specific type of locational data, network connection logs of mobile devices, which are widely available and easily accessible to telecom companies, to identify and extract the active areas of users. This is a challenging topic because this kind of data contains inaccurate locations and fluctuating log time intervals. In order to observe user behavior from this kind of data set, we propose a new algorithm, namely Behavior Observation Tool (BOT), which uses the Convex Hull Algorithm with sliding time windows to model the user's movement, so that knowledge about the user's lifestyle and habits can be extracted from the mobile device network logs.
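The idea of combining a convex hull with sliding time windows can be illustrated with a minimal sketch. It is not the BOT algorithm itself, whose details are not given in the abstract; the log format `(timestamp, x, y)`, the function names, and the `window` and `step` parameters are all assumptions made for illustration. The hull is computed with Andrew's monotone chain method.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hulls_by_window(log, window, step):
    """Hypothetical helper: one hull per time window [t, t + window).

    `log` is a list of (timestamp, x, y) records (an assumed format).
    """
    if not log:
        return []
    log = sorted(log)
    t, end = log[0][0], log[-1][0]
    result = []
    while t <= end:
        pts = [(x, y) for (ts, x, y) in log if t <= ts < t + window]
        result.append((t, convex_hull(pts)))
        t += step
    return result
```

Each window's hull approximates the area the device covered during that interval; an empty hull means no log entries fell inside the window.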
The social, economic and sexual networks of prostitution BIBFull-Text 1259-1260
  Petter Holme
Inferring offline hierarchical ties from online social networks BIBAFull-Text 1261-1266
  Mohammad Jaber; Peter T. Wood; Panagiotis Papapetrou; Sven Helmer
Social networks can represent many different types of relationships between actors, some explicit and some implicit. For example, email communications between users may be represented explicitly in a network, while managerial relationships may not. In this paper we focus on analyzing explicit interactions among actors in order to detect hierarchical social relationships that may be implicit. We start by employing three well-known ranking-based methods, PageRank, Degree Centrality, and Rooted-PageRank (RPR), to infer such implicit relationships from interactions between actors. Then we propose two novel approaches which take into account the time dimension of interactions in the process of detecting hierarchical ties. We experiment on two datasets: the Enron email dataset, to infer manager-subordinate relationships from email exchanges, and a scientific publication co-authorship dataset, to detect PhD advisor-advisee relationships from paper co-authorships. Our experiments show that time-based methods perform considerably better than ranking-based methods. In the Enron dataset, they detect 48% of manager-subordinate ties versus 32% found by Rooted-PageRank. Similarly, in the co-authorship dataset, they detect 62% of advisor-advisee ties compared to only 39% by Rooted-PageRank.
Connecting dream networks across cultures BIBAFull-Text 1267-1272
  Onur Varol; Filippo Menczer
Many species dream, yet there remain many open research questions in the study of dreams. The symbolism of dreams and their interpretation is present in cultures throughout history. Analysis of online data sources for dream interpretation using network science leads to understanding symbolism in dreams and their associated meaning. In this study, we introduce dream interpretation networks for English, Chinese and Arabic that represent different cultures from various parts of the world. We analyze communities in these networks, finding that symbols within a community are semantically related. The central nodes in communities give insight about cultures and symbols in dreams. The community structure of different networks highlights cultural similarities and differences. Interconnections between different networks are also identified by translating symbols from different languages into English. Structural correlations across networks point out relationships between cultures. Similarities between network communities are also investigated by analysis of sentiment in symbol interpretations. We find that interpretations within a community tend to have similar sentiment. Furthermore, we cluster communities based on their sentiment, yielding three main categories of positive, negative, and neutral dream symbols.

Data extraction and object search 2014 workshop

Evaluation of information extraction techniques to label extracted data from e-commerce web pages BIBAFull-Text 1275-1278
  Neil Anderson; Jun Hong
Automatically determining and assigning shared and meaningful text labels to data extracted from an e-Commerce web page is a challenging problem. An e-Commerce web page can display a list of data records, each of which can contain a combination of data items (e.g. product name and price) and explicit labels, which describe some of these data items. Recent advances in extraction techniques have made it much easier to precisely extract individual data items and labels from a web page, however, there are two open problems: 1. assigning an explicit label to a data item, and 2. determining labels for the remaining data items. Furthermore, improvements in the availability and coverage of vocabularies, especially in the context of e-Commerce web sites, means that we now have access to a bank of relevant, meaningful and shared labels which can be assigned to extracted data items.
   However, there is a need for a technique which will take as input a set of extracted data items and automatically assign to them the most relevant and meaningful labels from a shared vocabulary. We observe that the Information Extraction (IE) community has developed a great number of techniques which solve problems similar to our own. In this work-in-progress paper we state our intention to theoretically and experimentally evaluate different IE techniques to ascertain which is most suitable to solve this problem.
An analysis of duplicate on web extracted objects BIBAFull-Text 1279-1284
  Stefano Ortona
Today the web has become the largest available source of information. The automatic extraction of structured data from the web is a challenging problem that has been widely investigated. However, after the extraction process, the problem of identifying duplicates among the extracted web records must be solved in order to present clean data to the final user. This problem, also known as record linkage or record matching, has been of central interest for the database community; however, only a few works have addressed this problem in the web context. In this paper we present web object matching, the problem of identifying duplicates among records extracted from the web.
   We will show that in the web scenario we need to face all the problems of a classic record linkage setting plus the uncertainty introduced by the web. Indeed, the records are the output of an extraction system that, unlike conventional databases or APIs, introduces semantic errors that are not due to a problem in the source. Most of the previous approaches rely on the assumption that the records to match contain correct information that can be used to identify duplicates. In this work we give an overview of an approach that performs a validation step before the actual identification of duplicates, in order to check whether the information in a record can be trusted or not. We present an approach that works without any human supervision or training data and that deals with the problem not only in a record-by-record fashion (as other approaches do), but also in a source-by-source fashion, which allows detecting and possibly correcting systematic errors for an entire source. The only human effort required is the creation of a small amount of knowledge about the domain of interest through a set of ontology constraints and an entity extraction system.
Iterative algorithm for inferring entity types from enumerative descriptions BIBAFull-Text 1285-1290
  Qian Chen; Mizuho Iwaihara
Entity type matching has many real world applications, especially in entity clustering, de-duplication and efficient query processing. Current methods to extract entities from text usually disregard regularities in the order of entities appearing in the text. In this paper, we focus on enumerative descriptions which list entity names in a certain hidden order, often occurring in web documents as listings and tables. We propose an algorithm to discover entity types from enumerative descriptions, where a type hierarchy is known but enumerating orders are hidden and heterogeneous, and partial entity-type mappings are given as seed instances. Our algorithm is iterative: we extract skeletons from syntactic patterns, then train a hidden Markov model to find an optimum enumerating order from seed instances and skeletons, in order to find a best-fit entity-type assignment.
Query interfaces understanding by statistical parsing BIBAFull-Text 1291-1294
  Weifeng Su; Yafei Li; Frederick H. Lochovsky
Users submit queries to an online database via its query interface. Query interface parsing, which is important for many applications, understands the query capabilities of a query interface. Since most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), to automatically extract the hierarchical query capabilities of query interfaces. StatParser automatically learns from a set of parsed query interfaces and parses new query interfaces. StatParser starts from a small grammar and enhances the grammar with a set of probabilities learned from parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced-grammar identifies the parse tree with the largest global probability to be the query capabilities of the query interface. Experimental results show that StatParser very accurately extracts the query capabilities and can effectively overcome the problems of existing query interface parsers.
Extraction and integration of web sources with humans and domain knowledge BIBAFull-Text 1295-1298
  Disheng Qiu; Lorenzo Luce
The extraction and integration of data from many web sources in different domains is an open issue. Two promising solutions take on this challenge: top-down approaches rely on domain knowledge that is manually crafted by an expert to guide the process, and bottom-up approaches try to infer the schema from many web sources to make sense of the extracted data. The first solutions scale over the number of web sources, but for settings with different domains, an expert has to manually craft an ontology for each domain. The second solutions do not require a domain expert, but high quality is achieved only with a lot of human interaction in both the extraction and integration steps. We introduce a framework that takes the best from both approaches. The framework addresses both extraction and integration of data from web sources synergistically. No domain expert is required; the framework exploits data from a seed knowledge base to enhance the automatic extraction and integration (top down). Human workers from crowdsourcing platforms are engaged to improve the quality and the coverage of the extracted data. The framework adopts techniques to automatically extract both the schema and the data from multiple web sources (bottom up). The extracted information is then used to bootstrap the seed knowledge base, thereby reducing the human effort for future tasks.
Integrating product data from websites offering microdata markup BIBAFull-Text 1299-1304
  Petar Petrovski; Volha Bryl; Christian Bizer
Large numbers of websites have started to mark up their content using standards such as Microdata, Microformats, and RDFa. The marked-up content elements comprise descriptions of people, organizations, places, events, products, ratings, and reviews. This development has accelerated in recent years as major search engines such as Google, Bing and Yahoo! use the markup to improve their search results. Embedding semantic markup facilitates identifying content elements on webpages. However, the markup is mostly not as fine-grained as desirable for applications that aim to integrate data from large numbers of websites. This paper discusses the challenges that arise in the task of integrating descriptions of electronic products from several thousand e-shops that offer Microdata markup. We present a solution for each step of the data integration process including Microdata extraction, product classification, product feature extraction, identity resolution, and data fusion. We evaluate our processing pipeline using 1.9 million product offers from 9240 e-shops which we extracted from the Common Crawl 2012, a large public Web corpus.
Entity linking with a unified semantic representation BIBAFull-Text 1305-1310
  Zhaochen Guo; Denilson Barbosa
Entity Linking (EL) consists in linking mentions in a document to their referent entities in a Knowledge Base. Current approaches fall into two main categories: local approaches, in which mentions are linked independently of each other, and global approaches, in which all mentions are linked collectively. Local approaches often ignore the semantic relatedness of entities, and while global approaches incorporate the semantic relatedness, they tend to focus only on directly connected entities, ignoring indirect connections which might be useful. We present a global EL approach that unifies the representation of the semantics of entities and documents -- the probability distribution of entities being visited during a random walk on an entity graph -- that accounts for direct and indirect connections. An experimental evaluation shows that our method outperforms five state-of-the-art EL systems and two very strong baselines.

2014 large scale network analysis workshop (LSNA'14)

Online analysis of information diffusion in Twitter BIBAFull-Text 1313-1318
  Io Taxidou; Peter M. Fischer
The advent of social media has facilitated the study of information diffusion, user interaction and user influence over social networks. The research on analyzing information spreading focuses mostly on modeling, while analyses of real-life data have been limited to small, carefully cleaned datasets that are analyzed in an offline fashion. In this paper, we present an approach for online analysis of information diffusion in Twitter. From the stream of messages and the social graph, we reconstruct so-called information cascades, which model how information is propagated from user to user. The results show that such an inference is feasible even on noisy, large-scale, rapidly produced data. We provide insights into the impact of incomplete data and the effect of different influence models on the cascades. The observed cascades show a significant amount of variety in scale and structure.
The multi agent based information diffusion model for false rumor diffusion analysis BIBAFull-Text 1319-1320
  Satoshi Kurihara
Twitter is a well-known social networking service that has received much attention recently. The number of Twitter users has increased rapidly, and many users exchange information on it. When the Great East Japan Earthquake occurred on March 11, 2011, many people were able to obtain important information from social networking services. Although Twitter played an important role, the diffusion of false rumors was also pointed out as a problem. In this talk, I focus on the false rumor diffusion phenomenon and introduce our multi-agent information diffusion model based on the SIR model. I also discuss a methodology for more rapid diffusion of correction tweets.
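For readers unfamiliar with the SIR model mentioned above, the following is a minimal sketch of the classic deterministic, discrete-time compartment version. It is not the multi-agent model from the talk. Reading S as users who have not yet seen a rumor, I as users spreading it, and R as users who have stopped spreading it is an interpretation assumed here, and the function name and the `beta` and `gamma` parameters are illustrative.

```python
def simulate_sir(s0, i0, r0, beta, gamma, steps):
    """Discrete-time SIR model; returns a list of (S, I, R) tuples per step.

    beta  : assumed per-step transmission rate (0 <= beta <= 1)
    gamma : assumed per-step rate of leaving the spreading state (0 <= gamma <= 1)
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(steps):
        new_spreaders = beta * s * i / n  # S -> I transitions this step
        new_stopped = gamma * i           # I -> R transitions this step
        s -= new_spreaders
        i += new_spreaders - new_stopped
        r += new_stopped
        history.append((s, i, r))
    return history
```

In this reading, raising `gamma` plays the role of a faster correction process: spreaders leave the I state sooner, so the peak number of simultaneous spreaders is lower.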
Towards large-scale graph stream processing platform BIBAFull-Text 1321-1326
  Toyotaro Suzumura; Shunsuke Nishii; Masaru Ganse
In recent years, real-time data mining for large-scale time-evolving graphs has become a hot research topic. Most prior work targets relatively static graphs and processes them in a store-and-process batch model. In this paper we propose applying an on-the-fly, incremental graph stream computing model to such dynamic graph analysis. To process large-scale graph streams on a cluster of nodes dynamically in a scalable fashion, we propose an incremental large-scale graph processing model called "Incremental GIM-V (Generalized Iterative Matrix-Vector Multiplication)". We also design and implement UNICORN, a system that adopts the proposed incremental processing model on top of IBM InfoSphere Streams. Our performance evaluation demonstrates that our method achieves up to 48% speedup on PageRank with a Scale 16 log-normal graph (vertices=65,536, edges=8,364,525) on 4 nodes, and 3023% speedup on Random Walk with Restart with a Scale 18 Kronecker graph (vertices=262,144, edges=8,388,608) on 4 nodes, against the original GIM-V.
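As background, the original (batch, non-incremental) GIM-V primitive expresses graph algorithms as a matrix-vector multiplication with three user-supplied operations: combine2, combineAll, and assign. The sketch below shows that primitive instantiated for PageRank on a single machine. It does not reproduce the incremental variant or UNICORN, which the abstract does not specify; the function names are illustrative, and the code assumes every vertex has at least one out-link.

```python
def gimv_step(edges, v, combine2, combine_all, assign):
    """One GIM-V iteration: v'[i] = assign(v[i], combine_all({combine2(m_ij, v[j])}))."""
    n = len(v)
    partial = [[] for _ in range(n)]
    for i, j, m_ij in edges:  # (row, column, value) of the matrix
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) for i in range(n)]


def pagerank_gimv(out_links, damping=0.85, iterations=50):
    """PageRank via GIM-V; out_links[src] lists the targets of src.

    Assumes every vertex has at least one out-link (no dangling vertices).
    """
    n = len(out_links)
    # Column-normalized adjacency matrix: m[dst][src] = 1 / out_degree(src).
    edges = [(dst, src, 1.0 / len(dsts))
             for src, dsts in enumerate(out_links) for dst in dsts]
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = gimv_step(
            edges, v,
            combine2=lambda m, vj: damping * m * vj,
            combine_all=lambda xs: (1.0 - damping) / n + sum(xs),
            assign=lambda old, new: new,
        )
    return v
```

Other algorithms fit the same skeleton by swapping the three operations, which is what makes the primitive a natural target for an incremental execution model.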
Towards scalable X10 based link prediction for large scale social networks BIBAFull-Text 1327-1332
  Hidefumi Ogata; Toyotaro Suzumura
The use of social applications such as Twitter and Facebook has become popular in recent years. In particular, the number of Twitter users has increased rapidly since 2009, as it is a place where users can tweet anything in 140 characters. In the area of social network analysis, the Twitter user network is frequently analyzed. Haewoon et al. [1] analyzed the Twitter user network from various points of view in 2009 and showed that it has some features that differ from those of conventional social networks. Bongwon et al. also collected 74 million tweets in 2010 and investigated the influence of retweets on the diffusion of information. Such analyses not only reveal the unique characteristics of the Twitter user network but also enable networking services such as finding users who are similar to a given user or recommending commodities using tweet information. Some analyses, such as clustering, need the data of the entire network. However, since social networks grow day by day, it becomes impossible to obtain the entire network by crawling. One solution to this problem is the network analysis technique called link prediction, which predicts the true network from a given part of it. With link prediction, we can recover the entire network from the network data already obtained and apply analyses such as clustering to the predicted network, which may give an approximate result of the analysis for the entire network. In our research, we implemented a link prediction algorithm named Link Propagation in X10, a parallel programming language, and evaluated its scalability and precision with Twitter user network data.
A novel link prediction approach for scale-free networks BIBAFull-Text 1333-1338
  Chungmok Lee; Minh Pham; Norman Kim; Myong K. Jeong; Dennis K. J. Lin; Wanpracha Art Chavalitwongse
The link prediction problem is to predict, based on past observed networks, whether a link exists between each node pair in the network; it arises in many practical applications such as recommender systems, information retrieval, and the marketing analysis of social networks. Here, we propose a new mathematical programming approach for predicting a future network utilizing the node degree distribution identified from historical observation of the past networks. We develop an integer programming problem for the link prediction problem, where the objective is to maximize the sum of link scores (probabilities) while respecting the node degree distribution of the networks. The performance of the proposed framework is tested on real-life Facebook networks. The computational results show that the proposed approach can considerably improve the performance of previously published link prediction methods.
Pruned labeling algorithms: fast, exact, dynamic, simple and general indexing scheme for shortest-path queries BIBAFull-Text 1339-1340
  Takuya Akiba
Shortest paths and distances are two of the most fundamental notions for pairs of nodes on a network, and thus they play an important role in a wide range of applications such as network analysis and network-aware search. In this talk, I will introduce our indexing method for efficiently answering shortest-path queries, referred to as pruned landmark labeling (SIGMOD'13). In spite of its simplicity, it significantly outperforms previous indexing methods in both scalability and query time. Moreover, interestingly, it turned out that the algorithm automatically exploits the common structures of real networks. We also briefly mention its variants: pruned path labeling (CIKM'13), pruned highway labeling (ALENEX'14) and historical pruned landmark labeling (WWW'14).
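The core idea of pruned landmark labeling can be sketched for an unweighted, undirected graph as follows. This is a simplified illustration rather than the optimized implementation from the cited papers: the function names, the adjacency-list input, and the degree-based vertex ordering are choices made for this sketch. A breadth-first search is run from each vertex in turn, and a branch is pruned whenever the labels built so far already give a distance that is at least as short.

```python
from collections import deque

INF = float("inf")


def query(labels, s, t):
    """Distance estimate from labels: min over common hubs h of d(s, h) + d(h, t)."""
    a, b = labels[s], labels[t]
    if len(a) > len(b):
        a, b = b, a
    return min((d + b[h] for h, d in a.items() if h in b), default=INF)


def build_labels(adj):
    """Pruned BFS from every vertex, highest degree first.

    adj[v] lists the neighbors of v; labels[u] maps hub -> distance to u.
    """
    n = len(adj)
    order = sorted(range(n), key=lambda v: -len(adj[v]))
    labels = [dict() for _ in range(n)]
    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: existing labels already cover the pair (root, u).
            if query(labels, root, u) <= d:
                continue
            labels[u][root] = d
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels
```

After `labels = build_labels(adj)`, a call to `query(labels, s, t)` returns the exact shortest-path distance, or infinity when `s` and `t` are not connected.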