[1]
Highly Successful Projects Inhibit Coordination on Crowdfunding Sites
Backstage of Crowdsourcing Legitimacy, Performance and Crowd Support
/
Solomon, Jacob
/
Ma, Wenjuan
/
Wash, Rick
Proceedings of the ACM CHI'16 Conference on Human Factors in Computing
Systems
2016-05-07
v.1
p.4568-4572
© Copyright 2016 ACM
Summary: Donors on crowdfunding sites must coordinate their actions to identify and
collectively fund projects prior to their deadline. Some projects receive vast
support immediately upon launch. Other seemingly worthwhile projects have more
modest success or no success at raising funds. We examine how the presence of
high-performing "superstar" projects on a crowdfunding site affects donors'
ability to coordinate their actions and fund other less popular but still
worthwhile projects on the site. In a lab experiment where users simulate the
dynamics of a crowdfunding site, we found that superstar projects reduce the
likelihood that other projects are funded by the crowd, even when the superstar
project has no opportunity to steal away donations from other projects. We
argue that this is due to superstar projects setting too high of a standard of
what a "fundable" project looks like, leading donors to underestimate the
amount of support within a crowd for less exceptional projects.
[2]
Automatic Image Dataset Construction from Click-through Logs Using Deep
Neural Network
Session 9: Deep Learning and Multimedia
/
Bai, Yalong
/
Yang, Kuiyuan
/
Yu, Wei
/
Xu, Chang
/
Ma, Wei-Ying
/
Zhao, Tiejun
Proceedings of the 2015 ACM International Conference on Multimedia
2015-10-26
p.441-450
© Copyright 2015 ACM
Summary: Labelled image datasets are the backbone for high-level image understanding
tasks with wide application scenarios, and continuously drive and evaluate the
progress of feature design and supervised learning models. Recently,
million-scale labelled image datasets have further contributed to the rebirth
of deep convolutional neural networks, bypassing manually designed handcrafted
features. However, image dataset construction remains largely manual and quite
labor intensive, often taking years of effort to build a million-scale dataset
of high quality. In this paper, we propose a deep-learning-based method to
construct large-scale image datasets automatically. Specifically, word
representations and image representations are learned in a deep neural network
from a large volume of click-through logs, and are further used to define
word-word similarity and image-word similarity. These two similarities are
used to automate the two labor-intensive steps of manual image dataset
construction: query formation and noisy image removal. With a newly proposed
cross convolutional filter regularizer, we can construct a million-scale image
dataset in one week. Finally, two image
datasets are constructed to verify the effectiveness of the method. In addition
to scale, the automatically constructed dataset has comparable accuracy,
diversity and cross-dataset generalization with manually labelled image
datasets.
[3]
LightLDA: Big Topic Models on Modest Computer Clusters
Technical Papers 2
/
Yuan, Jinhui
/
Gao, Fei
/
Ho, Qirong
/
Dai, Wei
/
Wei, Jinliang
/
Zheng, Xun
/
Xing, Eric Po
/
Liu, Tie-Yan
/
Ma, Wei-Ying
Proceedings of the 2015 International Conference on the World Wide Web
2015-05-18
v.1
p.1351-1361
© Copyright 2015 ACM
Summary: When building large-scale machine learning (ML) programs, such as massive
topic models or deep neural networks with up to trillions of parameters and
training examples, one usually assumes that such massive tasks can only be
attempted with industrial-sized clusters with thousands of nodes, which are out
of reach for most practitioners and academic researchers. We consider this
challenge in the context of topic modeling on web-scale corpora, and show that
with a modest cluster of as few as 8 machines, we can train a topic model with
1 million topics and a 1-million-word vocabulary (for a total of 1 trillion
parameters), on a document collection with 200 billion tokens -- a scale not
yet reported even with thousands of machines. Our major contributions include:
1) a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose
running cost is (surprisingly) agnostic of model size, and empirically
converges nearly an order of magnitude more quickly than current
state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big
model challenge, where each worker machine schedules the fetch/use of
sub-models as needed, resulting in a frugal use of limited memory capacity and
network bandwidth; 3) a differential data-structure for model storage, which
uses separate data structures for high- and low-frequency words to allow
extremely large models to fit in memory, while maintaining high inference
speed. These contributions are built on top of the Petuum open-source
distributed ML framework, and we provide experimental evidence showing how this
development puts massive data and models within reach on a small cluster, while
still enjoying proportional time cost reductions with increasing cluster size.
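The O(1) Metropolis-Hastings step described above can be sketched in toy form: propose a token's new topic by copying the topic of a random token in the same document (an O(1) draw), then accept or reject against the collapsed-LDA conditional. This is an illustrative sketch under simplified assumptions, not the LightLDA implementation, which alternates doc- and word-proposals backed by alias tables.

```python
import random

def mh_topic_step(old, doc_topics, n_dk, n_kw, n_k, alpha, beta, V):
    """One Metropolis-Hastings update of a single token's topic.

    Doc-proposal: copy the topic of a uniformly chosen token in the same
    document (an O(1) draw whose cost is independent of the number of
    topics), then accept/reject against the collapsed-LDA conditional
    p(k) ~ (n_dk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + V * beta).
    For brevity the token's own count is not excluded from the counts.
    """
    new = random.choice(doc_topics)         # O(1) proposal draw
    if new == old:
        return old

    def target(k):
        return (n_dk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + V * beta)

    def proposal(k):                        # mass of choice(doc_topics)
        return n_dk[k]

    accept = (target(new) * proposal(old)) / (target(old) * proposal(new))
    return new if random.random() < min(1.0, accept) else old
```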
[4]
Don't Wait!: How Timing Affects Coordination of Crowdfunding Donations
Studies of Coordination
/
Solomon, Jacob
/
Ma, Wenjuan
/
Wash, Rick
Proceedings of ACM CSCW 2015 Conference on Computer-Supported Cooperative
Work and Social Computing
2015-02-28
v.1
p.547-556
© Copyright 2015 ACM
Summary: Crowdfunding sites often impose deadlines for projects to receive their
requested funds. This deadline structure creates a difficult decision for
potential donors. Donors can donate early to a project to help it reach its
goal and to signal to other donors that the project is worthwhile. But donors
may also want to wait for a similar signal from others. We conduct an
experimental simulation of a crowdfunding website to explore how potential
donors to projects make this decision. We find evidence for both strategies in
our experiment; some donate early while others wait until the last second.
However, we also find that making an early donation is usually a better
strategy for donors because the amount of donations made early in a project's
campaign is often the only difference between that project being funded or not.
This finding suggests that crowdfunding sites need to develop designs, policies
and incentives that encourage people to make immediate donations so that the
site can most efficiently fund projects.
[5]
Bag-of-Words Based Deep Neural Network for Image Retrieval
Multimedia Grand Challenge
/
Bai, Yalong
/
Yu, Wei
/
Xiao, Tianjun
/
Xu, Chang
/
Yang, Kuiyuan
/
Ma, Wei-Ying
/
Zhao, Tiejun
Proceedings of the 2014 ACM International Conference on Multimedia
2014-11-03
p.229-232
© Copyright 2014 ACM
Summary: This work targets the image retrieval task held by the MSR-Bing Grand
Challenge. Image retrieval is considered a challenging task because of the gap
between low-level image representations and high-level textual query
representations. Recently developed deep neural networks shed light on
narrowing this gap by learning high-level image representations from raw
pixels. In this paper, we propose a bag-of-words based deep neural network for
the image retrieval task, which learns high-level image representations and
maps images into bag-of-words space. The DNN model is trained on large-scale
click-through data. The relevance between a query and an image is measured by
the cosine similarity between the query's bag-of-words representation and the
image's bag-of-words representation predicted by the DNN; the visual
similarity between images is likewise computed from high-level image
representations extracted via the DNN model. Finally, the PageRank algorithm
is used to further improve the ranking list by considering the visual
similarity of images for each query. The experimental results achieved
state-of-the-art performance and verified the effectiveness of the proposed
method.
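The relevance computation above (cosine similarity between a query's bag-of-words representation and the bag-of-words vector predicted for an image) reduces to the following sketch; the DNN's predicted vector is replaced here by a plain text-side BoW vector, and the vocabulary and texts are invented for illustration:

```python
import math
from collections import Counter

def bow_vector(text, vocab):
    """Map text into bag-of-words space over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = ["red", "car", "dog"]
query = bow_vector("red car", vocab)
image_a = bow_vector("red car", vocab)  # stand-in for a DNN-predicted BoW
image_b = bow_vector("dog car", vocab)
```

Ranking images by `cosine(query, image)` then gives `image_a` ahead of `image_b` for this query.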
[6]
Knowledge sharing and social media: Altruism, perceived online attachment
motivation, and perceived online relationship commitment
/
Ma, Will W. K.
/
Chan, Albert
Computers in Human Behavior
2014-10
v.39
n.0
p.51-58
Keywords: Knowledge sharing
Keywords: Perceived online attachment motivation
Keywords: Perceived online relationship commitment
Keywords: Altruism
Keywords: Social media
© Copyright 2014 Elsevier Ltd.
Summary: Social media, such as Facebook and Twitter, have become extremely popular.
Facebook, for example, has more than a billion registered users, and billions
of units of information are shared every day, including short
phrases, articles, photos, and audio and video clips. However, only a tiny
proportion of these sharing units trigger any type of knowledge exchange that
is ultimately beneficial to the users. This study draws on the theory of
belonging and the intrinsic motivation of altruism to explore the factors
contributing to knowledge sharing behavior. Using a survey of 299 high school
students applying for university after the release of the public examination
results, we find that perceived online attachment motivation (β = 0.31,
p < 0.001) and perceived online relationship commitment (β = 0.49, p < 0.001)
have positive, direct, and significant effects on online knowledge sharing
(R² = 0.568). Moreover, when introduced into the model, altruism has a direct
and significant effect on online knowledge sharing (β = 0.46, p < 0.001), and
the total variance explained by the extended model increases to 64.9%. The
implications of the findings are
discussed.
[7]
Indoor air quality monitoring system for smart buildings
Energy & environment
/
Chen, Xuxu
/
Zheng, Yu
/
Chen, Yubiao
/
Jin, Qiwei
/
Sun, Weiwei
/
Chang, Eric
/
Ma, Wei-Ying
Proceedings of the 2014 International Joint Conference on Pervasive and
Ubiquitous Computing
2014-09-13
v.1
p.471-475
© Copyright 2014 ACM
Summary: Many developing countries are suffering from air pollution, especially the
Particulate Matter with diameter of 2.5 micrometers or less (PM2.5). While
quite a few air quality monitoring stations have been built by governments in a
city's public areas, the indoor PM2.5 has not yet been monitored and dealt with
effectively. Though many office buildings have an HVAC (heating, ventilation,
and air conditioning) system, PM2.5 is not considered as a factor when the
system circulates fresh air from outdoors. This paper introduces a real system
that we have deployed in the offices of four Microsoft campuses in China. This
system instantly monitors indoor air quality on different floors of a building
(including office areas, gyms, garages, and restaurants), enabling Microsoft
employees to check the air quality of a place using a mobile phone or
checking a website. The information can guide a user's decision making, e.g.,
finding the right time to work out in the gym or turn on individual air filters
in her own office. Through analyzing the indoor and outdoor air quality data
collected over a long period, our system can even offer actionable,
energy-efficient suggestions to HVAC systems, e.g., automatically turning the
system on a few hours earlier than usual on a heavily polluted day, or
identifying the filters in the HVAC system that should be renewed.
[8]
Comparison of Enhanced Visual and Haptic Features in a Virtual Reality-Based
Haptic Simulation
Haptic Interaction
/
Clamann, Michael
/
Ma, Wenqi
/
Kaber, David B.
HCI International 2013: 15th International Conference on HCI, Part IV:
Interaction Modalities and Techniques
2013-07-21
v.4
p.551-560
Keywords: haptics; virtual reality; rehabilitation
© Copyright 2013 Springer-Verlag
Summary: An experiment was conducted to compare the learning effects following motor
skill training using three types of virtual reality simulations. Training and
testing were presented using virtual reality (VR) and standardized forms of
existing psychomotor tests, respectively. The VR training simulations included
haptic, visual and a combination of haptic and visual assistance designed to
accelerate training. A comparison of performance test results prior to and
following training revealed that conditions providing haptic assistance
yielded lower scores for fine motor skills than the visual-only aiding
condition. Similarly, training in the visual condition resulted in
comparatively lower cognitive skill scores. The present investigation
incorporating healthy subjects was designed as part of an ongoing research
effort to provide insight on the design of VR simulations for rehabilitation of
motor skills in patients with a history of mild traumatic brain injury (mTBI).
[9]
Impact of restrictive composition policy on user password choices
/
Campbell, John
/
Ma, Wanli
/
Kleeman, Dale
Behaviour and Information Technology
2011-05-01
v.30
n.3
p.379-388
© Copyright 2011 Taylor and Francis
Summary: This study investigates the efficacy of using a restrictive password
composition policy. The primary function of access controls is to restrict the
use of information systems and other computer resources to authorised users
only. Although more secure alternatives exist, password-based systems remain
the predominant method of user authentication. Prior research shows that
password security is often compromised by users who adopt inadequate password
composition and management practices. One particularly under-researched area is
whether restrictive password composition policies actually change user
behaviours in significant ways. The results of this study show that a password
composition policy reduces the similarity of passwords to dictionary words.
However, in this case the regime did not reduce the use of meaningful
information in passwords such as names and birth dates, nor did it reduce
password recycling.
[10]
Recommending friends and locations based on individual location history
/
Zheng, Yu
/
Zhang, Lizhu
/
Ma, Zhengxin
/
Xie, Xing
/
Ma, Wei-Ying
ACM Transactions on The Web
2011-02
v.5
n.1
p.5
© Copyright 2011 ACM
Summary: The increasing availability of location-acquisition technologies (GPS, GSM
networks, etc.) enables people to log their location histories with
spatio-temporal data. Such real-world location histories imply, to some extent,
users' interests in places, and bring us opportunities to understand the
correlation between users and locations. In this article, we move towards this
direction and report on a personalized friend and location recommender for the
geographical information systems (GIS) on the Web. First, in this recommender
system, a particular individual's visits to a geospatial region in the real
world are used as their implicit ratings on that region. Second, we measure the
similarity between users in terms of their location histories and recommend to
each user a group of potential friends in a GIS community. Third, we estimate
an individual's interests in a set of unvisited regions by involving his/her
location history and those of other users. Some unvisited locations that might
match their tastes can be recommended to the individual. A framework, referred
to as a hierarchical-graph-based similarity measurement (HGSM), is proposed to
uniformly model each individual's location history, and effectively measure the
similarity among users. In this framework, we take into account three factors:
1) the sequence property of people's outdoor movements, 2) the visited
popularity of a geospatial region, and 3) the hierarchical property of
geographic spaces. Further, we incorporated a content-based method into a
user-based collaborative filtering algorithm, which uses HGSM as the user
similarity measure, to estimate the rating of a user on an item. We evaluated
this recommender system based on the GPS data collected by 75 subjects over a
period of 1 year in the real world. As a result, HGSM outperforms related
similarity measures, namely similarity-by-count, cosine similarity, and Pearson
similarity measures. Moreover, beyond the item-based CF method and random
recommendations, our system provides users with more attractive locations and
better user experiences of recommendation.
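The rating-estimation step, user-based collaborative filtering with HGSM as the user-similarity measure, takes the generic form below; the similarity function and data here are stand-ins, not HGSM or the study's GPS data:

```python
def predict_rating(user, item, ratings, sim):
    """User-based CF: estimate a user's implicit rating on an item as a
    similarity-weighted average of other users' ratings on that item."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        s = sim(user, other)
        num += s * their_ratings[item]
        den += abs(s)
    return num / den if den else 0.0

# Illustrative data: visit counts as implicit ratings, fixed similarities.
ratings = {"a": {"park": 3}, "b": {"park": 5, "cafe": 4}, "c": {"cafe": 2}}
sims = {("a", "b"): 0.8, ("a", "c"): 0.2}
estimate = predict_rating("a", "cafe", ratings, lambda u, v: sims[(u, v)])
```

With these numbers the estimate is (0.8·4 + 0.2·2) / (0.8 + 0.2) = 3.6, i.e. the unvisited "cafe" would be recommended to user "a" with a fairly high predicted interest.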
[11]
Pricing guaranteed contracts in online display advertising
KM track: large-scale statistical techniques
/
Bharadwaj, Vijay
/
Ma, Wenjing
/
Schwarz, Michael
/
Shanmugasundaram, Jayavel
/
Vee, Erik
/
Xie, Jack
/
Yang, Jian
Proceedings of the 2010 ACM Conference on Information and Knowledge
Management
2010-10-26
p.399-408
© Copyright 2010 ACM
Summary: We consider the problem of pricing guaranteed contracts in online display
advertising. This problem has two key characteristics that when taken together
distinguish it from related offline and online pricing problems: (1) the
guaranteed contracts are sold months in advance, and at various points in time,
and (2) the inventory that is sold to guaranteed contracts -- user visits -- is
very high-dimensional, having hundreds of possible attributes, and advertisers
can potentially buy any of the very large number (many trillions) of
combinations of these attributes. Consequently, traditional pricing methods
such as real-time or combinatorial auctions, or optimization-based pricing
based on self- and cross-elasticities are not directly applicable to this
problem. We hence propose a new pricing method, whereby the price of a
guaranteed contract is computed based on the prices of the individual user
visits that the contract is expected to get. The price of each individual user
visit is in turn computed using historical sales prices that are negotiated
between a sales person and an advertiser, and we propose two different variants
in this context. Our evaluation using real guaranteed contracts shows that the
proposed pricing method is accurate in the sense that it can effectively
predict the prices of other (out-of-sample) historical contracts.
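The proposed pricing rule, deriving a contract's price from the predicted prices of the individual user visits it is expected to receive, can be illustrated minimally; the attribute combinations and per-visit prices below are invented for illustration:

```python
def price_contract(expected_visits, visit_price):
    """Price a guaranteed contract as the average predicted price of
    the individual user visits it is expected to receive."""
    prices = [visit_price(v) for v in expected_visits]
    return sum(prices) / len(prices)

# Hypothetical per-visit price model: a lookup over historical negotiated
# prices, keyed by the visit's attribute combination (invented values).
historical = {("US", "sports"): 2.0, ("US", "news"): 1.0, ("UK", "sports"): 1.5}
visits = [("US", "sports"), ("US", "news"), ("US", "sports"), ("UK", "sports")]
cpm = price_contract(visits, historical.get)
```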
[12]
Mining adjacent markets from a large-scale ads video collection for image
advertising
Poster presentations
/
Feng, Guwen
/
Wang, Xin-Jing
/
Zhang, Lei
/
Ma, Wei-Ying
Proceedings of the 33rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2010-07-19
p.893-894
Keywords: adjacent marketing, image advertising, video retrieval
© Copyright 2010 ACM
Summary: The research on image advertising is still in its infancy. Most previous
approaches suggest ads by directly matching an ad to a query image, which lacks
the power to identify ads from adjacent markets. In this paper, we tackle the
problem by mining knowledge on adjacent markets from ads videos with a novel
Multi-Modal Dirichlet Process Mixture Sets model, which is a unified model of
(video frames) clustering and (ads) ranking. Our approach is not only capable
of discovering relevant ads (e.g. car ads for a query car image), but also
suggesting ads from adjacent markets (e.g. tyre ads). Experimental results show
that our proposed approach is fairly effective.
[13]
A large-scale study on map search logs
/
Xiao, Xiangye
/
Luo, Qiong
/
Li, Zhisheng
/
Xie, Xing
/
Ma, Wei-Ying
ACM Transactions on The Web
2010-07
v.4
n.3
p.8
© Copyright 2010 ACM
Summary: Map search engines, such as Google Maps, Yahoo! Maps, and Microsoft Live
Maps, allow users to explicitly specify a target geographic location, either in
keywords or on the map, and to search businesses, people, and other information
of that location. In this article, we report a first study on a million-entry
map search log. We identify three key attributes of a map search record -- the
keyword query, the target location and the user location, and examine the
characteristics of these three dimensions separately as well as the
associations between them. Comparing our results with those previously reported
on logs of general search engines and mobile search engines, including those
for geographic queries, we discover the following unique features of map
search: (1) People use longer queries and modify queries more frequently in a
session than in general search and mobile search; People view fewer result
pages per query than in general search; (2) The popular query topics in map
search are different from those in general search and mobile search; (3) The
target locations in a session change within 50 kilometers for almost 80% of the
sessions; (4) Queries, search target locations and user locations (both at the
city level) all follow the power law distribution; (5) One third of queries are
issued for target locations within 50 kilometers from the user locations; (6)
The distribution of a query over target locations appears to follow the
geographic location of the queried entity.
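Finding (4), that queries and locations follow a power-law distribution, is commonly checked by fitting a line to log(frequency) versus log(rank); below is a sketch with synthetic frequencies, not the paper's data:

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) vs. log(rank); an
    approximately straight line with negative slope on the log-log plot
    is consistent with a power-law distribution."""
    pts = [(math.log(r), math.log(f))
           for r, f in enumerate(sorted(freqs, reverse=True), start=1)]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den

# Synthetic frequencies drawn exactly from f(r) = 1000 / r.
slope = loglog_slope([1000.0 / r for r in range(1, 11)])
```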
[14]
Back to the future: bleeding-edge IVR
Human interfaces
/
Bouzid, Ahmed
/
Ma, Weiye
interactions
2010-05
v.17
n.3
p.18-20
© Copyright 2010 ACM
[15]
Diversifying landmark image search results by learning interested views from
community photos
WWW 2010 demos
/
Ren, Yuheng
/
Yu, Mo
/
Wang, Xin-Jing
/
Zhang, Lei
/
Ma, Wei-Ying
Proceedings of the 2010 International Conference on the World Wide Web
2010-04-26
v.1
p.1289-1292
Keywords: landmark image search, set-based ranking, user interest modeling
© Copyright 2010 ACM
Summary: In this paper, we demonstrate a novel landmark photo search and browsing
system: Agate, which ranks landmark image search results considering their
relevance, diversity and quality. Agate learns from community photos the most
interesting aspects and related activities of a landmark, and adaptively
generates a Table of Contents (TOC) as a summary of the attractions to
facilitate the user browsing. Image search results are thus re-ranked with the
TOC so as to ensure a quick overview of the attractions of the landmarks. A
novel non-parametric TOC generation and set-based ranking algorithm, MoM-DPM
Sets, is proposed as the key technology of Agate. Experimental results based on
human evaluation show the effectiveness of our model and users' preference for
Agate.
[16]
Understanding transportation modes based on GPS data for web applications
/
Zheng, Yu
/
Chen, Yukun
/
Li, Quannan
/
Xie, Xing
/
Ma, Wei-Ying
ACM Transactions on The Web
2010-01
v.4
n.1
p.1
© Copyright 2010 ACM
Summary: User mobility has given rise to a variety of Web applications, in which the
global positioning system (GPS) plays many important roles in bridging between
these applications and end users. As a kind of human behavior, transportation
modes, such as walking and driving, can provide pervasive computing systems
with more contextual information and enrich a user's mobility with informative
knowledge. In this article, we report on an approach based on supervised
learning to automatically infer users' transportation modes, including driving,
walking, taking a bus and riding a bike, from raw GPS logs. Our approach
consists of three parts: a change point-based segmentation method, an inference
model and a graph-based post-processing algorithm. First, we propose a change
point-based segmentation method to partition each GPS trajectory into separate
segments of different transportation modes. Second, from each segment, we
identify a set of sophisticated features, which are not affected by differing
traffic conditions (e.g., a person's direction when in a car is constrained
more by the road than any change in traffic conditions). Later, these features
are fed to a generative inference model to classify the segments of different
modes. Third, we conduct graph-based postprocessing to further improve the
inference performance. This postprocessing algorithm considers both the
commonsense constraints of the real world and typical user behaviors based on
locations in a probabilistic manner. The advantages of our method over the
related works include three aspects. (1) Our approach can effectively segment
trajectories containing multiple transportation modes. (2) Our work mined the
location constraints from user-generated GPS logs, while being independent of
additional sensor data and map information like road networks and bus stops.
(3) The model learned from the dataset of some users can be applied to infer
GPS data from others. Using the GPS logs collected by 65 people over a period
of 10 months, we evaluated our approach via a set of experiments. As a result,
based on the change-point-based segmentation method and Decision Tree-based
inference model, we achieved prediction accuracy greater than 71 percent.
Further, the graph-based post-processing algorithm improved performance by an
additional 4 percent.
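A minimal toy version of the change-point-based segmentation cuts a speed sequence wherever it crosses a walking-speed threshold; the paper's actual method is more elaborate, so this is only a sketch of the idea (the 2.5 m/s threshold is an assumption):

```python
def segment_by_speed(speeds, walk_max=2.5):
    """Partition a sequence of per-point speeds (m/s) into segments,
    cutting at change points where the speed crosses a walking-speed
    threshold, a crude proxy for a transportation-mode change."""
    segments, current = [], [speeds[0]]
    for s in speeds[1:]:
        if (s <= walk_max) != (current[-1] <= walk_max):
            segments.append(current)   # mode flipped: close the segment
            current = [s]
        else:
            current.append(s)
    segments.append(current)
    return segments
```

For example, a trajectory that starts at walking pace, speeds up, then slows again yields three candidate segments, each of which would then be classified by the inference model.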
[17]
Incorporating site-level knowledge to extract structured data from web
forums
Data mining/session: learning
/
Yang, Jiang-Ming
/
Cai, Rui
/
Wang, Yida
/
Zhu, Jun
/
Zhang, Lei
/
Ma, Wei-Ying
Proceedings of the 2009 International Conference on the World Wide Web
2009-04-20
p.181-190
Keywords: Markov logic networks (MLNS), information extraction, site-level knowledge,
structured data, web forums
© Copyright 2009 International World Wide Web Conference Committee (IW3C2)
Summary: Web forums have become an important data resource for many web applications,
but extracting structured data from unstructured web forum pages is still a
challenging task due to both complex page layout designs and unrestricted user
created posts. In this paper, we study the problem of structured data
extraction from various web forum sites. Our target is to find a solution as
general as possible to extract structured data, such as post title, post
author, post time, and post content from any forum site. In contrast to most
existing information extraction methods, which only leverage the knowledge
inside an individual page, we incorporate both page-level and site-level
knowledge and employ Markov logic networks (MLNs) to effectively integrate all
useful evidence by learning their importance automatically. Site-level
knowledge includes (1) the linkages among different object pages, such as list
pages and post pages, and (2) the interrelationships of pages belonging to the
same object. The experimental results on 20 forums show a very encouraging
information extraction performance, and demonstrate the ability of the proposed
approach on various forums. We also show that performance is limited when only
page-level knowledge is used, whereas incorporating site-level knowledge
significantly improves both precision and recall.
[18]
Mining interesting locations and travel sequences from GPS trajectories
User interfaces and mobile web/session: mobile web
/
Zheng, Yu
/
Zhang, Lizhu
/
Xie, Xing
/
Ma, Wei-Ying
Proceedings of the 2009 International Conference on the World Wide Web
2009-04-20
p.791-800
Keywords: GPS trajectories, location recommendation, spatial data mining, user travel
experience
© Copyright 2009 International World Wide Web Conference Committee (IW3C2)
Summary: The increasing availability of GPS-enabled devices is changing the way
people interact with the Web, and brings us a large amount of GPS trajectories
representing people's location histories. In this paper, based on multiple
users' GPS trajectories, we aim to mine interesting locations and classical
travel sequences in a given geospatial region. Here, interesting locations mean
the culturally important places, such as Tiananmen Square in Beijing, and
frequented public areas, like shopping malls and restaurants, etc. Such
information can help users understand surrounding locations, and would enable
travel recommendation. In this work, we first model multiple individuals'
location histories with a tree-based hierarchical graph (TBHG). Second, based
on the TBHG, we propose a HITS (Hypertext Induced Topic Search)-based inference
model, which regards an individual's access on a location as a directed link
from the user to that location. This model infers the interest of a location by
taking into account the following three factors. 1) The interest of a location
depends on not only the number of users visiting this location but also these
users' travel experiences. 2) Users' travel experiences and location interests
have a mutual reinforcement relationship. 3) The interest of a location and the
travel experience of a user are relative values and are region-related. Third,
we mine the classical travel sequences among locations considering the
interests of these locations and users' travel experiences. We evaluated our
system using a large GPS dataset collected by 107 users over a period of one
year in the real world. As a result, our HITS-based inference model
outperformed baseline approaches such as rank-by-count and rank-by-frequency.
Moreover, when considering users' travel experiences and location interests,
our approach also outperformed baselines such as rank-by-count and
rank-by-interest.
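The mutual reinforcement in the HITS-based model, with users' travel experience as hub scores and locations' interest as authority scores over user-to-location visit links, can be sketched as follows (toy visit data, not the paper's dataset):

```python
def hits(visits, n_iter=20):
    """Iterate mutual reinforcement over user->location visit links:
    a location's interest (authority) sums the experience of its
    visitors; a user's experience (hub) sums the interest of the
    locations they visited. Scores are normalized each round."""
    users = sorted({u for u, _ in visits})
    locs = sorted({l for _, l in visits})
    hub = {u: 1.0 for u in users}
    auth = {l: 1.0 for l in locs}
    for _ in range(n_iter):
        auth = {l: sum(hub[u] for u, v in visits if v == l) for l in locs}
        hub = {u: sum(auth[l] for w, l in visits if w == u) for u in users}
        za, zh = sum(auth.values()), sum(hub.values())
        auth = {l: a / za for l, a in auth.items()}
        hub = {u: h / zh for u, h in hub.items()}
    return hub, auth

# Toy visit links: location "A" is visited by three users, "B" by one.
visits = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u3", "A")]
hub, auth = hits(visits)
```

On this toy data the widely visited "A" ends up with the higher interest score, and "u1", who visited both places, ends up with the highest experience score, matching the reinforcement intuition in the abstract.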
[19]
Browsing on small displays by transforming Web pages into hierarchically
structured subpages
/
Xiao, Xiangye
/
Luo, Qiong
/
Hong, Dan
/
Fu, Hongbo
/
Xie, Xing
/
Ma, Wei-Ying
ACM Transactions on The Web
2009-01
v.3
n.1
p.4
© Copyright 2009 ACM
Summary: We propose a new Web page transformation method to facilitate Web browsing
on handheld devices such as Personal Digital Assistants (PDAs). In our
approach, an original Web page that does not fit on the screen is transformed
into a set of subpages, each of which fits on the screen. This transformation
is done through slicing the original page into page blocks iteratively, with
several factors considered. These factors include the size of the screen, the
size of each page block, the number of blocks in each transformed page, the
depth of the tree hierarchy that the transformed pages form, as well as the
semantic coherence between blocks. We call the tree hierarchy of the
transformed pages an SP-tree. In an SP-tree, an internal node consists of a
textually enhanced thumbnail image with hyperlinks, and a leaf node is a block
extracted from a subpage of the original Web page. We adaptively adjust the
fanout and the height of the SP-tree so that each thumbnail image is clear
enough for users to read, while at the same time, the number of clicks needed
to reach a leaf page is few. Through this transformation algorithm, we preserve
the contextual information in the original Web page and reduce scrolling. We
have implemented this transformation module on a proxy server and have
conducted usability studies on its performance. Our system achieved a shorter
task completion time compared with that of transformations from the Opera
browser in nine of ten tasks. The average improvement on familiar pages was
44%. The average improvement on unfamiliar pages was 37%. Subjective responses
were positive.
[20]
Search-based query suggestion
Poster session 2/information retrieval
/
Yang, Jiang-Ming
/
Cai, Rui
/
Jing, Feng
/
Wang, Shuo
/
Zhang, Lei
/
Ma, Wei-Ying
Proceedings of the 2008 ACM Conference on Information and Knowledge
Management
2008-10-26
p.1439-1440
© Copyright 2008 ACM
Summary: In this paper, we propose a unified strategy that combines the query
log and search results for query suggestion. In this way, we leverage both
users' search intentions for popular queries and the power of search engines
for unpopular queries. The suggested queries are ranked according to their
relevance and quality, and each suggestion is described with a rich snippet
including a photo and a related description.
[21]
Understanding mobility based on GPS data
Location-aware applications
/
Zheng, Yu
/
Li, Quannan
/
Chen, Yukun
/
Xie, Xing
/
Ma, Wei-Ying
Proceedings of the 2008 International Conference on Ubiquitous Computing
2008-09-21
p.312-321
Keywords: GPS, GeoLife, infer transportation mode, machine learning, recognize human
behavior
© Copyright 2008 ACM
Summary: Both recognizing human behavior and understanding a user's mobility from
sensor data are critical issues in ubiquitous computing systems. As a kind of
user behavior, the transportation mode that a user takes, such as walking or
driving, can enrich the user's mobility with informative knowledge and
provide pervasive computing systems with more context information. In this
paper, we propose an approach based on supervised learning to infer people's
transportation modes from their GPS logs. The contribution of this work lies
in two aspects. On one hand, we identify a set of sophisticated features that
are more robust to traffic conditions than those used in previous work. On the
other hand, we propose a graph-based post-processing algorithm to further
improve the inference performance. This algorithm considers both commonsense
constraints of the real world and typical location-based user behavior in a
probabilistic manner. Using GPS logs collected by 65 people over a period of
10 months, we evaluated our approach through a set of experiments. Based on
the change-point-based segmentation method and a Decision Tree-based inference
model, the new features brought an eight percent improvement in inference
accuracy over previous results, and the graph-based post-processing achieved a
further four percent enhancement.
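The pipeline in the summary above can be sketched minimally: classify each GPS segment from speed features, then smooth the resulting mode sequence. The thresholds, feature set, and single-flip smoothing rule below are illustrative stand-ins, not the paper's learned Decision Tree or its graph-based probabilistic post-processing.

```python
# Illustrative sketch (not the paper's implementation): infer a transportation
# mode per GPS segment from simple speed features, then apply a post-processing
# pass that enforces a commonsense constraint (an isolated one-segment mode
# flip between two identical neighbors is relabeled to match them).

def classify_segment(avg_speed_mps, max_speed_mps):
    """Hand-picked threshold rules standing in for a trained Decision Tree."""
    if max_speed_mps < 2.5:
        return "walk"
    if max_speed_mps < 7.0:
        return "bike"
    return "drive"

def smooth_modes(modes):
    """Relabel any segment whose two neighbors agree with each other but not
    with it -- a crude stand-in for the paper's graph-based post-processing."""
    out = list(modes)
    for i in range(1, len(out) - 1):
        if out[i - 1] == out[i + 1] != out[i]:
            out[i] = out[i - 1]
    return out

# A short walk with one noisy middle segment misread as "bike":
segments = [(1.2, 1.8), (1.3, 3.0), (1.4, 2.0)]  # (avg, max) speed in m/s
raw = [classify_segment(avg, mx) for avg, mx in segments]
print(raw)                # the middle segment is misclassified
print(smooth_modes(raw))  # the isolated flip is corrected
```

The real system computes richer features per change-point segment (e.g. heading change rate) and weighs neighbor evidence probabilistically rather than by this majority rule.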
[22]
Directly optimizing evaluation measures in learning to rank
Learning to rank: 1
/
Xu, Jun
/
Liu, Tie-Yan
/
Lu, Min
/
Li, Hang
/
Ma, Wei-Ying
Proceedings of the 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2008-07-20
p.107-114
© Copyright 2008 ACM
Summary: One of the central issues in learning to rank for information retrieval is
to develop algorithms that construct ranking models by directly optimizing
evaluation measures used in information retrieval such as Mean Average
Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). Several such
algorithms including SVMmap and AdaRank have been proposed and their
effectiveness has been verified. However, the relationships between the
algorithms are not clear, and furthermore no comparisons have been conducted
between them. In this paper, we conduct a study on the approach of directly
optimizing evaluation measures in learning to rank for Information Retrieval
(IR). We focus on the methods that minimize loss functions upper bounding the
basic loss function defined on the IR measures. We first provide a general
framework for the study and analyze the existing algorithms of SVMmap and
AdaRank within the framework. The framework is based on upper bound analysis
and two types of upper bounds are discussed. Moreover, we show that we can
derive new algorithms on the basis of this analysis and create one example
algorithm called PermuRank. We have also conducted comparisons between SVMmap,
AdaRank, PermuRank, and conventional methods of Ranking SVM and RankBoost,
using benchmark datasets. Experimental results show that the methods based on
direct optimization of evaluation measures can always outperform conventional
methods of Ranking SVM and RankBoost. However, no significant difference exists
among the performances of the direct optimization methods themselves.
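For reference, the two evaluation measures named in the summary above can be computed directly. This is a minimal sketch of standard MAP/NDCG definitions (per query), not the paper's optimization algorithms; the gain and discount conventions below are one common choice.

```python
import math

def average_precision(rels):
    """AP for one query: rels is binary relevance in ranked order."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i
    return score / max(hits, 1)

def ndcg(gains, k=None):
    """NDCG@k for one query from graded relevance in ranked order,
    using the (2^rel - 1) / log2(rank + 1) convention."""
    k = k or len(gains)
    def dcg(g):
        return sum((2 ** x - 1) / math.log2(i + 2) for i, x in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2
print(ndcg([3, 2, 3, 0, 1, 2]))
```

MAP averages `average_precision` over queries. Both measures are piecewise constant in the ranking scores, which is why direct optimization requires the upper-bounding loss functions the paper studies.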
[23]
Exploring traversal strategy for web forum crawling
Analysis of social networks
/
Wang, Yida
/
Yang, Jiang-Ming
/
Lai, Wei
/
Cai, Rui
/
Zhang, Lei
/
Ma, Wei-Ying
Proceedings of the 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2008-07-20
p.459-466
© Copyright 2008 ACM
Summary: In this paper, we study the problem of Web forum crawling. Web forums have
become an important data source for many Web applications, but forum crawling
remains a challenging task due to the complex in-site link structures and login
controls of most forum sites. Without carefully selecting the traversal path, a
generic crawler usually downloads many duplicate and invalid pages from forums,
wasting both precious bandwidth and limited storage space. To
crawl forum data more effectively and efficiently, in this paper, we propose an
automatic approach to exploring an appropriate traversal strategy to direct the
crawling of a given target forum. In detail, the traversal strategy consists of
the identification of the skeleton links and the detection of the page-flipping
links. The skeleton links instruct the crawler to crawl only valuable pages
while avoiding duplicate and uninformative ones, and the page-flipping links
tell the crawler how to completely download a long discussion thread, which is
usually spread across multiple pages in Web forums. Extensive experimental
results on several forums show the encouraging performance of our approach.
Following the discovered traversal strategy, our forum crawler can archive more
informative pages in comparison with previous related work and a commercial
generic crawler.
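The traversal strategy the summary above describes can be sketched as a breadth-first crawl restricted to skeleton and page-flipping links. In the paper both link types are learned automatically; the URL-pattern predicates and the toy site below are hypothetical stand-ins.

```python
from collections import deque

# Hypothetical predicates standing in for the learned link classifiers:
def is_skeleton(url):
    """Board/thread listing links that lead to valuable pages."""
    return "/board/" in url or "/thread/" in url

def is_page_flip(url):
    """Links to subsequent pages of the same discussion thread."""
    return "page=" in url

def crawl(start, get_links, limit=100):
    """BFS that follows only skeleton and page-flipping links, so pages like
    user profiles and login forms are never fetched."""
    seen, order = {start}, []
    queue = deque([start])
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen and (is_skeleton(link) or is_page_flip(link)):
                seen.add(link)
                queue.append(link)
    return order

# A toy forum link graph:
site = {
    "/board/1": ["/thread/9", "/user/5", "/thread/9?page=2"],
    "/thread/9": ["/thread/9?page=2", "/login"],
    "/thread/9?page=2": [],
    "/user/5": [], "/login": [],
}
print(crawl("/board/1", lambda u: site.get(u, [])))
```

Here `/user/5` and `/login` are skipped as uninformative, while the page-flipping link ensures the full thread is archived.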
[24]
Rich media and web 2.0
Panels
/
Chang, Edward
/
Ong, Ken
/
Boll, Susanne
/
Ma, Wei-Ying
Proceedings of the 2008 International Conference on the World Wide Web
2008-04-21
p.1259-1260
Keywords: rich media, web 2.0
© Copyright 2008 International World Wide Web Conference Committee (IW3C2)
Summary: Rich media data, such as video, imagery, music, and gaming, no longer play
just a supporting role to text on the World Wide Web. Thanks to Web 2.0, rich
media is the primary content on sites such as Flickr, PicasaWeb, YouTube, and
QQ. Because of massive user-generated content, the volume of rich media
transmitted on the Internet has surpassed that of text. It is vital
to properly manage these data to ensure efficient bandwidth utilization, to
support effective indexing and search, and to safeguard copyrights (just to
name a few). This panel invites both researchers and practitioners to discuss
the challenges of Web-scale media-data management. In particular, the panelists
will address issues such as leveraging Rich Media and Web 2.0, indexing,
search, and scalability.
[25]
FRank: a ranking method with fidelity loss
Learning to rank II
/
Tsai, Ming-Feng
/
Liu, Tie-Yan
/
Qin, Tao
/
Chen, Hsin-Hsi
/
Ma, Wei-Ying
Proceedings of the 30th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval
2007-07-23
p.383-390
© Copyright 2007 ACM
Summary: The ranking problem is becoming important in many fields, especially
information retrieval (IR). Many machine learning techniques have been proposed
for it, such as RankSVM, RankBoost, and RankNet. Among them, RankNet, which is
based on a probabilistic ranking framework, has led to promising results and
has been applied in a commercial Web search engine. In
this paper we conduct further study on the probabilistic ranking framework and
provide a novel loss function named fidelity loss for measuring loss of
ranking. The fidelity loss not only inherits the effective properties of the
probabilistic ranking framework in RankNet, but also possesses new properties
that are helpful for ranking: it can reach a minimum of zero for each document
pair, and it has a finite upper bound, which is necessary for conducting
query-level normalization. We also propose an algorithm named FRank, based on
a generalized additive model, to minimize the fidelity loss and learn an
effective ranking function. We evaluated the proposed algorithm on two
datasets: a TREC dataset and a real Web search dataset. The experimental
results show that the proposed FRank algorithm outperforms other
learning-based ranking methods on both the conventional IR problem and Web
search.
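The pairwise loss named in the summary above can be written down concretely. This is a sketch of the fidelity loss as defined in the FRank paper, with the modeled pairwise probability taken as a logistic of the score difference, as in RankNet; the example scores are illustrative.

```python
import math

def pairwise_prob(score_i, score_j):
    """Modeled probability that document i ranks above document j
    (logistic of the score difference, as in RankNet)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def fidelity_loss(p_star, p):
    """Fidelity loss for one pair: 1 - (sqrt(p*·p) + sqrt((1-p*)·(1-p))),
    where p_star is the target probability and p the modeled one.
    It is zero exactly when p == p_star and is bounded above by 1."""
    return 1.0 - (math.sqrt(p_star * p) + math.sqrt((1 - p_star) * (1 - p)))

# Target says document i should rank above j (p_star = 1):
print(fidelity_loss(1.0, pairwise_prob(2.0, 0.5)))  # small: pair ordered correctly
print(fidelity_loss(1.0, pairwise_prob(0.5, 2.0)))  # larger: pair inverted
```

The finite upper bound (at most 1 per pair) is what makes the query-level normalization mentioned in the summary possible, in contrast to RankNet's unbounded cross-entropy loss.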