Cooperative leases: scalable consistency maintenance in content distribution networks | | BIBAK | Full-Text | 1-12 | |
Anoop Ninan; Purushottam Kulkarni; Prashant Shenoy; Krithi Ramamritham; Renu Tewari | |||
In this paper, we argue that cache consistency mechanisms designed for
stand-alone proxies do not scale to the large number of proxies in a content
distribution network and are not flexible enough to allow consistency
guarantees to be tailored to object needs. To meet the twin challenges of
scalability and flexibility, we introduce the notion of cooperative consistency
along with a mechanism, called cooperative leases, to achieve it. By supporting
Δ-consistency semantics and by using a single lease for multiple proxies,
cooperative leases allows the notion of leases to be applied in a flexible,
scalable manner to CDNs. Further, the approach employs application-level
multicast to propagate server notifications to proxies in a scalable manner. We
implement our approach in the Apache web server and the Squid proxy cache and
demonstrate its efficacy using a detailed experimental evaluation. Our results
show a factor of 2.5 reduction in server message overhead and a 20% reduction
in server state space overhead when compared to original leases, albeit at an
increased inter-proxy communication overhead. Keywords: content distribution networks, data consistency, data dissemination, dynamic
data, leases, pull, push, scalability, world wide web |
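To make the mechanism above concrete, the following is a minimal illustrative sketch (in Python; invented names, not taken from the paper) of one cooperative lease shared by a group of proxies: the server sends a single notification per lease to a leader proxy, which then relays it to the other members over an application-level multicast channel.

# Illustrative sketch only: one lease with a delta-consistency bound shared by
# a proxy group, with a leader that fans server notifications out to members.
from dataclasses import dataclass, field
import time

@dataclass
class CooperativeLease:
    object_id: str
    delta: float              # delta-consistency bound, in seconds
    duration: float           # lease lifetime, in seconds
    leader: str               # proxy chosen to receive server notifications
    members: set = field(default_factory=set)
    granted_at: float = field(default_factory=time.time)

    def is_valid(self) -> bool:
        return time.time() - self.granted_at < self.duration

def server_notify(lease: CooperativeLease, new_version: int, send):
    """Server sends ONE message per lease; the leader forwards it to
    lease.members over an application-level multicast tree."""
    if lease.is_valid():
        send(lease.leader, ("INVALIDATE", lease.object_id, new_version))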
An evaluation of TCP splice benefits in web proxy servers | | BIBAK | Full-Text | 13-24 | |
Marcel-Catalin Rosu; Daniela Rosu | |||
This study is the first to evaluate the performance benefits of using the
recently proposed TCP Splice kernel service in Web proxy servers. Previous
studies show that splicing client and server TCP connections in the IP layer
improves the throughput of proxy servers like firewalls and content routers by
reducing the data transfer overheads. In a Web proxy server, data transfer
overheads represent a relatively large fraction of the request processing
overheads, in particular when content is not cacheable or the proxy cache is
memory-based. The study is conducted with a socket-level implementation of TCP
Splice. Compared to IP-level implementations, socket-level implementations make
possible the splicing of connections with different TCP characteristics, and
improve response times by reducing recovery delay after a packet loss. The
experimental evaluation is focused on HTTP request types for which the proxy
can fully exploit the TCP Splice service, which are the requests for
non-cacheable content and SSL tunneling. The experimental testbed includes an
emulated WAN environment and benchmark applications for HTTP/1.0 Web client,
Web server, and Web proxy running on AIX RS/6000 machines. Our experiments
demonstrate that TCP Splice enables reductions in CPU utilization of 10-43%,
depending on file sizes and request rates. Larger relative reductions are
observed when tunneling SSL connections, in particular for small file
transfers. Response times are also reduced by up to 1.8 seconds. Keywords: TCP splice, web proxy |
Clarifying the fundamentals of HTTP | | BIBAK | Full-Text | 25-36 | |
Jeffrey C. Mogul | |||
The simplicity of HTTP was a major factor in the success of the Web.
However, as both the protocol and its uses have evolved, HTTP has grown
complex. This complexity results in numerous problems, including confused
implementors, interoperability failures, difficulty in extending the protocol,
and a long specification without much documented rationale.
Many of the problems with HTTP can be traced to unfortunate choices about fundamental definitions and models. This paper analyzes the current (HTTP/1.1) protocol design, showing how it fails in certain cases, and how to improve these fundamentals. Some problems with HTTP can be fixed simply by adopting new models and terminology, allowing us to think more clearly about implementations and extensions. Other problems require explicit (but compatible) protocol changes. Keywords: HTTP, protocol design |
Streaming speech: a framework for generating and streaming 3D text-to-speech and audio presentations to wireless PDAs as specified using extensions to SMIL | | BIBAK | Full-Text | 37-44 | |
Stuart Goose; Sreedhar Kodlahalli; William Pechter; Rune Hjelsvold | |||
While monochrome unformatted text and richly colored graphical content are
both capable of conveying a message, well designed graphical content has the
potential for better engaging the human sensory system. It is our contention
that the author of an audio presentation should be afforded the benefit of
judiciously exploiting the human aural perceptual ability to deliver content in
a more compelling, concise and realistic manner. While contemporary streaming
media players and voice browsers share the ability to render content
non-textually, neither technology is currently capable of rendering three
dimensional media. The contributions described in this paper are proposed 3D
audio extensions to SMIL and a server-based framework able to receive a request
and, on-demand, process such a SMIL file and dynamically create the multiple
simultaneous audio objects, spatialize them in 3D space, multiplex them into a
single stereo audio stream, and prepare it for transmission to a
mobile device. To the knowledge of the authors, this is the first reported
solution for delivering and rendering on a commercially available wireless
handheld device a rich 3D audio listening experience as described by a markup
language. Naturally, in addition to mobile devices this solution also works
with desktop streaming media players. Keywords: 3D audio, PDA, SMIL, accessibility, location-based, mobile, spatialization,
speech synthesis, streaming, wireless |
Multimedia meets computer graphics in SMIL2.0: a time model for the web | | BIBAK | Full-Text | 45-53 | |
Patrick Schmitz | |||
Multimedia scheduling models provide a rich variety of tools for managing
the synchronization of media like video and audio, but generally have an
inflexible model for time itself. In contrast, modern animation models in the
computer graphics community generally lack tools for synchronization and
structural time, but allow for a flexible concept of time, including variable
pacing, acceleration and deceleration and other tools useful for controlling
and adapting animation behaviors. Multimedia authors have been forced to choose
one set of features over the other, limiting the range of presentations they
can create. Some programming models have addressed some of these problems, but
provided no declarative means for authors and authoring tools to leverage the
functionality. This paper describes a new model incorporated into SMIL 2.0 that
combines the strengths of scheduling models with the flexible time
manipulations of animation models. The implications of this integration are
discussed with respect to scheduling and structured time, drawing upon
experience with SMIL 2.0 timing and synchronization, and the integration with
XHTML. Keywords: animation, multimedia, synchronization, timing |
OCTOPUS: aggressive search of multi-modality data using multifaceted knowledge base | | BIBAK | Full-Text | 54-64 | |
Jun Yang; Qing Li; Yueting Zhuang | |||
An important trend in Web information processing is the support of
multimedia retrieval. However, the most prevalent paradigm for multimedia
retrieval, content-based retrieval (CBR), is a rather conservative one whose
performance depends on a set of specifically defined low-level features and a
carefully chosen sample object. In this paper, an aggressive search mechanism
called Octopus is proposed which addresses the retrieval of multi-modality data
using multifaceted knowledge. In particular, Octopus promotes a novel scenario
in which the user supplies seed objects of arbitrary modality as the hint of
his information need, and receives a set of multi-modality objects satisfying
his need. The foundation of Octopus is a multifaceted knowledge base
constructed on a layered graph model (LGM), which describes the relevance
between media objects from various perspectives. A link-analysis-based
retrieval algorithm built upon the LGM is proposed. A unique relevance feedback technique
is developed to update the knowledge base by learning from user behaviors, and
to enhance the retrieval performance in a progressive manner. A prototype
implementing the proposed approach has been developed to demonstrate its
feasibility and capability through illustrative examples. Keywords: layered graph model, link analysis, multi-modality data, multifaceted
knowledge base, multimedia retrieval, relevance feedback |
XL: an XML programming language for web service specification and composition | | BIBAK | Full-Text | 65-76 | |
Daniela Florescu; Andreas Grünhagen; Donald Kossmann | |||
We present an XML programming language specially designed for the
implementation of Web services. XL is portable and fully compliant with W3C
standards such as XQuery, XML Protocol, and XML Schema. One of the key features
of XL is that it allows programmers to concentrate on the logic of their
application. XL provides high-level and declarative constructs for actions
which are typically carried out in the implementation of a Web service; e.g.,
logging, error handling, retry of actions, workload management, events, etc.
Issues such as performance tuning (e.g., caching, horizontal partitioning,
etc.) should be carried out automatically by an implementation of the language.
In this way, the productivity of the programmers, the evolvability of the
programs, and the chances of achieving good performance are substantially
enhanced. Keywords: XML, programming language, web service |
Simulation, verification and automated composition of web services | | BIBAK | Full-Text | 77-88 | |
Srini Narayanan; Sheila A. McIlraith | |||
Web services -- Web-accessible programs and devices -- are a key application
area for the Semantic Web. With the proliferation of Web services and the
evolution towards the Semantic Web comes the opportunity to automate various
Web services tasks. Our objective is to enable markup and automated reasoning
technology to describe, simulate, compose, test, and verify compositions of Web
services. We take as our starting point the DAML-S DAML+OIL ontology for
describing the capabilities of Web services. We define the semantics for a
relevant subset of DAML-S in terms of a first-order logical language. With the
semantics in hand, we encode our service descriptions in a Petri Net formalism
and provide decision procedures for Web service simulation, verification and
composition. We also provide an analysis of the complexity of these tasks under
different restrictions on the DAML-S composite services we can describe.
Finally, we present an implementation of our analysis techniques. This
implementation takes as input a DAML-S description of a Web service,
automatically generates a Petri Net and performs the desired analysis. Such a
tool has broad applicability both as a back end to existing manual Web service
composition tools, and as a stand-alone tool for Web service developers. Keywords: DAML, automated reasoning, distributed systems, ontologies, semantic web,
web service composition, web services |
Semantic web support for the business-to-business e-commerce lifecycle | | BIBAK | Full-Text | 89-98 | |
David Trastour; Claudio Bartolini; Chris Preist | |||
If an e-services approach to electronic commerce is to become widespread,
standardisation of ontologies, message content and message protocols will be
necessary. In this paper, we present a lifecycle of a business-to-business
e-commerce interaction, and show how the Semantic Web can support a service
description language that can be used throughout this lifecycle. By using DAML,
we develop a service description language sufficiently expressive and flexible
to be used not only in advertisements, but also in matchmaking queries,
negotiation proposals and agreements. We also identify which operations must be
carried out on this description language if the B2B lifecycle is to be fully
supported. We do not propose specific standard protocols, but instead argue
that our operators are able to support a wide variety of interaction protocols,
and so will be fundamental irrespective of which protocols are finally adopted. Keywords: DAML, automated negotiation, e-commerce, matchmaking, semantic web, service
description |
A probabilistic approach to automated bidding in alternative auctions | | BIBA | Full-Text | 99-108 | |
Marlon Dumas; Lachlan Aldred; Guido Governatori; Arthur ter Hofstede; Nick Russell | |||
This paper presents an approach to develop bidding agents that participate in multiple alternative auctions, with the goal of obtaining an item at the lowest price. The approach consists of a prediction method and a planning algorithm. The prediction method exploits the history of past auctions in order to build probability functions capturing the belief that a bid of a given price may win a given auction. The planning algorithm computes the lowest price, such that by sequentially bidding in a subset of the relevant auctions, the agent can obtain the item at that price with an acceptable probability. The approach addresses the case where the auctions are for substitutable items with different values. Experimental results are reported, showing that the approach increases the payoff of their users and the welfare of the market. |
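The prediction-plus-planning idea above can be illustrated with a small sketch (Python; a simplification under assumed inputs, not the paper's algorithm): win probabilities are estimated empirically from past winning prices, and the planner returns the lowest candidate price whose probability of winning at least one target auction meets a threshold.

# Hedged illustration only: empirical win probabilities plus a threshold planner.
from bisect import bisect_right

def win_probability(past_winning_prices, p):
    """Empirical P(a bid of p wins) = fraction of past auctions closed at <= p."""
    prices = sorted(past_winning_prices)
    if not prices:
        return 0.0
    return bisect_right(prices, p) / len(prices)

def lowest_acceptable_price(histories, threshold=0.9, candidates=range(1, 501)):
    """histories: one list of past winning prices per target auction."""
    for p in candidates:                      # candidate prices, low to high
        lose_all = 1.0
        for h in histories:
            lose_all *= 1.0 - win_probability(h, p)
        if 1.0 - lose_all >= threshold:       # P(win at least one auction)
            return p
    return None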
Law-governed peer-to-peer auctions | | BIBAK | Full-Text | 109-116 | |
Marcus Fontoura; Mihail Ionescu; Naftaly Minsky | |||
This paper proposes a flexible architecture for the creation of Internet
auctions. It allows the custom definition of the auction parameters, and
provides a decentralized control of the auction process. Auction policies are
defined as laws in the Law Governed Interaction (LGI) paradigm. Each of these
laws specifies not only the auction algorithm itself (e.g. open-cry, Dutch,
etc.) but also how to handle the other parameters usually involved in
online auctions, such as certification, auditing, and treatment of
complaints. LGI is used to enforce the rules established in the auction policy
within the agents involved in the process. After the agents find out about the
auctions, they interact in a peer-to-peer communication protocol, reducing the
role of the centralized auction room to an advertising registry, and taking
advantage of the distributed nature of the Internet to conduct the auction. The
paper presents an example of an auction law, illustrating the use of the
proposed architecture. Keywords: distributed enforcement, distributed systems, law governed interaction,
online auctions |
Paid placement strategies for internet search engines | | BIBAK | Full-Text | 117-123 | |
Hemant K. Bhargava; Juan Feng | |||
Internet search engines and comparison-shopping services have recently begun
implementing a paid placement strategy, where some content providers are given
prominent positioning in return for a placement fee. This bias generates
placement revenues but creates a disutility to users, thus reducing user-based
revenues. We formulate the search engine design problem as a tradeoff between
these two types of revenues. We demonstrate that the optimal placement strategy
depends on the relative benefits (to providers) and disutilities (to users) of
paid placement. We compute the optimal placement fee, characterize the optimal
bias level, and analyze sensitivity of the placement strategy to various
factors. In the optimal paid placement strategy, the placement revenues are set
below the monopoly level due to its negative impact on advertising revenues. An
increase in the search engine's quality of service allows it to improve profits
from paid placement, moving it closer to the ideal. However, an increase in the
value-per-user motivates the gatekeeper to increase market share by further
reducing its reliance on paid placement and the fraction of paying providers. Keywords: bias, information gatekeepers, paid placement, promotion, search engines |
Parallel crawlers | | BIBAK | Full-Text | 124-135 | |
Junghoo Cho; Hector Garcia-Molina | |||
In this paper we study how we can design an effective parallel crawler. As
the size of the Web grows, it becomes imperative to parallelize a crawling
process, in order to finish downloading pages in a reasonable amount of time.
We first propose multiple architectures for a parallel crawler and identify
fundamental issues related to parallel crawling. Based on this understanding,
we then propose metrics to evaluate a parallel crawler, and compare the
proposed architectures using 40 million pages collected from the Web. Our
results clarify the relative merits of each architecture and provide a good
guideline on when to adopt which architecture. Keywords: parallelization, web crawler, web spider |
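One concrete way to realize the kind of URL partitioning that a parallel crawler needs is sketched below (an assumption-laden illustration, not the paper's implementation): each discovered URL is assigned to exactly one crawler process by hashing its host name, and URLs owned by another process are forwarded to it.

# Toy sketch of site-hash URL partitioning between crawler processes.
import hashlib
from urllib.parse import urlparse

def owner(url: str, n_crawlers: int) -> int:
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % n_crawlers

def route(discovered_url: str, my_id: int, n_crawlers: int, enqueue, forward):
    """Keep URLs we own; hand the rest to the responsible crawler."""
    target = owner(discovered_url, n_crawlers)
    if target == my_id:
        enqueue(discovered_url)
    else:
        forward(target, discovered_url)   # inter-crawler URL exchange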
Optimal crawling strategies for web search engines | | BIBA | Full-Text | 136-147 | |
J. L. Wolf; M. S. Squillante; P. S. Yu; J. Sethuraman; L. Ozsen | |||
Web Search Engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: The frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient -- crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach. |
Accelerated focused crawling through online relevance feedback | | BIBAK | Full-Text | 148-159 | |
Soumen Chakrabarti; Kunal Punera; Mallela Subramanyam | |||
The organization of HTML into a tag tree structure, which is rendered by
browsers as roughly rectangular regions with embedded text and HREF links,
greatly helps surfers locate and click on links that best satisfy their
information need. Can an automatic program emulate this human behavior and
thereby learn to predict the relevance of an unseen HREF target page w.r.t. an
information need, based on information limited to the HREF source page? Such a
capability would be of great interest in focused crawling and resource
discovery, because it can fine-tune the priority of unvisited URLs in the crawl
frontier, and reduce the number of irrelevant pages which are fetched and
discarded. Keywords: document object model, focused crawling, reinforcement learning |
Fluid annotations through open hypermedia: using and extending emerging web standards | | BIBAK | Full-Text | 160-171 | |
Niels Olof Bouvin; Polle T. Zellweger; Kaj Grønbæk; Jock D. Mackinlay | |||
The Fluid Documents project has developed various research prototypes that
show that powerful annotation techniques based on animated typographical
changes can help readers utilize annotations more effectively. Our
recently-developed Fluid Open Hypermedia prototype supports the authoring and
browsing of fluid annotations on third-party Web pages. This prototype is an
extension of the Arakne Environment, an open hypermedia application that can
augment Web pages with externally stored hypermedia structures. This paper
describes how various Web standards, including DOM, CSS, XLink, XPointer, and
RDF, can be used and extended to support fluid annotations. Keywords: RDF, XLink, XPointer, annotations, Annotea, fluid documents, web
augmentation with open hypermedia |
Hunter gatherer: interaction support for the creation and management of within-web-page collections | | BIBAK | Full-Text | 172-181 | |
m. c. schraefel; Yuxiang Zhu; David Modjeska; Daniel Wigdor; Shengdong Zhao | |||
Hunter Gatherer is an interface that lets Web users carry out three main
tasks: (1) collect components from within Web pages; (2) represent those
components in a collection; (3) edit those component collections. Our research
shows that while the practice of making collections of content from within Web
pages is common, it is not frequent, due in large part to poor interaction
support in existing tools. We engaged with users in task analysis as well as
iterative design reviews in order to understand the interaction issues that are
part of within-Web-page collection making and to design an interaction that
would support that process.
We report here on that design development, as well as on the evaluations of the tool that evolved from that process, and the future work stemming from these results, in which our critical question is: what happens to users' perceptions and expectations of web-based information (their web-based information management practices) when they can treat this information as harvestable, recontextualizable data, rather than as fixed pages? Keywords: attention, collections, information gathering and management, transclusions,
web-based interaction design |
Model checking cobweb protocols for verification of HTML frames behavior | | BIBAK | Full-Text | 182-190 | |
David Stotts; Jaime Navon | |||
HTML documents composed of frames can be difficult to write correctly. We
demonstrate a technique that can be used by authors manually creating HTML
documents (or by document editors) to verify that complex frame construction
exhibits the intended behavior when browsed. The method is based on model
checking (an automated program verification technique), and on temporal logic
specifications of expected frames behavior. We show how to model the HTML
frames source as a CobWeb protocol, related to the Trellis model of hypermedia
documents. We show how to convert the CobWeb protocol to input for a model
checker, and discuss several ways for authors to create the necessary behavior
specifications. Our solution allows Web documents to be built containing a
large number of frames and content pages interacting in complex ways. We expect
such Web structures to be more useful in "literary" hypermedia than for Web
"sites" used as interfaces to organizational information or databases. Keywords: HTML, browsing semantics, formal semantics, frames, literary hypertext,
model checking, temporal logic, verification |
Implementing physical hyperlinks using ubiquitous identifier resolution | | BIBAK | Full-Text | 191-199 | |
Tim Kindberg | |||
Identifier resolution is presented as a way to link the physical world with
virtual Web resources. In this paradigm, designed to support nomadic users, the
user employs a handheld, wirelessly connected, sensor-equipped device to read
identifiers associated with physical entities. The identifiers are resolved
into virtual resources or actions related to the physical entities -- as though
the user 'clicked on a physical hyperlink'. We have integrated identifier
resolution with the Web so that it can be deployed as ubiquitously as the Web,
in the infrastructure and on wirelessly connected handheld devices. We enable
users to capture resolution services and applications as Web resources in their
local context. We use the Web to invoke resolution services, with a model of
'physical' Web form-filling. We propose a scheme for binding identifiers to
resources, to promote services and applications linking the physical and
virtual worlds. Keywords: identifier resolution, mobile computing, nomadic computing, physical
hyperlinks, ubiquitous computing |
Profiles for the situated web | | BIBAK | Full-Text | 200-209 | |
Lalitha Suryanarayana; Johan Hjelm | |||
The World Wide Web is evolving into a medium that will soon make it possible
to conceive and implement situation-aware services. A situation-aware or
situated web application is one that provides the user with an experience
(content, interaction and presentation) tailored to his/her current
situation. This requires the facts and opinions regarding the context to be
communicated to the server by means of a profile, which is then applied against
the description of the application objects at the server in order to generate
the required experience. This paper discusses a profiles view of the situated
web architecture and analyzes the key technologies and capabilities that enable
them. We conclude that trusted frameworks, in which rich vocabularies describing
users and their context, applications and documents are combined with rules for
processing them, are critical elements of such architectures. Keywords: CC/PP, XML, profiles, situated-aware applications, vocabulary, web
architecture |
The social contract core | | BIBAK | Full-Text | 210-220 | |
James H. Kaufman; Stefan Edlund; Daniel A. Ford; Calvin Powers | |||
The information age has brought with it the promise of unprecedented
economic growth based on the efficiencies made possible by new technology. This
same greater efficiency has left society with less and less time to adapt to
technological progress. Perhaps the greatest cost of this progress is the
threat to privacy we all face from unconstrained exchange of our personal
information. In response to this threat, the World Wide Web Consortium has
introduced the "Platform for Privacy Preferences" (P3P) to allow sites to
express policies in machine-readable form and to expose these policies to site
visitors [1]. However, today P3P does not protect the privacy of individuals,
nor does its implementation empower communities or groups to negotiate and
establish standards of behavior. We propose a privacy architecture we call the
Social Contract Core (SCC), designed to speed the establishment of new "Social
Contracts" needed to protect private data. The goal of SCC is to empower
communities, speed the "socialization" of new technology, and encourage the
rapid access to, and exchange of, information. Addressing these issues is
essential, we feel, to both liberty and economic prosperity in the information
age [2]. Keywords: P3P, privacy, social contract |
Webformulate: a web-based visual continual query system | | BIBAK | Full-Text | 221-231 | |
Jennifer Leopold; Meg Heimovics; Tyler Palmer | |||
Today there is a plethora of data accessible via the Internet. The Web has
greatly simplified the process of searching for, accessing, and sharing
information. However, a considerable amount of Internet-distributed data still
goes unnoticed and unutilized, particularly in the case of frequently-updated,
Internet-distributed databases. In this paper we give an overview of
WebFormulate, a Web-based visual continual query system that addresses the
problems associated with formulating temporal ad hoc analyses over networks of
heterogeneous, frequently-updated data sources. The main distinction between
this system and existing Internet facilities to retrieve information and
assimilate it into computations is that WebFormulate provides the necessary
facilities to perform continual queries, developing and maintaining dynamic
links such that Web-based computations and reports automatically maintain
themselves. A further distinction is that this system is specifically designed
for users of spreadsheet-level ability, rather than professional programmers. Keywords: continual query, visual programming language, visual query system |
A flexible learning system for wrapping tables and lists in HTML documents | | BIBAK | Full-Text | 232-241 | |
William W. Cohen; Matthew Hurst; Lee S. Jensen | |||
A program that makes an existing website look like a database is called a
wrapper. Wrapper learning is the problem of learning website wrappers from
examples. We present a wrapper-learning system called WL2 that can
exploit several different representations of a document. Examples of such
different representations include DOM-level and token-level representations, as
well as two-dimensional geometric views of the rendered page (for tabular data)
and representations of the visual appearance of text as it will be rendered.
Additionally, the learning system is modular, and can be easily adapted to new
domains and tasks. The learning system described is part of an
"industrial-strength" wrapper management system that is in active use at
WhizBang Labs. Controlled experiments show that the learner has broader
coverage and a faster learning rate than earlier wrapper-learning systems. Keywords: canopy, learning, record linkage, reference matching |
A machine learning based approach for table detection on the web | | BIBAK | Full-Text | 242-250 | |
Yalin Wang; Jianying Hu | |||
The table is a commonly used presentation scheme, especially for describing
relational information. However, table understanding remains an open problem.
In this paper, we consider the problem of table detection in web documents. Its
potential applications include web mining, knowledge management, and web
content summarization and delivery to narrow-bandwidth devices. We describe a
machine learning based approach to classify each given table entity as either
genuine or non-genuine. Various features reflecting the layout as well as
content characteristics of tables are studied.
In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf <TABLE> elements. Keywords: decision tree, information retrieval, layout analysis, machine learning,
support vector machine, table detection |
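As a rough illustration of the classification setup described above (the feature names here are invented placeholders, not the paper's feature set), each table can be turned into a small layout/content feature vector and fed to a decision tree, one of the learners suggested by the keywords.

# Hypothetical sketch: layout and content features for genuine/non-genuine
# table classification, trained with a decision tree.
from sklearn.tree import DecisionTreeClassifier

def features(table):
    rows = table["rows"]                          # list of lists of cell strings
    cells = [c for r in rows for c in r]
    n_numeric = sum(c.replace('.', '', 1).isdigit() for c in cells)
    return [len(rows),                            # layout: row count
            max((len(r) for r in rows), default=0),   # layout: max columns
            n_numeric / max(len(cells), 1)]       # content: fraction numeric

def train(labeled_tables):
    """labeled_tables: iterable of (table_dict, label) pairs."""
    X = [features(t) for t, _ in labeled_tables]
    y = [label for _, label in labeled_tables]
    return DecisionTreeClassifier(max_depth=5).fit(X, y)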
The structure of broad topics on the web | | BIBAK | Full-Text | 251-262 | |
Soumen Chakrabarti; Mukul M. Joshi; Kunal Punera; David M. Pennock | |||
The Web graph is a giant social network whose properties have been measured
and modeled extensively in recent years. Most such studies concentrate on the
graph structure alone, and do not consider textual properties of the nodes.
Consequently, Web communities have been characterized purely in terms of graph
structure and not on page content. We propose that a topic taxonomy such as
Yahoo! or the Open Directory provides a useful framework for understanding the
structure of content-based clusters and communities. In particular, using a
topic taxonomy and an automatic classifier, we can measure the background
distribution of broad topics on the Web, and analyze the capability of recent
random walk algorithms to draw samples which follow such distributions. In
addition, we can measure the probability that a page about one broad topic will
link to another broad topic. Extending this experiment, we can measure how
quickly topic context is lost while walking randomly on the Web graph.
Estimates of this topic mixing distance may explain why a global PageRank is
still meaningful in the context of broad queries. In general, our measurements
may prove valuable in the design of community-specific crawlers and link-based
ranking systems. Keywords: social network analysis, web bibliometry |
A web-based resource migration protocol using WebDAV | | BIBAK | Full-Text | 263-271 | |
Michael Evans; Steven Furnell | |||
The web's hyperlinks are notoriously brittle, and break whenever a resource
migrates. One solution to this problem is a transparent resource migration
mechanism, which separates a resource's location from its identity, and helps
provide referential integrity. However, although several such mechanisms have
been designed, they have not been widely adopted, due largely to a lack of
compliance with current web standards. In addition, these mechanisms must be
updated manually whenever a resource migrates, limiting their effectiveness for
large web sites. Recently, however, new web protocols such as WebDAV (Web
Distributed Authoring and Versioning) have emerged, which extend the HTTP
protocol and provide a new level of control over web resources. In this paper,
we show how we have used these protocols in the design of a new Resource
Migration Protocol (RMP), which enables transparent resource migration across
standard web servers. The RMP works with a new resource migration mechanism we
have developed called the Resource Locator Service (RLS), and is fully
backwards compatible with the web's architecture, enabling all web servers and
all web content to be involved in the migration process. We describe the
protocol and the new RLS in full, together with a prototype implementation and
demonstration applications that we have developed. The paper concludes by
presenting performance data taken from the prototype that show how the RLS will
scale well beyond the size of today's web. Keywords: WebDAV, link rot, referential integrity, resource locator service, resource
migration protocol, web |
A comparison of case-based reasoning approaches | | BIBAK | Full-Text | 272-280 | |
Emilia Mendes; Nile Mosley; Ian Watson | |||
Over the years software engineering researchers have suggested numerous
techniques for estimating development effort. These techniques have been
classified mainly as algorithmic, machine learning and expert judgement.
Several studies have compared the prediction accuracy of those techniques, with
emphasis placed on linear regression, stepwise regression, and Case-based
Reasoning (CBR). To date no converging results have been obtained and we
believe they may be influenced by the use of the same CBR configuration.
The objective of this paper is twofold. First, to describe the application of case-based reasoning for estimating the effort for developing Web hypermedia applications. Second, to compare the prediction accuracy of different CBR configurations, using two Web hypermedia datasets. Results show that for both datasets the best estimations were obtained with weighted Euclidean distance, using either one analogy (dataset 1) or 3 analogies (dataset 2). We suggest therefore that case-based reasoning is a candidate technique for effort estimation and, with the aid of an automated environment, can be applied to Web hypermedia development effort prediction. Keywords: case-based reasoning, prediction models, web effort prediction, web
hypermedia, web hypermedia metrics |
Aliasing on the world wide web: prevalence and performance implications | | BIBAK | Full-Text | 281-292 | |
Terence Kelly; Jeffrey Mogul | |||
Aliasing occurs in Web transactions when requests containing different URLs
elicit replies containing identical data payloads. Conventional caches
associate stored data with URLs and can therefore suffer redundant payload
transfers due to aliasing and other causes. Existing research literature,
however, says little about the prevalence of aliasing in user-initiated
transactions, or about redundant payload transfers in conventional Web cache
hierarchies.
This paper quantifies the extent of aliasing and the performance impact of URL-indexed cache management using a large client trace from WebTV Networks. Fewer than 5% of reply payloads are aliased (referenced via multiple URLs) but over 54% of successful transactions involve aliased payloads. Aliased payloads account for under 3.1% of the trace's "working set size" (sum of payload sizes) but over 36% of bytes transferred. For the WebTV workload, roughly 10% of payload transfers to browser caches and 23% of payload transfers to a shared proxy are redundant, assuming infinite-capacity conventional caches. Our analysis of a large proxy trace from Compaq Corporation yields similar results. URL-indexed caching does not entirely explain the large number of redundant proxy-to-browser payload transfers previously reported in the WebTV system. We consider other possible causes of redundant transfers (e.g., reply metadata and browser cache management policies) and discuss a simple hop-by-hop protocol extension that completely eliminates all redundant transfers, regardless of cause. Keywords: DTD, HTTP, WWW, Zipf's law, aliasing, cache hierarchies, caching, duplicate
suppression, duplicate transfer detection, hypertext transfer protocol,
performance analysis, redundant transfers, resource modification, world wide
web |
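The duplicate-suppression idea referenced in the keywords can be sketched as follows (a hypothetical illustration, not the specific protocol extension proposed in the paper): indexing cached payloads by content digest as well as by URL lets a cache recognize a payload it already holds under a different URL and skip the redundant transfer.

# Illustrative sketch only: a cache keyed by both URL and payload digest.
import hashlib

class DigestCache:
    def __init__(self):
        self.by_url = {}      # url -> digest
        self.by_digest = {}   # digest -> payload bytes

    def lookup(self, url):
        d = self.by_url.get(url)
        return self.by_digest.get(d) if d else None

    def store(self, url, payload: bytes):
        d = hashlib.sha1(payload).hexdigest()
        self.by_url[url] = d
        self.by_digest.setdefault(d, payload)   # aliased URLs share one copy

    def have_digest(self, d: str) -> bool:
        """With a hop-by-hop digest exchange, the sender can omit the body."""
        return d in self.by_digest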
Flash crowds and denial of service attacks: characterization and implications for CDNs and web sites | | BIBAK | Full-Text | 293-304 | |
Jaeyeon Jung; Balachander Krishnamurthy; Michael Rabinovich | |||
The paper studies two types of events that often overload Web sites to the
point where their services are degraded or disrupted entirely -- flash events
(FEs) and denial of service attacks (DoS). The former are created by legitimate
requests and the latter contain malicious requests whose goal is to subvert the
normal operation of the site. We study the properties of both types of events
with special attention to the characteristics that distinguish the two.
Identifying these characteristics allows a formulation of a strategy for Web
sites to quickly discard malicious requests. We also show that some content
distribution networks (CDNs) may not provide the desired level of protection to
Web sites against flash events. We therefore propose an enhancement to CDNs
that offers better protection and use trace-driven simulations to study the
effect of our enhancement on CDNs and Web sites. Keywords: content distribution network performance, denial of service attack, flash
crowd, web workload characterization |
Improving web performance by client characterization driven server adaptation | | BIBAK | Full-Text | 305-316 | |
Balachander Krishnamurthy; Craig E. Wills | |||
We categorize the set of clients communicating with a server on the Web
based on information that can be determined by the server. The Web server uses
the information to direct tailored actions. Users with poor connectivity may
choose not to stay at a Web site if it takes a long time to receive a page,
even if the Web server at the site is not the bottleneck. Retaining such
clients may be of interest to a Web site. Better connected clients can receive
enhanced representations of Web pages, such as with higher quality images.
We explore a variety of considerations that could be used by a Web server in characterizing a client. Once a client is characterized as poor or rich, the server can deliver altered content, alter how content is delivered, alter policy and caching decisions, or decide when to redirect the client to a mirror site. We also use network-aware client clustering techniques to provide a coarser level of client categorization and use it to categorize subsequent clients from that cluster for which a client-specific categorization is not available. Our results for client characterization and applicable server actions are derived from a real, recent, and diverse set of Web server logs. Our experiments demonstrate that a relatively simple characterization policy can classify poor clients such that these clients subsequently make the majority of badly performing requests to a Web server. This policy is also stable in terms of clients staying in the same class for a large portion of the analysis period. Client clustering can significantly help in initially classifying clients for which no previous information about the client is known. We also show that different server actions can be applied to a significant number of request sequences with poor performance. Keywords: client characterization, client connectivity, server adaptation |
Extracting query modifications from nonlinear SVMs | | BIBAK | Full-Text | 317-324 | |
Gary W. Flake; Eric J. Glover; Steve Lawrence; C. Lee Giles | |||
When searching the WWW, users often desire results restricted to a
particular document category. Ideally, a user would be able to filter results
with a text classifier to minimize false positive results; however, current
search engines allow only simple query modifications. To automate the process
of generating effective query modifications, we introduce a sensitivity
analysis-based method for extracting rules from nonlinear support vector
machines. The proposed method allows the user to specify a desired precision
while attempting to maximize the recall. Our method performs several levels of
dimensionality reduction and is vastly faster than searching the combination
feature space; moreover, it is very effective on real-world data. Keywords: query modification, rule extraction, sensitivity analysis, support vector
machine |
Probabilistic query expansion using query logs | | BIBAK | Full-Text | 325-332 | |
Hang Cui; Ji-Rong Wen; Jian-Yun Nie; Wei-Ying Ma | |||
Query expansion has long been suggested as an effective way to resolve the
short query and word mismatching problems. A number of query expansion methods
have been proposed in traditional information retrieval. However, these
previous methods do not take into account the specific characteristics of web
searching; in particular, the availability of the large amount of user
interaction information recorded in web query logs. In this study, we
propose a new method for query expansion based on query logs. The central idea
is to extract probabilistic correlations between query terms and document terms
by analyzing query logs. These correlations are then used to select
high-quality expansion terms for new queries. The experimental results show
that our log-based probabilistic query expansion method can greatly improve the
search performance and has several advantages over other existing methods. Keywords: information retrieval, log mining, probabilistic model, query expansion,
search engine |
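A simplified sketch of the log-based idea above (assumed input format and estimator, not the authors' exact method): from query-log records that pair a query with clicked documents, estimate term-to-term correlations and use the highest-scoring document terms to expand a new query.

# Hedged illustration: P(document term | query term) from click-through pairs.
from collections import defaultdict

def term_correlations(sessions):
    """sessions: iterable of (query_terms, clicked_doc_terms) pairs."""
    pair_count = defaultdict(lambda: defaultdict(int))
    q_count = defaultdict(int)
    for q_terms, d_terms in sessions:
        for q in set(q_terms):
            q_count[q] += 1
            for d in set(d_terms):
                pair_count[q][d] += 1
    return {q: {d: c / q_count[q] for d, c in ds.items()}
            for q, ds in pair_count.items()}

def expand(query_terms, corr, k=3):
    """Add the k document terms most correlated with the query terms."""
    scores = defaultdict(float)
    for q in query_terms:
        for d, p in corr.get(q, {}).items():
            scores[d] += p
    ranked = sorted(scores, key=scores.get, reverse=True)
    return list(query_terms) + [t for t in ranked if t not in query_terms][:k]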
Expert agreement and content based reranking in a meta search environment using Mearf | | BIBAK | Full-Text | 333-344 | |
B. Uygar Oztekin; George Karypis; Vipin Kumar | |||
The recent increase in the number of search engines on the Web and the
availability of meta search engines that can query multiple search engines
make it important to find effective methods for combining results coming from
different sources. In this paper we introduce novel methods for reranking in a
meta search environment based on expert agreement and contents of the snippets.
We also introduce an objective way of evaluating different methods for ranking
search results that is based upon implicit user judgements. We incorporated our
methods and two variations of commonly used merging methods in our meta search
engine, Mearf, and carried out an experimental study using logs accumulated
over a period of twelve months. Our experiments show that the choice of the
method used for merging the output produced by different search engines plays a
significant role in the overall quality of the search results. In almost all
cases examined, results produced by some of the new methods introduced were
consistently better than the ones produced by traditional methods commonly used
in various meta search engines. These observations suggest that the proposed
methods can offer a relatively inexpensive way of improving the meta search
experience over existing methods. Keywords: collection fusion, expert agreement, merging, meta search, reranking |
YouServ: a web-hosting and content sharing tool for the masses | | BIBAK | Full-Text | 345-354 | |
Roberto J. Bayardo, Jr.; Rakesh Agrawal; Daniel Gruhl; Amit Somani | |||
YouServ is a system that allows its users to pool existing desktop computing
resources for high availability web hosting and file sharing. By exploiting
standard web and internet protocols (e.g. HTTP and DNS), YouServ does not
require those who access YouServ-published content to install special purpose
software. Because it requires minimal server-side resources and administration,
YouServ can be provided at a very low cost. We describe the design,
implementation, and a successful intranet deployment of the YouServ system, and
compare it with several alternatives. Keywords: decentralized systems, p2p, peer-to-peer networks, web hosting |
Dynamic coordination of information management services for processing dynamic web content | | BIBAK | Full-Text | 355-365 | |
In-Young Ko; Ke-Thia Yao; Robert Neches | |||
Dynamic Web content provides us with time-sensitive and continuously
changing data. To glean up-to-date information, users need to regularly browse,
collect and analyze this Web content. Without proper tool support this
information management task is tedious, time-consuming and error prone,
especially when the quantity of the dynamic Web content is large, when many
information management services are needed to analyze it, and when underlying
services/network are not completely reliable. This paper describes a
multi-level, lifecycle (design-time and run-time) coordination mechanism that
enables rapid, efficient development and execution of information management
applications that are especially useful for processing dynamic Web content.
Such a coordination mechanism brings dynamism to coordinating independent,
distributed information management services. Dynamic parallelism spawns/merges
multiple execution service branches based on available data, and dynamic
run-time reconfiguration coordinates service execution to overcome faulty
services and bottlenecks. These features enable information management
applications to be more efficient in handling content and format changes in Web
resources, and enable the applications to be evolved and adapted to process
dynamic Web content. Keywords: dynamic service coordination, dynamic web content, scalable component-based
software systems, semantic interoperability, web information management systems |
Price modeling in standards for electronic product catalogs based on XML | | BIBAK | Full-Text | 366-375 | |
Oliver Kelkar; Joerg Leukel; Volker Schmitz | |||
The fast spreading of electronic business-to-business procurement systems
has led to the development of new standards for the exchange of electronic
product catalogs (e-catalogs). E-catalogs contain various kinds of information
about products; price information is essential. Prices are used for buying
decisions and for the order transactions that follow. While simple price models are often
sufficient for the description of indirect goods (e.g. office supplies), other
goods and lines of business make higher demands. In this paper we examine what
price information is contained in commercial XML standards for the exchange of
product catalog data. For that purpose we bring the different implicit price
models of the examined catalog standards together and provide a generalized
model. Keywords: B2B, XML, e-business, e-catalog, e-procurement, pricing |
Choosing reputable servents in a P2P network | | BIBAK | Full-Text | 376-386 | |
Fabrizio Cornelli; Ernesto Damiani; Sabrina De Capitani di Vimercati; Stefano Paraboschi; Pierangela Samarati | |||
Peer-to-peer information sharing environments are increasingly gaining
acceptance on the Internet as they provide an infrastructure in which the
desired information can be located and downloaded while preserving the
anonymity of both requestors and providers. As recent experience with P2P
environments such as Gnutella shows, anonymity opens the door to possible
misuses and abuses by resource providers exploiting the network as a way to
spread tampered-with resources, including malicious programs, such as Trojan
Horses and viruses.
In this paper we propose an approach to P2P security where servents can keep track, and share with others, information about the reputation of their peers. Reputation sharing is based on a distributed polling algorithm by which resource requestors can assess the reliability of prospective providers before initiating the download. The approach nicely complements the existing P2P protocols and has a limited impact on current implementations. Furthermore, it keeps the current level of anonymity of requestors and providers, as well as that of the parties sharing their view on others' reputations. Keywords: P2P network, credibility, polling protocol, reputation |
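A bare-bones sketch of the polling step (hypothetical interfaces, not the paper's protocol): before downloading, a servent asks its peers for votes on the prospective provider and aggregates them into a reputation score.

# Illustrative only: ask(peer, provider_id) -> +1 (good), -1 (bad) or None.
def poll_reputation(provider_id, peers, ask, quorum=5):
    votes = [v for p in peers if (v := ask(p, provider_id)) is not None]
    if len(votes) < quorum:
        return None                      # not enough evidence to decide
    return sum(votes) / len(votes)       # in [-1, 1]; download if high enough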
Certified email with a light on-line trusted third party: design and implementation | | BIBA | Full-Text | 387-395 | |
Martín Abadi; Neal Glew | |||
This paper presents a new protocol for certified email. The protocol aims to combine security, scalability, easy implementation, and viable deployment. The protocol relies on a light on-line trusted third party; it can be implemented without any special software for the receiver beyond a standard email reader and web browser, and does not require any public-key infrastructure. |
Abstracting application-level web security | | BIBAK | Full-Text | 396-407 | |
David Scott; Richard Sharp | |||
Application-level web security refers to vulnerabilities inherent in the
code of a web-application itself (irrespective of the technologies in which it
is implemented or the security of the web-server/back-end database on which it
is built). In the last few months application-level vulnerabilities have been
exploited with serious consequences: hackers have tricked e-commerce sites into
shipping goods for no charge, user-names and passwords have been harvested and
confidential information (such as addresses and credit-card numbers) has been
leaked.
In this paper we investigate new tools and techniques which address the problem of application-level web security. We (i) describe a scalable structuring mechanism facilitating the abstraction of security policies from large web-applications developed in heterogeneous multi-platform environments; (ii) present a tool which assists programmers in developing secure applications which are resilient to a wide range of common attacks; and (iii) report results and experience arising from our implementation of these techniques. Keywords: application-level web security, component-based design, security policy
description language |
Probabilistic question answering on the web | | BIBAK | Full-Text | 408-419 | |
Dragomir Radev; Weiguo Fan; Hong Qi; Harris Wu; Amardeep Grewal | |||
Web-based search engines such as Google and NorthernLight return documents
that are relevant to a user query, not answers to user questions. We have
developed an architecture that augments existing search engines so that they
support natural language question answering. The process entails five steps:
query modulation, document retrieval, passage extraction, phrase extraction,
and answer ranking. In this paper we describe some probabilistic approaches to
the last three of these stages. We show how our techniques apply to a number of
existing search engines and we also present results contrasting three different
methods for question answering. Our algorithm, probabilistic phrase reranking
(PPR), which uses proximity and question type features, achieves a total reciprocal
document rank of 0.20 on the TREC 8 corpus. Our techniques have been implemented
as a Web-accessible system, called NSIR. Keywords: answer extraction, answer selection, information retrieval, natural language
processing, query modulation, question answering, search engines |
Searching with numbers | | BIBA | Full-Text | 420-431 | |
Rakesh Agrawal; Ramakrishnan Srikant | |||
A large fraction of the useful web comprises specification documents that largely consist of (attribute name, numeric value) pairs embedded in text. Examples include product information, classified advertisements, resumes, etc. The approach taken in the past to search these documents by first establishing correspondences between values and their names has achieved limited success because of the difficulty of extracting this information from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has "low reflectivity", we can do effective search even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high reflectivity datasets. Finally, we validate our approach by showing that we get high precision in our answers on real datasets from a variety of domains. |
Evaluating strategies for similarity search on the web | | BIBAK | Full-Text | 432-442 | |
Taher H. Haveliwala; Aristides Gionis; Dan Klein; Piotr Indyk | |||
Finding pages on the Web that are similar to a query page (Related Pages) is
an important component of modern search engines. A variety of strategies have
been proposed for answering Related Pages queries, but comparative evaluation
by user studies is expensive, especially when large strategy spaces must be
searched (e.g., when tuning parameters). We present a technique for
automatically evaluating strategies using Web hierarchies, such as Open
Directory, in place of user feedback. We apply this evaluation methodology to a
mix of document representation strategies, including the use of text,
anchor-text, and links. We discuss the relative advantages and disadvantages of
the various approaches examined. Finally, we describe how to efficiently
construct a similarity index out of our chosen strategies, and provide sample
results from our index. Keywords: evaluation, open directory project, related pages, search, similarity search |
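The evaluation idea above can be sketched in a few lines (an illustrative approximation, not the authors' exact metric): treat the Open Directory category of each query page as ground truth and score a related-pages strategy by how many of its top results share that category.

# Hedged sketch: category-match precision@k against a topic hierarchy.
def precision_at_k(strategy, odp_category, k=10):
    """strategy: url -> ranked list of similar urls;
    odp_category: url -> ground-truth category label."""
    scores = []
    for url, cat in odp_category.items():
        results = strategy(url)[:k]
        hits = sum(1 for r in results if odp_category.get(r) == cat)
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0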
The Yin/Yang web: XML syntax and RDF semantics | | BIBAK | Full-Text | 443-453 | |
Peter Patel-Schneider; Jérôme Siméon | |||
XML is the W3C standard document format for writing and exchanging
information on the Web. RDF is the W3C standard model for describing the
semantics and reasoning about information on the Web. Unfortunately, RDF and
XML -- although very close to each other -- are based on two different
paradigms. We argue that in order to lead the Semantic Web to its full
potential, the syntax and the semantics of information need to work together.
To this end, we develop a model-theoretic semantics for the XML XQuery 1.0 and
XPath 2.0 Data Model, which provides a unified model for both XML and RDF. This
unified model can serve as the basis for Web applications that deal with both
data and semantics. We illustrate the use of this model on a concrete
information integration scenario. Our approach enables each side of the fence
to benefit from the other, notably, we show how the RDF world can take
advantage of XML query languages, and how the XML world can take advantage of
the reasoning capabilities available for RDF. Keywords: RDF, XML, data models, model theory, semantic web |
Unparsing RDF/XML | | BIBAK | Full-Text | 454-461 | |
Jeremy J. Carroll | |||
It is difficult to serialize an RDF graph as a humanly readable RDF/XML
document. This paper describes the approach taken in Jena 1.2, in which a
design pattern of guarded procedures invoked using top down recursive descent
is used. Each procedure corresponds to a grammar rule; the guard makes the
choice about the applicability of the production. This approach is seen to
correspond closely to the design of an LL(k) parser, and a theoretical
justification of this correspondence is found in universal algebra. Keywords: RDF, XML, generation, grammar, parsing, universal algebra, unparsing |
Authoring and annotation of web pages in CREAM | | BIBAK | Full-Text | 462-473 | |
Siegfried Handschuh; Steffen Staab | |||
Richly interlinked, machine-understandable data constitute the basis for the
Semantic Web. We provide a framework, CREAM, that allows for the creation of
metadata. While the annotation mode of CREAM allows metadata to be created for
existing web pages, the authoring mode lets authors create metadata -- almost
for free -- while putting together the content of a page.
As a particularity of our framework, CREAM allows the creation of relational metadata, i.e. metadata that instantiate interrelated definitions of classes in a domain ontology rather than a comparatively rigid template-like schema such as Dublin Core. We discuss some of the requirements one has to meet when developing such an ontology-based framework, e.g. the integration of a metadata crawler, inference services, document management and a meta-ontology, and describe its implementation, viz. Ont-O-Mat, a component-based, ontology-driven Web page authoring and annotation tool. Keywords: RDF, annotation, metadata, semantic web |
An incremental XSLT transformation processor for XML document manipulation | | BIBAK | Full-Text | 474-485 | |
Lionel Villard; Nabil Layaïda | |||
In this paper, we present an incremental transformation framework called
incXSLT. This framework has been experimented for the XSLT language defined at
the World Wide Web Consortium. With the currently available tools, designing
XML content and transformation sheets is an inefficient, tedious and
error-prone experience. Incremental transformation processors such as incXSLT
represent a better alternative to help in the design of both the content and
the transformation sheets. We believe that such frameworks are a first step
toward fully interactive transformation-based authoring environments. Keywords: XML, XSLT, authoring tools, incremental transformations |
An event-condition-action language for XML | | BIBAK | Full-Text | 486-495 | |
James Bailey; Alexandra Poulovassilis; Peter T. Wood | |||
XML repositories are now a widespread means for storing and exchanging
information on the Web. As these repositories become increasingly used in
dynamic applications such as e-commerce, there is a rapidly growing need for a
mechanism to incorporate reactive functionality in an XML setting.
Event-condition-action (ECA) rules are a technology from active databases and
are a natural method for supporting such functionality. ECA rules can be used
for activities such as automatically enforcing document constraints,
maintaining repository statistics, and facilitating publish/subscribe
applications. An important question associated with the use of ECA rules is
how to statically predict their run-time behaviour. In this paper, we define a
language for ECA rules on XML repositories. We then investigate methods for
analysing the behaviour of a set of ECA rules, a task which has added
complexity in this XML setting compared with conventional active databases. Keywords: XML, XML repositories, event-condition-action rules, reactive functionality,
rule analysis |
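As a rough illustration of the event-condition-action idea (not the rule language defined in the paper), the sketch below registers a rule whose event is the insertion of an element, whose condition tests the inserted node, and whose action maintains a repository statistic; the class names and the trivial dispatcher are assumptions.

    # A minimal sketch of ECA rules over an XML document.  The rule
    # representation and the matching logic are illustrative assumptions.
    import xml.etree.ElementTree as ET

    class ECARule:
        def __init__(self, event, condition, action):
            self.event = event          # e.g. ("insert", "order")
            self.condition = condition  # predicate over (document, node)
            self.action = action        # function applied to (document, node)

    def fire(rules, doc, event, node):
        """Run every rule whose event matches and whose condition holds."""
        for rule in rules:
            if rule.event == event and rule.condition(doc, node):
                rule.action(doc, node)

    # Example: keep a running count of <order> elements in a statistics element.
    doc = ET.fromstring('<repo><stats orders="0"/></repo>')

    def cond(doc, node):
        return node.tag == 'order'

    def action(doc, node):
        stats = doc.find('stats')
        stats.set('orders', str(int(stats.get('orders')) + 1))

    rules = [ECARule(("insert", "order"), cond, action)]

    new_order = ET.SubElement(doc, 'order')       # simulate an insertion event
    fire(rules, doc, ("insert", "order"), new_order)
    print(ET.tostring(doc, encoding='unicode'))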
Fast and efficient client-side adaptivity for SVG | | BIBAK | Full-Text | 496-507 | |
Kim Marriott; Bernd Meyer; Laurent Tardif | |||
The Scalable Vector Graphics format SVG is already substantially improving
graphics delivery on the web, but some important issues still remain to be
addressed. In particular, SVG does not support client-side adaptation of
documents to different viewing conditions, such as varying screen sizes, style
preferences or different device capabilities. Based on our earlier work, we show
how SVG can be extended with constraint-based specification of document layout
to augment it with adaptive capabilities. The core of our proposal is to
include one-way constraints into SVG, which offer more expressiveness than the
previously suggested class of linear constraints and at the same time require
substantially less computational effort. Keywords: CSVG, SVG, adaptivity, constraints, differential scaling, interaction,
scalable vector graphics, semantic zooming |
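The following sketch illustrates why one-way constraints are computationally cheap: each attribute is either a constant or a formula over other attributes, and a change to an input such as the viewport width is propagated in one direction only. The solver, variable names, and layout rules are illustrative assumptions, not the SVG extension proposed in the paper.

    # A minimal one-way constraint solver for adaptive layout.
    class OneWaySolver:
        def __init__(self):
            self.values = {}
            self.formulas = {}          # var -> (function, list of input vars)

        def set(self, var, value):
            self.values[var] = value

        def define(self, var, inputs, fn):
            self.formulas[var] = (fn, inputs)

        def evaluate(self):
            # Re-evaluate until a fixed point; fine for acyclic constraint graphs.
            changed = True
            while changed:
                changed = False
                for var, (fn, inputs) in self.formulas.items():
                    # Inputs not yet computed default to 0 on the first pass.
                    new = fn(*[self.values.get(i, 0) for i in inputs])
                    if self.values.get(var) != new:
                        self.values[var] = new
                        changed = True

    solver = OneWaySolver()
    solver.set('viewport_width', 800)
    # Title bar spans the viewport; body is centered at 80% of its width.
    solver.define('title_width', ['viewport_width'], lambda w: w)
    solver.define('body_width', ['viewport_width'], lambda w: 0.8 * w)
    solver.define('body_x', ['viewport_width', 'body_width'],
                  lambda w, b: (w - b) / 2)
    solver.evaluate()
    print(solver.values)

    solver.set('viewport_width', 320)   # adapt to a small screen
    solver.evaluate()
    print(solver.values)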
Web page scoring systems for horizontal and vertical search | | BIBAK | Full-Text | 508-516 | |
Michelangelo Diligenti; Marco Gori; Marco Maggini | |||
Page ranking is a fundamental step towards the construction of effective
search engines for both generic (horizontal) and focused (vertical) search.
Ranking schemes for horizontal search like the PageRank algorithm used by
Google operate on the topology of the graph, regardless of the page content. On
the other hand, the recent development of vertical portals (vortals) makes it
useful to adopt scoring systems that are focused on the topic and take the page
content into account.
In this paper, we propose a general framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. Finally, experimental results are given to assess the features of the proposed scoring systems with special emphasis on vertical search. Keywords: Focused PageRank, HITS, PageRank, random walks, web page scoring systems |
Topic-sensitive PageRank | | BIBAK | Full-Text | 517-526 | |
Taher H. Haveliwala | |||
In the original PageRank algorithm for improving the ranking of search-query
results, a single PageRank vector is computed, using the link structure of the
Web, to capture the relative "importance" of Web pages, independent of any
particular search query. To yield more accurate search results, we propose
computing a set of PageRank vectors, biased using a set of representative
topics, to capture more accurately the notion of importance with respect to a
particular topic. By using these (precomputed) biased PageRank vectors to
generate query-specific importance scores for pages at query time, we show that
we can generate more accurate rankings than with a single, generic PageRank
vector. For ordinary keyword search queries, we compute the topic-sensitive
PageRank scores for pages satisfying the query using the topic of the query
keywords. For searches done in context (e.g., when the search query is
performed by highlighting words in a Web page), we compute the topic-sensitive
PageRank scores using the topic of the context in which the query appeared. Keywords: PageRank, link structure, personalized search, search, search in context,
web graph |
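A minimal sketch of the biasing step described above: the teleportation mass of PageRank is spread only over pages belonging to a topic, one vector is precomputed per topic, and the vectors are mixed at query time. The toy graph, topic sets, and mixing weights are made up for illustration.

    # A sketch of topic-biased PageRank on a tiny made-up web graph.
    import numpy as np

    def biased_pagerank(adj, topic_pages, alpha=0.85, iters=100):
        n = len(adj)
        # Teleport vector: uniform over the topic's pages only.
        v = np.zeros(n)
        v[list(topic_pages)] = 1.0 / len(topic_pages)
        out_deg = adj.sum(axis=1)
        pr = np.full(n, 1.0 / n)
        for _ in range(iters):
            spread = np.zeros(n)
            for i in range(n):
                if out_deg[i]:
                    spread += pr[i] * adj[i] / out_deg[i]
                else:
                    spread += pr[i] * v            # dangling page: follow the bias
            pr = alpha * spread + (1 - alpha) * v
        return pr

    # 4-page graph: 0->1, 0->2, 1->2, 2->0, 3->2
    adj = np.array([[0, 1, 1, 0],
                    [0, 0, 1, 0],
                    [1, 0, 0, 0],
                    [0, 0, 1, 0]], dtype=float)

    sports_rank = biased_pagerank(adj, topic_pages={0, 1})
    news_rank = biased_pagerank(adj, topic_pages={2, 3})

    # At query time the per-topic scores are combined, weighted by how well
    # the query matches each topic (weights here are made up).
    query_weights = {'sports': 0.7, 'news': 0.3}
    combined = (query_weights['sports'] * sports_rank
                + query_weights['news'] * news_rank)
    print(combined)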
Improvement of HITS-based algorithms on web documents | | BIBAK | Full-Text | 527-535 | |
Longzhuang Li; Yi Shang; Wei Zhang | |||
In this paper, we present two ways to improve the precision of HITS-based
algorithms on Web documents. First, by analyzing the limitations of current
HITS-based algorithms, we propose a new weighted HITS-based method that assigns
appropriate weights to in-links of root documents. Then, we combine content
analysis with HITS-based algorithms and study the effects of four
representative relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set
of broad topic queries. Our experimental results show that our weighted
HITS-based method performs significantly better than Bharat's improved HITS
algorithm. When we combine our weighted HITS-based method or Bharat's HITS
algorithm with any of the four relevance scoring methods, the combined methods
are only marginally better than our weighted HITS-based method. Among the
four relevance scoring methods, there is no significant quality difference when
they are combined with a HITS-based algorithm. Keywords: HITS-based algorithms, information retrieval, relevance scoring methods |
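To make the notion of weighting in-links concrete, here is a generic weighted HITS iteration; the per-edge weights shown are placeholders, not the weighting scheme the paper proposes.

    # A sketch of HITS over a weighted link graph.
    import numpy as np

    def weighted_hits(W, iters=50):
        """W[i, j] = weight of the link from page i to page j (0 if no link)."""
        n = W.shape[0]
        auth = np.ones(n)
        hub = np.ones(n)
        for _ in range(iters):
            auth = W.T @ hub          # authority: weighted sum of in-link hubs
            hub = W @ auth            # hub: weighted sum of out-link authorities
            auth /= np.linalg.norm(auth)
            hub /= np.linalg.norm(hub)
        return auth, hub

    # 4 pages; the edge 3->2 is given a higher weight than the others.
    W = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 2, 0]], dtype=float)
    auth, hub = weighted_hits(W)
    print("authorities:", auth.round(3))
    print("hubs:       ", hub.round(3))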
Improvements in practical aspects of optimally scheduling web advertising | | BIBAK | Full-Text | 536-541 | |
Atsuyoshi Nakamura | |||
We address two issues concerning the practical aspects of the optimal
scheduling of web advertising proposed by Langheinrich et al. [5], a schedule
that maximizes the total number of click-throughs over all banner
advertisements. One is the problem of multi-impressions, in which two or more
banner ads are displayed at the same time. The other is inventory management,
which is important in order to prevent over-selling and maximize revenue. We
propose efficient methods that deal with these two issues. Keywords: electronic commerce, inventory management, on-line advertisement,
optimization, world-wide web |
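The kind of optimization involved can be sketched as a small linear program: choose display probabilities that maximize expected click-throughs subject to each ad's contracted impressions. The numbers and the exact constraint set below are illustrative assumptions, not those of Langheinrich et al. or of this paper.

    # A toy linear program for click-through-optimal ad scheduling.
    import numpy as np
    from scipy.optimize import linprog

    ctr = np.array([[0.02, 0.01],      # click-through rate of ad a for keyword k
                    [0.005, 0.03]])
    views = np.array([1000.0, 500.0])  # expected page views per keyword
    quota = np.array([600.0, 900.0])   # contracted impressions per ad

    K, A = ctr.shape
    # Variables x = d.flatten(); maximize sum_k views[k] * ctr[k, a] * d[k, a].
    c = -(views[:, None] * ctr).flatten()

    # Each keyword's display probabilities sum to 1.
    A_eq = np.zeros((K, K * A))
    for k in range(K):
        A_eq[k, k * A:(k + 1) * A] = 1.0
    b_eq = np.ones(K)

    # Each ad gets at most its contracted number of impressions.
    A_ub = np.zeros((A, K * A))
    for a in range(A):
        for k in range(K):
            A_ub[a, k * A + a] = views[k]
    b_ub = quota

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    print(res.x.reshape(K, A))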
A lightweight protocol for the generation and distribution of secure e-coupons | | BIBAK | Full-Text | 542-552 | |
Carlo Blundo; Stelvio Cimato; Annalisa De Bonis | |||
A form of advertisement which is becoming very popular on the web is based
on electronic coupon (e-coupon) distribution. E-coupons are the digital
analogue of paper coupons, which are used to provide customers with discounts or
gifts in order to encourage the purchase of certain products. Nowadays, the
potential of digital coupons has not been fully exploited on the web. This is
mostly due to the lack of "efficient" techniques to handle the generation and
distribution of e-coupons. In this paper we discuss models and protocols for
e-coupons satisfying a number of security requirements. Our protocol is
lightweight and preserves the privacy of the users, since it does not require
any registration phase. Keywords: accountability, e-commerce, e-coupons, security |
Protecting electronic commerce from distributed denial-of-service attacks | | BIBAK | Full-Text | 553-561 | |
José Brustoloni | |||
It is widely recognized that distributed denial-of-service (DDoS) attacks
can disrupt electronic commerce and cause large revenue losses. However,
effective defenses continue to be mostly unavailable. We describe and evaluate
VIPnet, a novel value-added network service for protecting e-commerce and other
transaction-based sites from DDoS attacks. In VIPnet, e-merchants pay Internet
Service Providers (ISPs) to carry the packets of the e-merchants' best clients
(called VIPs) in a privileged class of service (CoS), protected from
congestion, whether malicious or not, in the regular CoS. VIPnet rewards VIPs
with not only better quality of service, but also greater availability. Because
VIP rights are client- and server-specific, cannot be forged, are
usage-limited, and are only replenished after successful client transactions
(e.g., purchases), it is impractical for attackers to mount and sustain DDoS
attacks against an e-merchant's VIPs. VIPnet can be deployed incrementally and
does not require universal adoption. Experiments demonstrate VIPnet's benefits. Keywords: denial of service, electronic commerce, quality of service |
Using web structure for classifying and describing web pages | | BIBAK | Full-Text | 562-569 | |
Eric J. Glover; Kostas Tsioutsiouliklis; Steve Lawrence; David M. Pennock; Gary W. Flake | |||
The structure of the web is increasingly being used to improve organization,
search, and analysis of information on the web. For example, Google uses the
text in citing documents (documents that link to the target document) for
search. We analyze the relative utility of document text, and the text in
citing documents near the citation, for classification and description. Results
show that the text in citing documents, when available, often has greater
discriminative and descriptive power than the text in the target document
itself. The combination of evidence from a document and citing documents can
improve on either information source alone. Moreover, by ranking words and
phrases in the citing documents according to expected entropy loss, we are able
to accurately name clusters of web pages, even with very few positive examples.
Our results confirm, quantify, and extend previous research using web structure
in these areas, introducing new methods for classification and description of
pages. Keywords: SVM, anchortext, classification, cluster naming, entropy based feature
extraction, evaluation, web directory, web structure |
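The expected-entropy-loss ranking mentioned above can be sketched as follows: a term is scored by how much knowing its presence in the (citing) text reduces uncertainty about class membership. The toy documents and labels are made up.

    # Rank terms by expected entropy loss (information gain).
    import math

    def entropy(pos, neg):
        total = pos + neg
        if total == 0 or pos == 0 or neg == 0:
            return 0.0
        p = pos / total
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def expected_entropy_loss(docs, labels, term):
        """Class entropy minus expected entropy after splitting on the term."""
        pos = sum(labels)
        neg = len(labels) - pos
        prior = entropy(pos, neg)
        with_t = [l for d, l in zip(docs, labels) if term in d]
        without = [l for d, l in zip(docs, labels) if term not in d]
        def branch(ls):
            return entropy(sum(ls), len(ls) - sum(ls)) * len(ls) / len(labels)
        return prior - (branch(with_t) + branch(without))

    docs = [{"python", "tutorial"}, {"python", "snake"},
            {"cooking", "recipe"}, {"recipe", "tutorial"}]
    labels = [1, 1, 0, 0]                      # 1 = in the cluster of interest
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: expected_entropy_loss(docs, labels, t),
                    reverse=True)
    print(ranked[:3])                          # "python" should rank highest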
ChangeDetector: a site-level monitoring tool for the WWW | | BIBAK | Full-Text | 570-579 | |
Vijay Boyapati; Kristie Chevrier; Avi Finkel; Natalie Glance; Tom Pierce; Robert Stockton; Chip Whitmer | |||
This paper presents a new challenge for Web monitoring tools: to build a
system that can monitor entire web sites effectively. Such a system could
potentially be used to discover "silent news" hidden within corporate web
sites. Examples of silent news include a reorganization of a company's executive
team or the retirement of a product line. ChangeDetector, an
implemented prototype, addresses this challenge by incorporating a number of
machine learning techniques. The principal backend components of ChangeDetector
all rely on machine learning: intelligent crawling, page classification and
entity-based change detection. Intelligent crawling enables ChangeDetector to
selectively crawl the most relevant pages of very large sites. Classification
allows change detection to be filtered by topic. Entity extraction over changed
pages permits change detection to be filtered by semantic concepts, such as
person names, dates, addresses, and phone numbers. Finally, the front end
presents a flexible way for subscribers to interact with the database of
detected changes to pinpoint those changes most likely to be of interest. Keywords: URL monitoring, classification, information extraction, intelligent
crawling, machine learning |
Template detection via data mining and its applications | | BIBAK | Full-Text | 580-591 | |
Ziv Bar-Yossef; Sridhar Rajagopalan | |||
We formulate and propose the template detection problem, and suggest a
practical solution for it based on counting frequent item sets. We show that
the use of templates is pervasive on the web. We describe three principles,
which characterize the assumptions made by hypertext information retrieval (IR)
and data mining (DM) systems, and show that templates are a major source of
violation of these principles. As a consequence, basic "pure" implementations
of simple search algorithms coupled with template detection and elimination
show surprising increases in precision at all levels of recall. Keywords: data mining, hypertext, information retrieval, web searching |
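A minimal sketch of the frequent-item-set intuition: items (here, outgoing links) that recur on a large fraction of a site's pages are treated as template content and stripped before indexing. The support threshold and the choice of links as items are assumptions for illustration.

    # Detect and strip template items shared across pages of one site.
    from collections import Counter

    def detect_template_items(pages, support=0.8):
        """pages: list of sets of items (links, phrases, ...) from one site."""
        counts = Counter()
        for items in pages:
            counts.update(items)
        cutoff = support * len(pages)
        return {item for item, c in counts.items() if c >= cutoff}

    def strip_template(pages, template_items):
        return [items - template_items for items in pages]

    site = [
        {"/home", "/about", "/contact", "/news/article-1"},
        {"/home", "/about", "/contact", "/news/article-2"},
        {"/home", "/about", "/contact", "/products/widget"},
    ]
    template = detect_template_items(site)
    print("template:", sorted(template))
    print("content :", strip_template(site, template))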
RQL: a declarative query language for RDF | | BIBA | Full-Text | 592-603 | |
Gregory Karvounarakis; Sofia Alexaki; Vassilis Christophides; Dimitris Plexousakis; Michel Scholl | |||
Real-scale Semantic Web applications, such as Knowledge Portals and E-Marketplaces, require the management of large volumes of metadata, i.e., information describing the available Web content and services. Better knowledge about their meaning, usage, accessibility or quality will considerably facilitate the automated processing of Web resources. The Resource Description Framework (RDF) enables the creation and exchange of metadata as normal Web data. Although voluminous RDF descriptions are already appearing, sufficiently expressive declarative languages for querying both RDF descriptions and schemas are still missing. In this paper, we propose a new RDF query language called RQL. It is a typed functional language (a la OQL) and relies on a formal model for directed labeled graphs permitting the interpretation of superimposed resource descriptions by means of one or more RDF schemas. RQL adapts the functionality of semistructured/XML query languages to the peculiarities of RDF but, foremost, it enables uniform querying of both resource descriptions and schemas. We illustrate the RQL syntax, semantics and typing system by means of a set of example queries and report on the performance of our persistent RDF Store employed by the RQL interpreter. |
EDUTELLA: a P2P networking infrastructure based on RDF | | BIBAK | Full-Text | 604-615 | |
Wolfgang Nejdl; Boris Wolf; Changtao Qu; Stefan Decker; Michael Sintek; Ambjörn Naeve; Mikael Nilsson; Matthias Palmér; Tore Risch | |||
Metadata for the World Wide Web is important, but metadata for Peer-to-Peer
(P2P) networks is absolutely crucial. In this paper we discuss the open source
project Edutella which builds upon metadata standards defined for the WWW and
aims to provide an RDF-based metadata infrastructure for P2P applications,
building on the recently announced JXTA Framework. We describe the goals and
main services this infrastructure will provide and the architecture to connect
Edutella Peers based on exchange of RDF metadata. As the query service is one
of the core services of Edutella, upon which other services are built, we
specify in detail the Edutella Common Data Model (ECDM) as basis for the
Edutella query exchange language (RDF-QEL-i) and format implementing
distributed queries over the Edutella network. Finally, we briefly discuss
registration and mediation services, and introduce the prototype and
application scenario for our current Edutella-aware peers. Keywords: e-Learning, peer-to-peer, query languages, semantic web |
Translating XSLT programs to efficient SQL queries | | BIBAK | Full-Text | 616-626 | |
Sushant Jain; Ratul Mahajan; Dan Suciu | |||
We present an algorithm for translating XSLT programs into SQL. Our context
is that of virtual XML publishing, in which a single XML view is defined from a
relational database, and subsequently queried with XSLT programs. Each XSLT
program is translated into a single SQL query and run entirely in the database
engine. Our translation works for a large fragment of XSLT, which we define,
that includes descendant/ancestor axis, recursive templates, modes, parameters,
and aggregates. We put considerable effort into generating correct and efficient
SQL queries and describe several optimization techniques to achieve this
efficiency. We have tested our system on all 22 SQL queries of the TPC-H
database benchmark, which we represented in XSLT and then translated back to SQL
using our translator. Keywords: SQL, XML, XSLT, query optimization, translation, virtual view |
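As a toy illustration of the core idea (and only that), the sketch below maps a simple xsl:value-of path over a virtual, relationally backed XML view to a generated SQL query; the view mapping, path syntax, and output are invented, and the paper's translator handles a far larger XSLT fragment, including recursion, axes, modes, parameters, and aggregates.

    # Toy path-to-SQL mapping over a virtual XML view of a relational table.
    # Virtual view: /db/customer/{name,city} is backed by table customer.
    VIEW = {
        "/db/customer": ("customer", None),
        "/db/customer/name": ("customer", "name"),
        "/db/customer/city": ("customer", "city"),
    }

    def translate_value_of(match_path, select_paths, where=None):
        table, _ = VIEW[match_path]
        cols = [VIEW[match_path + "/" + p][1] for p in select_paths]
        sql = "SELECT %s FROM %s" % (", ".join(cols), table)
        if where:
            sql += " WHERE " + where
        return sql

    # <xsl:template match="/db/customer"> <xsl:value-of select="name"/> ...
    print(translate_value_of("/db/customer", ["name", "city"],
                             where="city = 'Paris'"))
    # -> SELECT name, city FROM customer WHERE city = 'Paris'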
Personalized pocket directories for mobile devices | | BIBAK | Full-Text | 627-638 | |
Doron Cohen; Michael Herscovici; Yael Petruschka; Yoëlle S. Maarek; Aya Soffer | |||
In spite of the increase in the availability of mobile devices in the last
few years, Web information is not yet as accessible from PDAs or WAP phones as
it is from the desktop. In this paper, we propose a solution for supporting one
of the most popular information discovery mechanisms, namely Web directory
navigation, from mobile devices. Our proposed solution consists of caching
enough information on the device itself in order to conduct most of the
navigation actions locally (with subsecond response time) while intermittently
communicating with the server to receive updates and additional data requested
by the user. The cached information is captured in a "directory capsule". The
directory capsule represents only the portion of the directory that is of
interest to the user in a given context and is sufficiently rich and consistent
to support the information needs of the user in disconnected mode. We define a
novel subscription model specifically geared for Web directories and for the
special needs of PDAs. This subscription model enables users to specify the
parts of the directory that are of interest to them as well as the preferred
granularity. We describe a mechanism for keeping the directory capsule in sync
over time with the Web directory and user subscription requests. Finally, we
present the Pocket Directory Browser for Palm powered computers that we have
developed. The pocket directory can be used to define, view and manipulate the
capsules that are stored on the Palm. We provide several usage examples of our
system on the Open Directory Project, one of the largest and most popular Web
directories. Keywords: hierarchical browsers, mobile devices, mobile search, personalization |
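The subscription model can be sketched roughly as follows: the user names directory categories and a depth, and the capsule keeps just those subtrees so navigation runs locally on the device. The directory data and the subscription representation are illustrative assumptions.

    # Build a directory capsule from (category, depth) subscriptions.
    DIRECTORY = {
        "Top/Computers": ["Top/Computers/Internet", "Top/Computers/Hardware"],
        "Top/Computers/Internet": ["Top/Computers/Internet/Protocols"],
        "Top/Computers/Internet/Protocols": [],
        "Top/Computers/Hardware": [],
        "Top/Sports": ["Top/Sports/Soccer"],
        "Top/Sports/Soccer": [],
    }

    def build_capsule(subscriptions):
        """subscriptions: list of (category, depth) pairs."""
        capsule = {}
        def add(category, depth):
            if category in capsule or category not in DIRECTORY:
                return
            capsule[category] = DIRECTORY[category]
            if depth > 0:
                for child in DIRECTORY[category]:
                    add(child, depth - 1)
        for category, depth in subscriptions:
            add(category, depth)
        return capsule

    # Subscribe to Computers two levels deep; Sports only at the top level.
    print(sorted(build_capsule([("Top/Computers", 2), ("Top/Sports", 0)])))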
A web middleware architecture for dynamic customization of content for wireless clients | | BIBAK | Full-Text | 639-650 | |
Jesse Steinberg; Joseph Pasquale | |||
We present a new Web middleware architecture that allows users to customize
their view of the Web for optimal interaction and system operation when using
non-traditional resource-limited client machines such as wireless PDAs
(personal digital assistants). Web Stream Customizers (WSC) are dynamically
deployable software modules and can be strategically located between client and
server to achieve improvements in performance, reliability, or security. An
important design feature is that Customizers provide two points of control in
the communication path between client and server, supporting adaptive
system-based and content-based customization. Our architecture exploits HTTP's
proxy capabilities, allowing Customizers to be seamlessly integrated with the
basic Web transaction model. We describe the WSC architecture and
implementation, and illustrate its use with three non-trivial, adaptive
Customizer applications that we have built. We show that the overhead in our
implementation is small and tolerable, and is outweighed by the benefits that
Customizers provide. Keywords: HTTP, middleware, mobile code, proxy, wireless |
Mobile streaming media CDN enabled by dynamic SMIL | | BIBAK | Full-Text | 651-661 | |
Takeshi Yoshimura; Yoshifumi Yonemoto; Tomoyuki Ohya; Minoru Etoh; Susie Wee | |||
In this paper, we present a mobile streaming media CDN (Content Delivery
Network) architecture in which content segmentation, request routing, pre-fetch
scheduling, and session handoff are controlled by SMIL (Synchronized Multimedia
Integration Language) modification. In this architecture, mobile clients simply
follow modified SMIL files downloaded from a streaming portal server; these
modifications enable multimedia content to be delivered to the mobile clients
from the best surrogates in the CDN. The key components of this architecture
are 1) content segmentation with SMIL modification, 2) on-demand rewriting of
URLs in SMIL, 3) pre-fetch scheduling based on timing information derived from
SMIL, 4) SMIL updates by SOAP (Simple Object Access Protocol) messaging for
session handoffs due to client mobility. We also introduce QoS control with a
network agent called an "RTP monitoring agent" to enable appropriate control of
media quality based on both network congestion and radio link conditions. The
current status of our prototyping on a mobile QoS testbed "MOBIQ" is reported
in this paper. We are currently designing the SOAP-based APIs (Application
Programmable Interfaces) needed for the mobile streaming media CDN and building
the CDN over the current testbed. Keywords: CDN, SMIL, mobile network, streaming media |
Learning to map between ontologies on the semantic web | | BIBAK | Full-Text | 662-673 | |
AnHai Doan; Jayant Madhavan; Pedro Domingos; Alon Halevy | |||
Ontologies play a prominent role on the Semantic Web. They make possible the
widespread publication of machine-understandable data, opening myriad
opportunities for automated information processing. However, because of the
Semantic Web's distributed nature, data on it will inevitably come from many
different ontologies. Information processing across ontologies is not possible
without knowing the semantic mappings between their elements. Manually finding
such mappings is tedious, error-prone, and clearly not possible at the Web
scale. Hence, the development of tools to assist in the ontology mapping
process is crucial to the success of the Semantic Web.
We describe GLUE, a system that employs machine learning techniques to find such mappings. Keywords: machine learning, ontology mapping, relaxation labeling, semantic web |
Vocabulary development for markup languages: a case study with maritime information | | BIBAK | Full-Text | 674-685 | |
Raphael Malyankar | |||
This paper describes the process of constructing a markup language for
maritime information from the starting point of ontology building. Ontology
construction from source materials in the maritime information domain is
outlined. The structure of the markup language is described in terms of XML
schemas and DTDs. A prototype application that uses the markup language is also
described. Keywords: XML, maritime information, ontologies |
A pragmatic application of the semantic web using SemTalk | | BIBAK | Full-Text | 686-692 | |
Christian Fillies; Gay Wood-Albrecht; Frauke Weichardt | |||
The Semantic Web is a new layer of the Internet that enables semantic
representation of the contents of existing web pages. Using common ontologies,
human users sketch out the most important facts in models that act as
intelligent whiteboards. Once models are broadcast to the Internet, new and
intelligent search engines, "ambient" intelligent devices and agents will be
able to exploit this knowledge network [1].
The main idea of SemTalk is to empower end users to contribute to the Semantic Web by offering an easy-to-use MS Visio-based graphical editor for creating RDF-like schemas and workflows. Since the modeled data is found by Microsoft's SmartTags, users can benefit from these Semantic Webs as part of their daily work with other Microsoft Office products such as Word, Excel or Outlook. SemTalk's graphically configurable meta model also extends the functionality of the Visio modeling tool: it makes it easy to configure Visio for different modeling worlds, such as Business Engineering and CASE methodologies, and these features can be applied to any other Visio drawings. This paper presents two applied uses of this technology. Ontology Project: department-wide information modeling at Credit Suisse Bank, with the main emphasis on the linguistic standardization of terms. Based on a common central glossary, local knowledge management teams were able to develop specialized models for their decentralized departments; as part of the knowledge management process, local glossaries were continually carried over into a common shared model. Business Process Management Project: distributed process modeling at Bausparkasse Deutscher Ring, a German financial institution, where several groups of students from the Technical University FH Brandenburg explored how to develop and apply an industry-specific Semantic Web to Business Process Modeling. Keywords: business process modeling, glossary and ontologies, semantic web |
Visualizing web site comparisons | | BIBAK | Full-Text | 693-703 | |
Bing Liu; Kaidi Zhao; Lan Yi | |||
The Web is increasingly becoming an important channel for conducting
businesses, disseminating information, and communicating with people on a
global scale. More and more companies, organizations, and individuals are
publishing their information on the Web. With all this information publicly
available, naturally companies and individuals want to find useful information
from these Web pages. As an example, companies always want to know what their
competitors are doing and what products and services they are offering. Knowing
such information, the companies can learn from their competitors and/or design
countermeasures to improve their own competitiveness. The ability to
effectively find such business intelligence information is increasingly
becoming crucial to the survival and growth of any company. Despite its
importance, little work has been done in this area. In this paper, we propose a
novel visualization technique to help the user find useful information from
his/her competitors' Web site easily and quickly. It involves visualizing (with
the help of a clustering system) the comparison of the user's Web site and the
competitor's Web site to find similarities and differences between the sites.
The visualization is such that with a single glance, the user is able to see
the key similarities and differences of the two sites. He/she can then quickly
focus on those interesting clusters and pages to browse the details. Experiment
results and practical applications show that the technique is effective. Keywords: browsing, business intelligence, user-interface, visualization, web site
comparison |
Web montage: a dynamic personalized start page | | BIBAK | Full-Text | 704-712 | |
Corin R. Anderson; Eric Horvitz | |||
Despite the connotation of the words "browsing" and "surfing," web usage
often follows routine patterns of access. However, few mechanisms exist to
assist users with these routine tasks; bookmarks or portal sites must be
maintained manually and are insensitive to the user's browsing context. To fill
this void, we designed and implemented the web montage system, a dynamic personalized start page. Keywords: adaptive user interfaces, adaptive web sites, personalization, user modeling |
Building voiceXML browsers with openVXI | | BIBAK | Full-Text | 713-717 | |
Brian Eberman; Jerry Carter; Darren Meyer; David Goddeau | |||
The OpenVXI is a portable, open-source toolkit that interprets the
VoiceXML dialog markup language. It is designed to serve as a framework for
system integrators and platform vendors who want to incorporate VoiceXML into
their platform. A first version of the toolkit was released in the winter of
2001, with a second version released in September of 2001. A number of
companies and individuals have adopted the toolkit for their platforms. In this
paper we discuss the architecture of the toolkit, the architectural issues
involved with implementing a framework for VoiceXML, performance results with
the OpenVXI, and future directions for the toolkit. Keywords: openVXI, voiceXML |
A graphical user interface toolkit approach to thin-client computing | | BIBAK | Full-Text | 718-725 | |
Simon Lok; Steven K. Feiner; William M. Chiong; Yoav J. Hirsch | |||
Network and server-centric computing paradigms are quickly returning to
being the dominant methods by which we use computers. Web applications are so
prevalent that the role of a PC today has been largely reduced to that of a
terminal for running a client or viewer such as a Web browser. Implementers of
network-centric applications typically rely on the limited capabilities of
HTML, employing proprietary "plug-ins" or transmitting the binary image of an
entire application that will be executed on the client. Alternatively,
implementers can develop without regard for remote use, requiring users who
wish to run such applications on a remote server to rely on a system that
creates a virtual frame buffer on the server, and transmits a copy of its
raster image to the local client.
We review some of the problems that these current approaches pose, and show how they can be solved by developing a distributed user interface toolkit. A distributed user interface toolkit applies techniques to the high level components of a toolkit that are similar to those used at a low level in the X Window System. As an example of this approach, we present RemoteJFC, a working distributed user interface toolkit that makes it possible to develop thin-client applications using a distributed version of the Java Foundation Classes. Keywords: client-server systems, network computing, remote method invocation, user
interface toolkit |
Clustering for opportunistic communication | | BIBAK | Full-Text | 726-735 | |
Jay Budzik; Shannon Bradshaw; Xiaobin Fu; Kristian J. Hammond | |||
We describe ongoing work on I2I, a system aimed at fostering opportunistic
communication among users viewing or manipulating content on the Web and in
productivity applications. Unlike previous work in which the URLs of Web
resources are used to group users visiting the same resource, we present a more
general framework for clustering work contexts to group users together that
accounts for dynamic content and distributional properties of Web accesses
that can limit the utility of URL-based systems. In addition, we describe a
method for scaffolding asynchronous communication in the context of an ongoing
task that takes into account the ephemeral nature of the location of content on
the Web. The techniques we describe also nicely cover local files in progress,
in addition to publicly available Web content. We present the results of
several evaluations that indicate systems that use the techniques we employ may
be more useful than systems that are strictly URL based. Keywords: agents, awareness, clustering, collaboration, context, critical mass,
opportunistic communication |
Featuring web communities based on word co-occurrence structure of communications | | BIBAK | Full-Text | 736-742 | |
Yukio Ohsawa; Hirotaka Soma; Yutaka Matsuo; Naohiro Matsumura; Masaki Usui | |||
Textual communication in message boards is analyzed for classifying Web
communities. We present a communication-content based generalization of an
existing business-oriented classification of Web communities, using KeyGraph, a
method for visualizing the co-occurrence relations between words and word
clusters in text. Here, the text in a message board is analyzed with KeyGraph,
and the structure obtained is shown to reflect the essence of the content-flow.
The relation of this content-flow with participants' interests is then
formalized. Three structural features of the relations between participants and
words, which determine the type of the community, are shown to be computed and
visualized: (1) centralization, (2) context coherence, and (3) creative
decisions. This helps in surveying the essence of a community, e.g. whether the
community creates useful knowledge, how easy it is to join the community, and
whether/why the community is good for making commercial advertisement. Keywords: context, creativity, text mining, web community |
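The word co-occurrence structure that such an analysis starts from can be sketched simply: count how often word pairs appear in the same message and keep the strongest pairs as graph edges. The messages and threshold below are made up; KeyGraph itself goes further, clustering the graph and extracting bridging terms.

    # Build a word co-occurrence graph from message-board text.
    from collections import Counter
    from itertools import combinations

    messages = [
        "new phone battery lasts two days",
        "battery life on the new phone is great",
        "shipping was slow but support helped",
        "support answered my battery question",
    ]

    pair_counts = Counter()
    for msg in messages:
        words = sorted(set(msg.lower().split()))
        pair_counts.update(combinations(words, 2))

    # Keep pairs that co-occur in at least two messages as edges of the graph.
    edges = [(a, b, c) for (a, b), c in pair_counts.items() if c >= 2]
    for a, b, c in sorted(edges, key=lambda e: -e[2]):
        print("%s -- %s (weight %d)" % (a, b, c))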