
Proceedings of the 2002 International Conference on the World Wide Web

Fullname:Proceedings of the 11th International Conference on World Wide Web
Editors:David Lassner; Dave De Roure; Arun Iyengar
Location:Honolulu, Hawaii, USA
Dates:2002-May-07 to 2002-May-11
Standard No:ISBN: 1-58113-449-5; hcibib: WWW02
  1. Performance
  2. Multimedia
  3. Semantic Web Services
  4. Auctions and E-commerce
  5. Crawling
  6. Hypermedia in the Small
  7. Ubiquitous WWW
  8. Extraction and Visualization
  9. Hypermedia in the Large
  10. Performance Workload Char. and Adaptation
  11. Search 1
  12. Applications
  13. Security for Web Applications and P2P
  14. Search 2
  15. Languages & Authoring for the Semantic Web
  16. XML Applications
  17. Link Analysis
  18. Advertising and Security for E-Commerce
  19. Description and Analysis
  20. Query Language for Semantic Web
  21. Mobility and Wireless Access
  22. Ontologies
  23. Browsing
  24. UI and Applications

Performance

Cooperative leases: scalable consistency maintenance in content distribution networks 1-12
  Anoop Ninan; Purushottam Kulkarni; Prashant Shenoy; Krithi Ramamritham; Renu Tewari
In this paper, we argue that cache consistency mechanisms designed for stand-alone proxies do not scale to the large number of proxies in a content distribution network and are not flexible enough to allow consistency guarantees to be tailored to object needs. To meet the twin challenges of scalability and flexibility, we introduce the notion of cooperative consistency along with a mechanism, called cooperative leases, to achieve it. By supporting Δ-consistency semantics and by using a single lease for multiple proxies, cooperative leases allows the notion of leases to be applied in a flexible, scalable manner to CDNs. Further, the approach employs application-level multicast to propagate server notifications to proxies in a scalable manner. We implement our approach in the Apache web server and the Squid proxy cache and demonstrate its efficacy using a detailed experimental evaluation. Our results show a factor of 2.5 reduction in server message overhead and a 20% reduction in server state space overhead when compared to original leases albeit at an increased inter-proxy communication overhead.
Keywords: content distribution networks, data consistency, data dissemination, dynamic data, leases, pull, push, scalability, world wide web
An evaluation of TCP splice benefits in web proxy servers 13-24
  Marcel-Catalin Rosu; Daniela Rosu
This study is the first to evaluate the performance benefits of using the recently proposed TCP Splice kernel service in Web proxy servers. Previous studies show that splicing client and server TCP connections in the IP layer improves the throughput of proxy servers like firewalls and content routers by reducing the data transfer overheads. In a Web proxy server, data transfer overheads represent a relatively large fraction of the request processing overheads, in particular when content is not cacheable or the proxy cache is memory-based. The study is conducted with a socket-level implementation of TCP Splice. Compared to IP-level implementations, socket-level implementations make possible the splicing of connections with different TCP characteristics, and improve response times by reducing recovery delay after a packet loss. The experimental evaluation is focused on HTTP request types for which the proxy can fully exploit the TCP Splice service, which are the requests for non-cacheable content and SSL tunneling. The experimental testbed includes an emulated WAN environment and benchmark applications for HTTP/1.0 Web client, Web server, and Web proxy running on AIX RS/6000 machines. Our experiments demonstrate that TCP Splice enables reductions in CPU utilization of 10-43% of the CPU, depending on file sizes and request rates. Larger relative reductions are observed when tunneling SSL connections, in particular for small file transfers. Response times are also reduced by up to 1.8 sec.
Keywords: TCP splice, web proxy
Clarifying the fundamentals of HTTP 25-36
  Jeffrey C. Mogul
The simplicity of HTTP was a major factor in the success of the Web. However, as both the protocol and its uses have evolved, HTTP has grown complex. This complexity results in numerous problems, including confused implementors, interoperability failures, difficulty in extending the protocol, and a long specification without much documented rationale.
   Many of the problems with HTTP can be traced to unfortunate choices about fundamental definitions and models. This paper analyzes the current (HTTP/1.1) protocol design, showing how it fails in certain cases, and how to improve these fundamentals. Some problems with HTTP can be fixed simply by adopting new models and terminology, allowing us to think more clearly about implementations and extensions. Other problems require explicit (but compatible) protocol changes.
Keywords: HTTP, protocol design

Multimedia

Streaming speech: a framework for generating and streaming 3D text-to-speech and audio presentations to wireless PDAs as specified using extensions to SMIL 37-44
  Stuart Goose; Sreedhar Kodlahalli; William Pechter; Rune Hjelsvold
While monochrome unformatted text and richly colored graphical content are both capable of conveying a message, well designed graphical content has the potential for better engaging the human sensory system. It is our contention that the author of an audio presentation should be afforded the benefit of judiciously exploiting the human aural perceptual ability to deliver content in a more compelling, concise and realistic manner. While contemporary streaming media players and voice browsers share the ability to render content non-textually, neither technology is currently capable of rendering three dimensional media. The contributions described in this paper are proposed 3D audio extensions to SMIL and a server-based framework able to receive a request and, on-demand, process such a SMIL file and dynamically create the multiple simultaneous audio objects, spatialize them in 3D space, multiplex them into a single stereo audio and prepare it for transmission over an audio stream to a mobile device. To the knowledge of the authors, this is the first reported solution for delivering and rendering on a commercially available wireless handheld device a rich 3D audio listening experience as described by a markup language. Naturally, in addition to mobile devices this solution also works with desktop streaming media players.
Keywords: 3D audio, PDA, SMIL, accessibility, location-based, mobile, spatialization, speech synthesis, streaming, wireless
Multimedia meets computer graphics in SMIL2.0: a time model for the web 45-53
  Patrick Schmitz
Multimedia scheduling models provide a rich variety of tools for managing the synchronization of media like video and audio, but generally have an inflexible model for time itself. In contrast, modern animation models in the computer graphics community generally lack tools for synchronization and structural time, but allow for a flexible concept of time, including variable pacing, acceleration and deceleration and other tools useful for controlling and adapting animation behaviors. Multimedia authors have been forced to choose one set of features over the others, limiting the range of presentations they can create. Some programming models addressed some of these problems, but provided no declarative means for authors and authoring tools to leverage the functionality. This paper describes a new model incorporated into SMIL 2.0 that combines the strengths of scheduling models with the flexible time manipulations of animation models. The implications of this integration are discussed with respect to scheduling and structured time, drawing upon experience with SMIL 2.0 timing and synchronization, and the integration with XHTML.
Keywords: animation, multimedia, synchronization, timing
OCTOPUS: aggressive search of multi-modality data using multifaceted knowledge base 54-64
  Jun Yang; Qing Li; Yueting Zhuang
An important trend in Web information processing is the support of multimedia retrieval. However, the most prevailing paradigm for multimedia retrieval, content-based retrieval (CBR), is a rather conservative one whose performance depends on a set of specifically defined low-level features and a carefully chosen sample object. In this paper, an aggressive search mechanism called Octopus is proposed which addresses the retrieval of multi-modality data using multifaceted knowledge. In particular, Octopus promotes a novel scenario in which the user supplies seed objects of arbitrary modality as the hint of his information need, and receives a set of multi-modality objects satisfying his need. The foundation of Octopus is a multifaceted knowledge base constructed on a layered graph model (LGM), which describes the relevance between media objects from various perspectives. A link-analysis-based retrieval algorithm is proposed on top of the LGM. A unique relevance feedback technique is developed to update the knowledge base by learning from user behaviors, and to enhance the retrieval performance in a progressive manner. A prototype implementing the proposed approach has been developed to demonstrate its feasibility and capability through illustrative examples.
Keywords: layered graph model, link analysis, multi-modality data, multifaceted knowledge base, multimedia retrieval, relevance feedback

Semantic Web Services

XL: an XML programming language for web service specification and composition 65-76
  Daniela Florescu; Andreas Grünhagen; Donald Kossmann
We present an XML programming language specially designed for the implementation of Web services. XL is portable and fully compliant with W3C standards such as XQuery, XML Protocol, and XML Schema. One of the key features of XL is that it allows programmers to concentrate on the logic of their application. XL provides high-level and declarative constructs for actions which are typically carried out in the implementation of a Web service; e.g., logging, error handling, retry of actions, workload management, events, etc. Issues such as performance tuning (e.g., caching, horizontal partitioning, etc.) should be carried out automatically by an implementation of the language. This way, the productivity of the programmers, the ability of evolution of the programs, and the chances to achieve good performance are substantially enhanced.
Keywords: XML, programming language, web service
Simulation, verification and automated composition of web services 77-88
  Srini Narayanan; Sheila A. McIlraith
Web services -- Web-accessible programs and devices -- are a key application area for the Semantic Web. With the proliferation of Web services and the evolution towards the Semantic Web comes the opportunity to automate various Web services tasks. Our objective is to enable markup and automated reasoning technology to describe, simulate, compose, test, and verify compositions of Web services. We take as our starting point the DAML-S DAML+OIL ontology for describing the capabilities of Web services. We define the semantics for a relevant subset of DAML-S in terms of a first-order logical language. With the semantics in hand, we encode our service descriptions in a Petri Net formalism and provide decision procedures for Web service simulation, verification and composition. We also provide an analysis of the complexity of these tasks under different restrictions to the DAML-S composite services we can describe. Finally, we present an implementation of our analysis techniques. This implementation takes as input a DAML-S description of a Web service, automatically generates a Petri Net and performs the desired analysis. Such a tool has broad applicability both as a back end to existing manual Web service composition tools, and as a stand-alone tool for Web service developers.
Keywords: DAML, automated reasoning, distributed systems, ontologies, semantic web, web service composition, web services
Semantic web support for the business-to-business e-commerce lifecycle 89-98
  David Trastour; Claudio Bartolini; Chris Preist
If an e-services approach to electronic commerce is to become widespread, standardisation of ontologies, message content and message protocols will be necessary. In this paper, we present a lifecycle of a business-to-business e-commerce interaction, and show how the Semantic Web can support a service description language that can be used throughout this lifecycle. By using DAML, we develop a service description language sufficiently expressive and flexible to be used not only in advertisements, but also in matchmaking queries, negotiation proposals and agreements. We also identify which operations must be carried out on this description language if the B2B lifecycle is to be fully supported. We do not propose specific standard protocols, but instead argue that our operators are able to support a wide variety of interaction protocols, and so will be fundamental irrespective of which protocols are finally adopted.
Keywords: DAML, automated negotiation, e-commerce, matchmaking, semantic web, service description

Auctions and E-commerce

A probabilistic approach to automated bidding in alternative auctions 99-108
  Marlon Dumas; Lachlan Aldred; Guido Governatori; Arthur ter Hofstede; Nick Russell
This paper presents an approach to develop bidding agents that participate in multiple alternative auctions, with the goal of obtaining an item at the lowest price. The approach consists of a prediction method and a planning algorithm. The prediction method exploits the history of past auctions in order to build probability functions capturing the belief that a bid of a given price may win a given auction. The planning algorithm computes the lowest price, such that by sequentially bidding in a subset of the relevant auctions, the agent can obtain the item at that price with an acceptable probability. The approach addresses the case where the auctions are for substitutable items with different values. Experimental results are reported, showing that the approach increases the payoff of their users and the welfare of the market.
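The prediction-plus-planning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that a bid wins whenever it is at or above an auction's closing price, that auction outcomes are independent, and that all items have the same value (the paper also handles differing values). The function names and the sample price histories are hypothetical.

```python
from bisect import bisect_right


def win_probability(past_closing_prices, bid):
    """Share of past auctions that closed at or below `bid` (assumed win rule)."""
    prices = sorted(past_closing_prices)
    return bisect_right(prices, bid) / len(prices)


def success_probability(histories, bid):
    """Chance of winning at least one auction when bidding `bid` in each in turn.

    Assumes the auctions are independent of one another.
    """
    miss = 1.0
    for history in histories:
        miss *= 1.0 - win_probability(history, bid)
    return 1.0 - miss


def lowest_acceptable_bid(histories, target):
    """Lowest observed price whose success probability reaches `target`, else None."""
    candidates = sorted({price for history in histories for price in history})
    for bid in candidates:
        if success_probability(histories, bid) >= target:
            return bid
    return None


# Hypothetical closing-price histories for two alternative auctions.
histories = [[10, 12, 15, 20], [11, 13, 14, 18]]
print(lowest_acceptable_bid(histories, 0.7))
```

Only prices that actually occurred in the histories are tried as candidate bids, since the empirical win probability changes only at those points.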
Law-governed peer-to-peer auctions 109-116
  Marcus Fontoura; Mihail Ionescu; Naftaly Minsky
This paper proposes a flexible architecture for the creation of Internet auctions. It allows the custom definition of the auction parameters, and provides a decentralized control of the auction process. Auction policies are defined as laws in the Law Governed Interaction (LGI) paradigm. Each of these laws specifies not only the auction algorithm itself (e.g. open-cry, dutch, etc.) but also how to handle the other parameters usually involved in online auctions, such as certification, auditing, and treatment of complaints. LGI is used to enforce the rules established in the auction policy within the agents involved in the process. After the agents find out about the auctions, they interact in a peer-to-peer communication protocol, reducing the role of the centralized auction room to an advertising registry, and taking advantage of the distributed nature of the Internet to conduct the auction. The paper presents an example of an auction law, illustrating the use of the proposed architecture.
Keywords: distributed enforcement, distributed systems, law governed interaction, online auctions
Paid placement strategies for internet search engines 117-123
  Hemant K. Bhargava; Juan Feng
Internet search engines and comparison shopping have recently begun implementing a paid placement strategy, where some content providers are given prominent positioning in return for a placement fee. This bias generates placement revenues but creates a disutility to users, thus reducing user-based revenues. We formulate the search engine design problem as a tradeoff between these two types of revenues. We demonstrate that the optimal placement strategy depends on the relative benefits (to providers) and disutilities (to users) of paid placement. We compute the optimal placement fee, characterize the optimal bias level, and analyze sensitivity of the placement strategy to various factors. In the optimal paid placement strategy, the placement revenues are set below the monopoly level due to its negative impact on advertising revenues. An increase in the search engine's quality of service allows it to improve profits from paid placement, moving it closer to the ideal. However, an increase in the value-per-user motivates the gatekeeper to increase market share by reducing further its reliance on paid placement and fraction of paying providers.
Keywords: bias, information gatekeepers, paid placement, promotion, search engines

Crawling

Parallel crawlers 124-135
  Junghoo Cho; Hector Garcia-Molina
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
Keywords: parallelization, web crawler, web spider
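One common way to split crawling work across processes is to assign each URL to a crawler by hashing its host name, so that all pages of a site go to the same process. The sketch below shows that idea only; the abstract does not say which partitioning schemes the paper evaluates, so this should be read as an assumed example rather than one of the authors' architectures. The function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index by hashing its host name.

    A cryptographic digest is used only because it is stable across runs,
    unlike Python's built-in hash() for strings.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def partition(urls, num_crawlers):
    """Group URLs into one bucket per crawler process."""
    buckets = [[] for _ in range(num_crawlers)]
    for url in urls:
        buckets[assign_crawler(url, num_crawlers)].append(url)
    return buckets


urls = [
    "http://example.com/a",
    "http://example.com/b?x=1",
    "http://example.org/index.html",
]
for index, bucket in enumerate(partition(urls, 4)):
    print(index, bucket)
```

Keeping a whole site on one process makes per-site politeness limits easy to enforce locally, at the cost of uneven load when a few sites are very large.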
Optimal crawling strategies for web search engines 136-147
  J. L. Wolf; M. S. Squillante; P. S. Yu; J. Sethuraman; L. Ozsen
Web Search Engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: The frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient -- crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.
Accelerated focused crawling through online relevance feedback 148-159
  Soumen Chakrabarti; Kunal Punera; Mallela Subramanyam
The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded.
Keywords: document object model, focused crawling, reinforcement learning

Hypermedia in the Small

Fluid annotations through open hypermedia: using and extending emerging web standards 160-171
  Niels Olof Bouvin; Polle T. Zellweger; Kaj Grønbæk; Jock D. Mackinlay
The Fluid Documents project has developed various research prototypes that show that powerful annotation techniques based on animated typographical changes can help readers utilize annotations more effectively. Our recently-developed Fluid Open Hypermedia prototype supports the authoring and browsing of fluid annotations on third-party Web pages. This prototype is an extension of the Arakne Environment, an open hypermedia application that can augment Web pages with externally stored hypermedia structures. This paper describes how various Web standards, including DOM, CSS, XLink, XPointer, and RDF, can be used and extended to support fluid annotations.
Keywords: RDF, XLink, XPointer, annotations, Annotea, fluid documents, web augmentation with open hypermedia
Hunter gatherer: interaction support for the creation and management of within-web-page collections 172-181
  m. c. schraefel; Yuxiang Zhu; David Modjeska; Daniel Wigdor; Shengdong Zhao
Hunter Gatherer is an interface that lets Web users carry out three main tasks: (1) collect components from within Web pages; (2) represent those components in a collection; (3) edit those component collections. Our research shows that while the practice of making collections of content from within Web pages is common, it is not frequent, due in large part to poor interaction support in existing tools. We engaged with users in task analysis as well as iterative design reviews in order to understand the interaction issues that are part of within-Web-page collection making and to design an interaction that would support that process.
   We report here on that design development, as well as on the evaluations of the tool that evolved from that process, and the future work stemming from these results, in which our critical question is: what happens to users' perceptions and expectations of web-based information (their web-based information management practices) when they can treat this information as harvestable, recontextualizable data, rather than as fixed pages?
Keywords: attention, collections, information gathering and management, transclusions, web-based interaction design
Model checking cobweb protocols for verification of HTML frames behavior 182-190
  David Stotts; Jaime Navon
HTML documents composed of frames can be difficult to write correctly. We demonstrate a technique that can be used by authors manually creating HTML documents (or by document editors) to verify that complex frame construction exhibits the intended behavior when browsed. The method is based on model checking (an automated program verification technique), and on temporal logic specifications of expected frames behavior. We show how to model the HTML frames source as a CobWeb protocol, related to the Trellis model of hypermedia documents. We show how to convert the CobWeb protocol to input for a model checker, and discuss several ways for authors to create the necessary behavior specifications. Our solution allows Web documents to be built containing a large number of frames and content pages interacting in complex ways. We expect such Web structures to be more useful in "literary" hypermedia than for Web "sites" used as interfaces to organizational information or databases.
Keywords: HTML, browsing semantics, formal semantics, frames, literary hypertext, model checking, temporal logic, verification

Ubiquitous WWW

Implementing physical hyperlinks using ubiquitous identifier resolution 191-199
  Tim Kindberg
Identifier resolution is presented as a way to link the physical world with virtual Web resources. In this paradigm, designed to support nomadic users, the user employs a handheld, wirelessly connected, sensor-equipped device to read identifiers associated with physical entities. The identifiers are resolved into virtual resources or actions related to the physical entities -- as though the user 'clicked on a physical hyperlink'. We have integrated identifier resolution with the Web so that it can be deployed as ubiquitously as the Web, in the infrastructure and on wirelessly connected handheld devices. We enable users to capture resolution services and applications as Web resources in their local context. We use the Web to invoke resolution services, with a model of 'physical' Web form-filling. We propose a scheme for binding identifiers to resources, to promote services and applications linking the physical and virtual worlds.
Keywords: identifier resolution, mobile computing, nomadic computing, physical hyperlinks, ubiquitous computing
Profiles for the situated web 200-209
  Lalitha Suryanarayana; Johan Hjelm
The World Wide Web is evolving into a medium that will soon make it possible to conceive and implement situation-aware services. A situation-aware or situated web application is one that provides the user with an experience (content, interaction and presentation) tailored to his/her current situation. This requires the facts and opinions regarding the context to be communicated to the server by means of a profile, which is then applied against the description of the application objects at the server in order to generate the required experience. This paper discusses a profiles view of the situated web architecture and analyzes the key technologies and capabilities that enable them. We conclude that trusted frameworks wherein rich vocabularies describing users and their context, applications and documents, along with rules for processing them, are critical elements of such architectures.
Keywords: CC/PP, XML, profiles, situated-aware applications, vocabulary, web architecture
The social contract core 210-220
  James H. Kaufman; Stefan Edlund; Daniel A. Ford; Calvin Powers
The information age has brought with it the promise of unprecedented economic growth based on the efficiencies made possible by new technology. This same greater efficiency has left society with less and less time to adapt to technological progress. Perhaps the greatest cost of this progress is the threat to privacy we all face from unconstrained exchange of our personal information. In response to this threat, the World Wide Web Consortium has introduced the "Platform for Privacy Preferences" (P3P) to allow sites to express policies in machine-readable form and to expose these policies to site visitors [1]. However, today P3P does not protect the privacy of individuals, nor does its implementation empower communities or groups to negotiate and establish standards of behavior. We propose a privacy architecture we call the Social Contract Core (SCC), designed to speed the establishment of new "Social Contracts" needed to protect private data. The goal of SCC is to empower communities, speed the "socialization" of new technology, and encourage the rapid access to, and exchange of, information. Addressing these issues is essential, we feel, to both liberty and economic prosperity in the information age[2].
Keywords: P3P, privacy, social contract

Extraction and Visualization

WebFormulate: a web-based visual continual query system 221-231
  Jennifer Leopold; Meg Heimovics; Tyler Palmer
Today there is a plethora of data accessible via the Internet. The Web has greatly simplified the process of searching for, accessing, and sharing information. However, a considerable amount of Internet-distributed data still goes unnoticed and unutilized, particularly in the case of frequently-updated, Internet-distributed databases. In this paper we give an overview of WebFormulate, a Web-based visual continual query system that addresses the problems associated with formulating temporal ad hoc analyses over networks of heterogeneous, frequently-updated data sources. The main distinction between this system and existing Internet facilities to retrieve information and assimilate it into computations is that WebFormulate provides the necessary facilities to perform continual queries, developing and maintaining dynamic links such that Web-based computations and reports automatically maintain themselves. A further distinction is that this system is specifically designed for users of spreadsheet-level ability, rather than professional programmers.
Keywords: continual query, visual programming language, visual query system
A flexible learning system for wrapping tables and lists in HTML documents 232-241
  William W. Cohen; Matthew Hurst; Lee S. Jensen
A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. Examples of such different representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular, and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
Keywords: canopy, learning, record linkage, reference matching
A machine learning based approach for table detection on the web 242-250
  Yalin Wang; Jianying Hu
Table is a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied.
   In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
Keywords: decision tree, information retrieval, layout analysis, machine learning, support vector machine, table detection
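For readers unfamiliar with the metric quoted in this abstract, the F-measure combines precision and recall into one score. The sketch below assumes the balanced form (the harmonic mean, often written F1); the abstract does not state which variant the authors used, and the counts in the usage line are invented for illustration, not taken from the paper.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure: harmonic mean of precision and recall.

    Returns 0.0 when nothing was classified correctly, which avoids
    dividing by zero.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented counts: tables correctly marked genuine, wrongly marked
# genuine, and genuine tables that were missed.
print(f_measure(true_pos=90, false_pos=10, false_neg=10))
```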

Hypermedia in the Large

The structure of broad topics on the web 251-262
  Soumen Chakrabarti; Mukul M. Joshi; Kunal Punera; David M. Pennock
The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
Keywords: social network analysis, web bibliometry
A web-based resource migration protocol using WebDAV BIBAKFull-Text 263-271
  Michael Evans; Steven Furnell
The web's hyperlinks are notoriously brittle, and break whenever a resource migrates. One solution to this problem is a transparent resource migration mechanism, which separates a resource's location from its identity, and helps provide referential integrity. However, although several such mechanisms have been designed, they have not been widely adopted, due largely to a lack of compliance with current web standards. In addition, these mechanisms must be updated manually whenever a resource migrates, limiting their effectiveness for large web sites. Recently, however, new web protocols such as WebDAV (Web Distributed Authoring and Versioning) have emerged, which extend the HTTP protocol and provide a new level of control over web resources. In this paper, we show how we have used these protocols in the design of a new Resource Migration Protocol (RMP), which enables transparent resource migration across standard web servers. The RMP works with a new resource migration mechanism we have developed called the Resource Locator Service (RLS), and is fully backwards compatible with the web's architecture, enabling all web servers and all web content to be involved in the migration process. We describe the protocol and the new RLS in full, together with a prototype implementation and demonstration applications that we have developed. The paper concludes by presenting performance data taken from the prototype that show how the RLS will scale well beyond the size of today's web.
Keywords: WebDAV, link rot, referential integrity, resource locator service, resource migration protocol, web
A comparison of case-based reasoning approaches BIBAKFull-Text 272-280
  Emilia Mendes; Nile Mosley; Ian Watson
Over the years software engineering researchers have suggested numerous techniques for estimating development effort. These techniques have been classified mainly as algorithmic, machine learning and expert judgement. Several studies have compared the prediction accuracy of those techniques, with emphasis placed on linear regression, stepwise regression, and Case-based Reasoning (CBR). To date no converging results have been obtained and we believe they may be influenced by the use of the same CBR configuration.
   The objective of this paper is twofold: first, to describe the application of case-based reasoning for estimating the effort for developing Web hypermedia applications; second, to compare the prediction accuracy of different CBR configurations, using two Web hypermedia datasets.
   Results show that for both datasets the best estimations were obtained with weighted Euclidean distance, using either one analogy (dataset 1) or 3 analogies (dataset 2). We suggest therefore that case-based reasoning is a candidate technique for effort estimation and, with the aid of an automated environment, can be applied to Web hypermedia development effort prediction.
Keywords: case-based reasoning, prediction models, web effort prediction, web hypermedia, web hypermedia metrics
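The configuration named in the abstract above (weighted Euclidean distance with one or three analogies) can be illustrated with a minimal sketch. The feature tuples, weights, and effort values below are hypothetical, and averaging the efforts of the retrieved analogies is an assumed adaptation rule, not one confirmed by the abstract.

```python
import math


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance between two feature tuples."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def estimate_effort(cases, query, weights, k=1):
    """Estimate effort as the mean effort of the k most similar past cases.

    cases: list of (features, effort) pairs from finished projects.
    """
    ranked = sorted(cases, key=lambda c: weighted_euclidean(c[0], query, weights))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical case base: (page count, image count) -> effort in person-hours.
CASES = [((10, 2), 40.0), ((20, 5), 90.0), ((12, 3), 50.0), ((40, 8), 200.0)]
WEIGHTS = (1.0, 2.0)  # hypothetical feature weights

one_analogy = estimate_effort(CASES, (11, 2), WEIGHTS, k=1)
three_analogies = estimate_effort(CASES, (11, 2), WEIGHTS, k=3)
```

Changing `k` or `WEIGHTS` corresponds to what the abstract calls a different CBR configuration.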

Performance Workload Char. and Adaptation

Aliasing on the world wide web: prevalence and performance implications BIBAKFull-Text 281-292
  Terence Kelly; Jeffrey Mogul
Aliasing occurs in Web transactions when requests containing different URLs elicit replies containing identical data payloads. Conventional caches associate stored data with URLs and can therefore suffer redundant payload transfers due to aliasing and other causes. Existing research literature, however, says little about the prevalence of aliasing in user-initiated transactions, or about redundant payload transfers in conventional Web cache hierarchies.
   This paper quantifies the extent of aliasing and the performance impact of URL-indexed cache management using a large client trace from WebTV Networks. Fewer than 5% of reply payloads are aliased (referenced via multiple URLs) but over 54% of successful transactions involve aliased payloads. Aliased payloads account for under 3.1% of the trace's "working set size" (sum of payload sizes) but over 36% of bytes transferred. For the WebTV workload, roughly 10% of payload transfers to browser caches and 23% of payload transfers to a shared proxy are redundant, assuming infinite-capacity conventional caches. Our analysis of a large proxy trace from Compaq Corporation yields similar results.
   URL-indexed caching does not entirely explain the large number of redundant proxy-to-browser payload transfers previously reported in the WebTV system. We consider other possible causes of redundant transfers (e.g., reply metadata and browser cache management policies) and discuss a simple hop-by-hop protocol extension that completely eliminates all redundant transfers, regardless of cause.
Keywords: DTD, HTTP, WWW, Zipf's law, aliasing, cache hierarchies, caching, duplicate suppression, duplicate transfer detection, hypertext transfer protocol, performance analysis, redundant transfers, resource modification, world wide web
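The definition of aliasing in the abstract above (different URLs, identical payloads) suggests a simple way to measure it in a trace. The sketch below is an illustration only: the `(url, payload)` trace format and the choice of SHA-256 as the payload digest are assumptions, not details taken from the paper.

```python
import hashlib
from collections import defaultdict


def aliasing_stats(transactions):
    """Return (fraction of distinct payloads that are aliased,
    fraction of transactions whose payload is aliased).

    transactions: list of (url, payload_bytes) pairs.
    """
    if not transactions:
        return 0.0, 0.0
    urls_by_digest = defaultdict(set)
    digests = []
    for url, payload in transactions:
        digest = hashlib.sha256(payload).hexdigest()
        urls_by_digest[digest].add(url)
        digests.append(digest)
    # A payload is aliased when more than one URL refers to it.
    aliased = {d for d, urls in urls_by_digest.items() if len(urls) > 1}
    payload_fraction = len(aliased) / len(urls_by_digest)
    transaction_fraction = sum(d in aliased for d in digests) / len(digests)
    return payload_fraction, transaction_fraction


# Hypothetical five-transaction trace.
TRACE = [("a", b"x"), ("b", b"x"), ("c", b"y"), ("c", b"y"), ("d", b"z")]
payload_fraction, transaction_fraction = aliasing_stats(TRACE)
```

Indexing by digest rather than by URL is also the idea behind treating the two transfers of `b"x"` as one cached object.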
Flash crowds and denial of service attacks: characterization and implications for CDNs and web sites BIBAKFull-Text 293-304
  Jaeyeon Jung; Balachander Krishnamurthy; Michael Rabinovich
The paper studies two types of events that often overload Web sites to the point where their services are degraded or disrupted entirely -- flash events (FEs) and denial of service attacks (DoS). The former are created by legitimate requests and the latter contain malicious requests whose goal is to subvert the normal operation of the site. We study the properties of both types of events with special attention to characteristics that distinguish the two. Identifying these characteristics allows a formulation of a strategy for Web sites to quickly discard malicious requests. We also show that some content distribution networks (CDNs) may not provide the desired level of protection to Web sites against flash events. We therefore propose an enhancement to CDNs that offers better protection and use trace-driven simulations to study the effect of our enhancement on CDNs and Web sites.
Keywords: content distribution network performance, denial of service attack, flash crowd, web workload characterization
Improving web performance by client characterization driven server adaptation BIBAKFull-Text 305-316
  Balachander Krishnamurthy; Craig E. Wills
We categorize the set of clients communicating with a server on the Web based on information that can be determined by the server. The Web server uses the information to direct tailored actions. Users with poor connectivity may choose not to stay at a Web site if it takes a long time to receive a page, even if the Web server at the site is not the bottleneck. Retaining such clients may be of interest to a Web site. Better connected clients can receive enhanced representations of Web pages, such as with higher quality images.
   We explore a variety of considerations that could be used by a Web server in characterizing a client. Once a client is characterized as poor or rich, the server can deliver altered content, alter how content is delivered, alter policy and caching decisions, or decide when to redirect the client to a mirror site. We also use network-aware client clustering techniques to provide a coarser level of client categorization and use it to categorize subsequent clients from that cluster for which a client-specific categorization is not available.
   Our results for client characterization and applicable server actions are derived from a real, recent, and diverse set of Web server logs. Our experiments demonstrate that a relatively simple characterization policy can classify poor clients such that these clients subsequently make the majority of badly performing requests to a Web server. This policy is also stable in terms of clients staying in the same class for a large portion of the analysis period. Client clustering can significantly help in initially classifying clients for which no previous information about the client is known. We also show that different server actions can be applied to a significant number of request sequences with poor performance.
Keywords: client characterization, client connectivity, server adaptation

Search 1

Extracting query modifications from nonlinear SVMs BIBAKFull-Text 317-324
  Gary W. Flake; Eric J. Glover; Steve Lawrence; C. Lee Giles
When searching the WWW, users often desire results restricted to a particular document category. Ideally, a user would be able to filter results with a text classifier to minimize false positive results; however, current search engines allow only simple query modifications. To automate the process of generating effective query modifications, we introduce a sensitivity analysis-based method for extracting rules from nonlinear support vector machines. The proposed method allows the user to specify a desired precision while attempting to maximize the recall. Our method performs several levels of dimensionality reduction and is vastly faster than searching the combination feature space; moreover, it is very effective on real-world data.
Keywords: query modification, rule extraction, sensitivity analysis, support vector machine
Probabilistic query expansion using query logs BIBAKFull-Text 325-332
  Hang Cui; Ji-Rong Wen; Jian-Yun Nie; Wei-Ying Ma
Query expansion has long been suggested as an effective way to resolve the short query and word mismatching problems. A number of query expansion methods have been proposed in traditional information retrieval. However, these previous methods do not take into account the specific characteristics of web searching; in particular, the availability of large amounts of user interaction information recorded in web query logs. In this study, we propose a new method for query expansion based on query logs. The central idea is to extract probabilistic correlations between query terms and document terms by analyzing query logs. These correlations are then used to select high-quality expansion terms for new queries. The experimental results show that our log-based probabilistic query expansion method can greatly improve the search performance and has several advantages over other existing methods.
Keywords: information retrieval, log mining, probabilistic model, query expansion, search engine
Expert agreement and content based reranking in a meta search environment using Mearf BIBAKFull-Text 333-344
  B. Uygar Oztekin; George Karypis; Vipin Kumar
The recent increase in the number of search engines on the Web and the availability of meta search engines that can query multiple search engines make it important to find effective methods for combining results coming from different sources. In this paper we introduce novel methods for reranking in a meta search environment based on expert agreement and contents of the snippets. We also introduce an objective way of evaluating different methods for ranking search results that is based upon implicit user judgements. We incorporated our methods and two variations of commonly used merging methods in our meta search engine, Mearf, and carried out an experimental study using logs accumulated over a period of twelve months. Our experiments show that the choice of the method used for merging the output produced by different search engines plays a significant role in the overall quality of the search results. In almost all cases examined, results produced by some of the new methods introduced were consistently better than the ones produced by traditional methods commonly used in various meta search engines. These observations suggest that the proposed methods can offer a relatively inexpensive way of improving the meta search experience over existing methods.
Keywords: collection fusion, expert agreement, merging, meta search, reranking

Applications

YouServ: a web-hosting and content sharing tool for the masses BIBAKFull-Text 345-354
  Roberto J. Bayardo, Jr.; Rakesh Agrawal; Daniel Gruhl; Amit Somani
YouServ is a system that allows its users to pool existing desktop computing resources for high availability web hosting and file sharing. By exploiting standard web and internet protocols (e.g. HTTP and DNS), YouServ does not require those who access YouServ-published content to install special purpose software. Because it requires minimal server-side resources and administration, YouServ can be provided at a very low cost. We describe the design, implementation, and a successful intranet deployment of the YouServ system, and compare it with several alternatives.
Keywords: decentralized systems, p2p, peer-to-peer networks, web hosting
Dynamic coordination of information management services for processing dynamic web content BIBAKFull-Text 355-365
  In-Young Ko; Ke-Thia Yao; Robert Neches
Dynamic Web content provides us with time-sensitive and continuously changing data. To glean up-to-date information, users need to regularly browse, collect and analyze this Web content. Without proper tool support this information management task is tedious, time-consuming and error prone, especially when the quantity of the dynamic Web content is large, when many information management services are needed to analyze it, and when underlying services/network are not completely reliable. This paper describes a multi-level, lifecycle (design-time and run-time) coordination mechanism that enables rapid, efficient development and execution of information management applications that are especially useful for processing dynamic Web content. Such a coordination mechanism brings dynamism to coordinating independent, distributed information management services. Dynamic parallelism spawns/merges multiple execution service branches based on available data, and dynamic run-time reconfiguration coordinates service execution to overcome faulty services and bottlenecks. These features enable information management applications to be more efficient in handling content and format changes in Web resources, and enable the applications to be evolved and adapted to process dynamic Web content.
Keywords: dynamic service coordination, dynamic web content, scalable component-based software systems, semantic interoperability, web information management systems
Price modeling in standards for electronic product catalogs based on XML BIBAKFull-Text 366-375
  Oliver Kelkar; Joerg Leukel; Volker Schmitz
The rapid spread of electronic business-to-business procurement systems has led to the development of new standards for the exchange of electronic product catalogs (e-catalogs). E-catalogs contain a variety of information about products, of which price information is essential. Prices are used for buying decisions and the order transactions that follow. While simple price models are often sufficient for the description of indirect goods (e.g. office supplies), other goods and lines of business make higher demands. In this paper we examine what price information is contained in commercial XML standards for the exchange of product catalog data. For that purpose we bring the different implicit price models of the examined catalog standards together and provide a generalized model.
Keywords: B2B, XML, e-business, e-catalog, e-procurement, pricing

Security for Web Applications and P2P

Choosing reputable servents in a P2P network BIBAKFull-Text 376-386
  Fabrizio Cornelli; Ernesto Damiani; Sabrina De Capitani di Vimercati; Stefano Paraboschi; Pierangela Samarati
Peer-to-peer information sharing environments are increasingly gaining acceptance on the Internet as they provide an infrastructure in which the desired information can be located and downloaded while preserving the anonymity of both requestors and providers. As recent experience with P2P environments such as Gnutella shows, anonymity opens the door to possible misuses and abuses by resource providers exploiting the network as a way to spread tampered with resources, including malicious programs, such as Trojan Horses and viruses.
   In this paper we propose an approach to P2P security where servents can keep track, and share with others, information about the reputation of their peers. Reputation sharing is based on a distributed polling algorithm by which resource requestors can assess the reliability of prospective providers before initiating the download. The approach nicely complements the existing P2P protocols and has a limited impact on current implementations. Furthermore, it keeps the current level of anonymity of requestors and providers, as well as that of the parties sharing their view on others' reputations.
Keywords: P2P network, credibility, polling protocol, reputation
Certified email with a light on-line trusted third party: design and implementation BIBAFull-Text 387-395
  Martín Abadi; Neal Glew
This paper presents a new protocol for certified email. The protocol aims to combine security, scalability, easy implementation, and viable deployment. The protocol relies on a light on-line trusted third party; it can be implemented without any special software for the receiver beyond a standard email reader and web browser, and does not require any public-key infrastructure.
Abstracting application-level web security BIBAKFull-Text 396-407
  David Scott; Richard Sharp
Application-level web security refers to vulnerabilities inherent in the code of a web-application itself (irrespective of the technologies in which it is implemented or the security of the web-server/back-end database on which it is built). In the last few months application-level vulnerabilities have been exploited with serious consequences: hackers have tricked e-commerce sites into shipping goods for no charge, user-names and passwords have been harvested and confidential information (such as addresses and credit-card numbers) has been leaked.
   In this paper we investigate new tools and techniques which address the problem of application-level web security. We (i) describe a scalable structuring mechanism facilitating the abstraction of security policies from large web-applications developed in heterogeneous multi-platform environments; (ii) present a tool which assists programmers in developing secure applications which are resilient to a wide range of common attacks; and (iii) report results and experience arising from our implementation of these techniques.
Keywords: application-level web security, component-based design, security policy description language

Search 2

Probabilistic question answering on the web BIBAKFull-Text 408-419
  Dragomir Radev; Weiguo Fan; Hong Qi; Harris Wu; Amardeep Grewal
Web-based search engines such as Google and NorthernLight return documents that are relevant to a user query, not answers to user questions. We have developed an architecture that augments existing search engines so that they support natural language question answering. The process entails five steps: query modulation, document retrieval, passage extraction, phrase extraction, and answer ranking. In this paper we describe some probabilistic approaches to the last three of these stages. We show how our techniques apply to a number of existing search engines and we also present results contrasting three different methods for question answering. Our algorithm, probabilistic phrase reranking (PPR) using proximity and question type features achieves a total reciprocal document rank of .20 on the TREC 8 corpus. Our techniques have been implemented as a Web-accessible system, called NSIR.
Keywords: answer extraction, answer selection, information retrieval, natural language processing, query modulation, question answering, search engines
Searching with numbers BIBAFull-Text 420-431
  Rakesh Agrawal; Ramakrishnan Srikant
A large fraction of the useful web consists of specification documents that are largely made up of (attribute name, numeric value) pairs embedded in text. Examples include product information, classified advertisements, resumes, etc. The approach taken in the past to search these documents by first establishing correspondences between values and their names has achieved limited success because of the difficulty of extracting this information from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has "low reflectivity", we can do effective search even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high reflectivity datasets. Finally, we validate our approach by showing that we get high precision in our answers on real datasets from a variety of domains.
Evaluating strategies for similarity search on the web BIBAKFull-Text 432-442
  Taher H. Haveliwala; Aristides Gionis; Dan Klein; Piotr Indyk
Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
Keywords: evaluation, open directory project, related pages, search, similarity search

Languages & Authoring for the Semantic Web

The Yin/Yang web: XML syntax and RDF semantics BIBAKFull-Text 443-453
  Peter Patel-Schneider; Jérôme Siméon
XML is the W3C standard document format for writing and exchanging information on the Web. RDF is the W3C standard model for describing the semantics and reasoning about information on the Web. Unfortunately, RDF and XML -- although very close to each other -- are based on two different paradigms. We argue that in order to lead the Semantic Web to its full potential, the syntax and the semantics of information need to work together. To this end, we develop a model-theoretic semantics for the XML XQuery 1.0 and XPath 2.0 Data Model, which provides a unified model for both XML and RDF. This unified model can serve as the basis for Web applications that deal with both data and semantics. We illustrate the use of this model on a concrete information integration scenario. Our approach enables each side of the fence to benefit from the other; notably, we show how the RDF world can take advantage of XML query languages, and how the XML world can take advantage of the reasoning capabilities available for RDF.
Keywords: RDF, XML, data models, model theory, semantic web
Unparsing RDF/XML BIBAKFull-Text 454-461
  Jeremy J. Carroll
It is difficult to serialize an RDF graph as a humanly readable RDF/XML document. This paper describes the approach taken in Jena 1.2, in which a design pattern of guarded procedures invoked using top down recursive descent is used. Each procedure corresponds to a grammar rule; the guard makes the choice about the applicability of the production. This approach is seen to correspond closely to the design of an LL(k) parser, and a theoretical justification of this correspondence is found in universal algebra.
Keywords: RDF, XML, generation, grammar, parsing, universal algebra, unparsing
Authoring and annotation of web pages in CREAM BIBAKFull-Text 462-473
  Siegfried Handschuh; Steffen Staab
Richly interlinked, machine-understandable data constitute the basis for the Semantic Web. We provide a framework, CREAM, that allows for creation of metadata. While the annotation mode of CREAM allows metadata to be created for existing web pages, the authoring mode lets authors create metadata -- almost for free -- while putting together the content of a page.
   As a particularity of our framework, CREAM allows the creation of relational metadata, i.e. metadata that instantiate interrelated definitions of classes in a domain ontology rather than a comparatively rigid template-like schema such as Dublin Core. We discuss some of the requirements one has to meet when developing such an ontology-based framework, e.g. the integration of a metadata crawler, inference services, document management and a meta-ontology, and describe its implementation, viz. Ont-O-Mat, a component-based, ontology-driven Web page authoring and annotation tool.
Keywords: RDF, annotation, metadata, semantic web

XML Applications

An incremental XSLT transformation processor for XML document manipulation BIBAKFull-Text 474-485
  Lionel Villard; Nabil Layaïda
In this paper, we present an incremental transformation framework called incXSLT. This framework has been tested with the XSLT language defined at the World Wide Web Consortium. With the currently available tools, designing the XML content and the transformation sheets is an inefficient, tedious, and error-prone experience. Incremental transformation processors such as incXSLT represent a better alternative to help in the design of both the content and the transformation sheets. We believe that such frameworks are a first step toward fully interactive transformation-based authoring environments.
Keywords: XML, XSLT, authoring tools, incremental transformations
An event-condition-action language for XML BIBAKFull-Text 486-495
  James Bailey; Alexandra Poulovassilis; Peter T. Wood
XML repositories are now a widespread means for storing and exchanging information on the Web. As these repositories become increasingly used in dynamic applications such as e-commerce, there is a rapidly growing need for a mechanism to incorporate reactive functionality in an XML setting. Event-condition-action (ECA) rules are a technology from active databases and are a natural method for supporting such functionality. ECA rules can be used for activities such as automatically enforcing document constraints, maintaining repository statistics, and facilitating publish/subscribe applications. An important question associated with the use of ECA rules is how to statically predict their run-time behaviour. In this paper, we define a language for ECA rules on XML repositories. We then investigate methods for analysing the behaviour of a set of ECA rules, a task which has added complexity in this XML setting compared with conventional active databases.
Keywords: XML, XML repositories, event-condition-action rules, reactive functionality, rule analysis
Fast and efficient client-side adaptivity for SVG BIBAKFull-Text 496-507
  Kim Marriott; Bernd Meyer; Laurent Tardif
The Scalable Vector Graphics format SVG is already substantially improving graphics delivery on the web, but some important issues still remain to be addressed. In particular, SVG does not support client-side adaption of documents to different viewing conditions, such as varying screen sizes, style preferences or different device capabilities. Based on our earlier work we show how SVG can be extended with constraint-based specification of document layout to augment it with adaptive capabilities. The core of our proposal is to include one-way constraints into SVG, which offer more expressiveness than the previously suggested class of linear constraints and at the same time require substantially less computational effort.
Keywords: CSVG, SVG, adaptivity, constraints, differential scaling, interaction, scalable vector graphics, semantic zooming

Link Analysis

Web page scoring systems for horizontal and vertical search BIBAKFull-Text 508-516
  Michelangelo Diligenti; Marco Gori; Marco Maggini
Page ranking is a fundamental step towards the construction of effective search engines for both generic (horizontal) and focused (vertical) search. Ranking schemes for horizontal search like the PageRank algorithm used by Google operate on the topology of the graph, regardless of the page content. On the other hand, the recent development of vertical portals (vortals) makes it useful to adopt scoring systems focussed on the topic and taking the page content into account.
   In this paper, we propose a general framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. Finally, experimental results are given to assess the features of the proposed scoring systems with special emphasis on vertical search.
Keywords: Focused PageRank, HITS, PageRank, random walks, web page scoring systems
Topic-sensitive PageRank BIBAKFull-Text 517-526
  Taher H. Haveliwala
In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector is computed, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. By using these (precomputed) biased PageRank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared.
Keywords: PageRank, link structure, personalized search, search, search in context, web graph
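The idea in the abstract above, a PageRank vector biased toward a set of representative topic pages, can be sketched as a power iteration whose teleport distribution is restricted to those pages. This is a minimal illustration under stated assumptions: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the toy graph are all choices made here, and the query-time combination of several topic vectors described in the abstract is not shown.

```python
def biased_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Power iteration with teleportation restricted to topic_pages.

    links: dict mapping each page to the list of pages it links to.
    topic_pages: non-empty set of pages that appear in the graph.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    teleport = {n: (1.0 / len(topic_pages) if n in topic_pages else 0.0)
                for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = links.get(n, [])
            if out:
                share = damping * rank[n] / len(out)
                for target in out:
                    nxt[target] += share
            else:
                # Assumed policy: a page without out-links returns its
                # rank to the topic pages via the teleport distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


# Hypothetical three-page graph.
LINKS = {"a": ["b"], "b": ["a"], "c": ["a"]}
rank_topic_a = biased_pagerank(LINKS, {"a"})
rank_topic_c = biased_pagerank(LINKS, {"c"})
```

With the bias on `a`, page `c` receives no rank at all, while biasing on `c` gives it the full teleport mass, which shows how the same graph yields different importance scores per topic.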
Improvement of HITS-based algorithms on web documents BIBAKFull-Text 527-535
  Longzhuang Li; Yi Shang; Wei Zhang
In this paper, we present two ways to improve the precision of HITS-based algorithms on Web documents. First, by analyzing the limitations of current HITS-based algorithms, we propose a new weighted HITS-based method that assigns appropriate weights to in-links of root documents. Then, we combine content analysis with HITS-based algorithms and study the effects of four representative relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad topic queries. Our experimental results show that our weighted HITS-based method performs significantly better than Bharat's improved HITS algorithm. When we combine our weighted HITS-based method or Bharat's HITS algorithm with any of the four relevance scoring methods, the combined methods are only marginally better than our weighted HITS-based method. Between the four relevance-scoring methods, there is no significant quality difference when they are combined with a HITS-based algorithm.
Keywords: HITS-based algorithms, information retrieval, relevance scoring methods

Advertising and Security for E-Commerce

Improvements in practical aspects of optimally scheduling web advertising BIBAKFull-Text 536-541
  Atsuyoshi Nakamura
We address two issues concerning the practical aspects of the optimal scheduling of web advertising proposed by Langheinrich et al. [5], a scheduling that maximizes the total number of click-throughs for all banner advertisements. One is the problem of multi-impressions, in which two or more banner ads are impressed at the same time. The other is inventory management, which is important in order to prevent over-selling and maximize revenue. We propose efficient methods which deal with these two issues.
Keywords: electronic commerce, inventory management, on-line advertisement, optimization, world-wide web
A lightweight protocol for the generation and distribution of secure e-coupons BIBAKFull-Text 542-552
  Carlo Blundo; Stelvio Cimato; Annalisa De Bonis
A form of advertisement that is becoming very popular on the web is based on electronic coupon (e-coupon) distribution. E-coupons are the digital analogue of paper coupons, which are used to provide customers with discounts or gifts in order to encourage the purchase of certain products. So far, the potential of digital coupons has not been fully exploited on the web. This is mostly due to the lack of "efficient" techniques to handle the generation and distribution of e-coupons. In this paper we discuss models and protocols for e-coupons satisfying a number of security requirements. Our protocol is lightweight and preserves the privacy of the users, since it does not require any registration phase.
Keywords: accountability, e-commerce, e-coupons, security
Protecting electronic commerce from distributed denial-of-service attacks BIBAKFull-Text 553-561
  José Brustoloni
It is widely recognized that distributed denial-of-service (DDoS) attacks can disrupt electronic commerce and cause large revenue losses. However, effective defenses continue to be mostly unavailable. We describe and evaluate VIPnet, a novel value-added network service for protecting e-commerce and other transaction-based sites from DDoS attacks. In VIPnet, e-merchants pay Internet Service Providers (ISPs) to carry the packets of the e-merchants' best clients (called VIPs) in a privileged class of service (CoS), protected from congestion, whether malicious or not, in the regular CoS. VIPnet rewards VIPs with not only better quality of service, but also greater availability. Because VIP rights are client- and server-specific, cannot be forged, are usage-limited, and are only replenished after successful client transactions (e.g., purchases), it is impractical for attackers to mount and sustain DDoS attacks against an e-merchant's VIPs. VIPnet can be deployed incrementally and does not require universal adoption. Experiments demonstrate VIPnet's benefits.
Keywords: denial of service, electronic commerce, quality of service

Description and Analysis

Using web structure for classifying and describing web pages BIBAKFull-Text 562-569
  Eric J. Glover; Kostas Tsioutsiouliklis; Steve Lawrence; David M. Pennock; Gary W. Flake
The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
Keywords: SVM, anchortext, classification, cluster naming, entropy based feature extraction, evaluation, web directory, web structure
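The cluster-naming step mentioned in the abstract above ranks words by expected entropy loss. A minimal sketch follows, assuming the standard definition (prior class entropy minus the entropy expected after observing whether a feature is present) and treating each document as a set of words. The function names and the toy documents are hypothetical and are not taken from the paper.

```python
import math


def entropy(pos, neg):
    """Binary entropy, in bits, of a set with `pos` and `neg` examples."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def expected_entropy_loss(feature, positives, negatives):
    """Prior entropy minus expected entropy after splitting on `feature`."""
    n_pos, n_neg = len(positives), len(negatives)
    total = n_pos + n_neg
    pos_with = sum(feature in doc for doc in positives)
    neg_with = sum(feature in doc for doc in negatives)
    with_f = pos_with + neg_with
    without_f = total - with_f
    prior = entropy(n_pos, n_neg)
    posterior = (
        (with_f / total) * entropy(pos_with, neg_with)
        + (without_f / total) * entropy(n_pos - pos_with, n_neg - neg_with)
    )
    return prior - posterior


def rank_features(positives, negatives):
    """Sort the vocabulary by decreasing entropy loss (ties broken by name)."""
    vocab = set().union(*positives, *negatives)
    return sorted(
        vocab,
        key=lambda f: (-expected_entropy_loss(f, positives, negatives), f),
    )


# Hypothetical citing-document text: two pages in the cluster, two outside it.
positives = [{"java", "compiler"}, {"java", "jvm"}]
negatives = [{"coffee", "beans"}, {"tea", "leaves"}]
ranked = rank_features(positives, negatives)
```

Note that entropy loss is symmetric: a word found only in negative examples scores as highly as one found only in positives, so a naming application would presumably also check which class a top-ranked word favors.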
ChangeDetector: a site-level monitoring tool for the WWW BIBAKFull-Text 570-579
  Vijay Boyapati; Kristie Chevrier; Avi Finkel; Natalie Glance; Tom Pierce; Robert Stockton; Chip Whitmer
This paper presents a new challenge for Web monitoring tools: to build a system that can monitor entire web sites effectively. Such a system could potentially be used to discover "silent news" hidden within corporate web sites. Examples of silent news include reorganizations in the executive team of a company or the retirement of a product line. ChangeDetector, an implemented prototype, addresses this challenge by incorporating a number of machine learning techniques. The principal backend components of ChangeDetector all rely on machine learning: intelligent crawling, page classification and entity-based change detection. Intelligent crawling enables ChangeDetector to selectively crawl the most relevant pages of very large sites. Classification allows change detection to be filtered by topic. Entity extraction over changed pages permits change detection to be filtered by semantic concepts, such as person names, dates, addresses, and phone numbers. Finally, the front end presents a flexible way for subscribers to interact with the database of detected changes to pinpoint those changes most likely to be of interest.
Keywords: URL monitoring, classification, information extraction, intelligent crawling, machine learning
Template detection via data mining and its applications BIBAKFull-Text 580-591
  Ziv Bar-Yossef; Sridhar Rajagopalan
We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
Keywords: data mining, hypertext, information retrieval, web searching

Query Language for Semantic Web

RQL: a declarative query language for RDF BIBAFull-Text 592-603
  Gregory Karvounarakis; Sofia Alexaki; Vassilis Christophides; Dimitris Plexousakis; Michel Scholl
Real-scale Semantic Web applications, such as Knowledge Portals and E-Marketplaces, require the management of large volumes of metadata, i.e., information describing the available Web content and services. Better knowledge about their meaning, usage, accessibility or quality will considerably facilitate the automated processing of Web resources. The Resource Description Framework (RDF) enables the creation and exchange of metadata as normal Web data. Although voluminous RDF descriptions are already appearing, sufficiently expressive declarative languages for querying both RDF descriptions and schemas are still missing. In this paper, we propose a new RDF query language called RQL. It is a typed functional language (à la OQL) and relies on a formal model for directed labeled graphs permitting the interpretation of superimposed resource descriptions by means of one or more RDF schemas. RQL adapts the functionality of semistructured/XML query languages to the peculiarities of RDF but, foremost, it enables uniform querying of both resource descriptions and schemas. We illustrate the RQL syntax, semantics and typing system by means of a set of example queries and report on the performance of our persistent RDF Store employed by the RQL interpreter.
EDUTELLA: a P2P networking infrastructure based on RDF BIBAKFull-Text 604-615
  Wolfgang Nejdl; Boris Wolf; Changtao Qu; Stefan Decker; Michael Sintek; Ambjörn Naeve; Mikael Nilsson; Matthias Palmér; Tore Risch
Metadata for the World Wide Web is important, but metadata for Peer-to-Peer (P2P) networks is absolutely crucial. In this paper we discuss the open source project Edutella, which builds upon metadata standards defined for the WWW and aims to provide an RDF-based metadata infrastructure for P2P applications, building on the recently announced JXTA Framework. We describe the goals and main services this infrastructure will provide and the architecture to connect Edutella Peers based on exchange of RDF metadata. As the query service is one of the core services of Edutella, upon which other services are built, we specify in detail the Edutella Common Data Model (ECDM) as the basis for the Edutella query exchange language (RDF-QEL-i) and format implementing distributed queries over the Edutella network. Finally, we briefly discuss registration and mediation services, and introduce the prototype and application scenario for our current Edutella-aware peers.
Keywords: e-Learning, peer-to-peer, query languages, semantic web
Translating XSLT programs to Efficient SQL queries BIBAKFull-Text 616-626
  Sushant Jain; Ratul Mahajan; Dan Suciu
We present an algorithm for translating XSLT programs into SQL. Our context is that of virtual XML publishing, in which a single XML view is defined from a relational database, and subsequently queried with XSLT programs. Each XSLT program is translated into a single SQL query and run entirely in the database engine. Our translation works for a large fragment of XSLT, which we define, that includes descendant/ancestor axis, recursive templates, modes, parameters, and aggregates. We put considerable effort into generating correct and efficient SQL queries and describe several optimization techniques to achieve this efficiency. We have tested our system on all 22 SQL queries of the TPC-H database benchmark, which we represented in XSLT and then translated back to SQL using our translator.
Keywords: SQL, XML, XSLT, query optimization, translation, virtual view

Mobility and Wireless Access

Personalized pocket directories for mobile devices BIBAKFull-Text 627-638
  Doron Cohen; Michael Herscovici; Yael Petruschka; Yoëlle S. Maarek; Aya Soffer
In spite of the increase in the availability of mobile devices in the last few years, Web information is not yet as accessible from PDAs or WAP phones as it is from the desktop. In this paper, we propose a solution for supporting one of the most popular information discovery mechanisms, namely Web directory navigation, from mobile devices. Our proposed solution consists of caching enough information on the device itself in order to conduct most of the navigation actions locally (with subsecond response time) while intermittently communicating with the server to receive updates and additional data requested by the user. The cached information is captured in a "directory capsule". The directory capsule represents only the portion of the directory that is of interest to the user in a given context and is sufficiently rich and consistent to support the information needs of the user in disconnected mode. We define a novel subscription model specifically geared for Web directories and for the special needs of PDAs. This subscription model enables users to specify the parts of the directory that are of interest to them as well as the preferred granularity. We describe a mechanism for keeping the directory capsule in sync over time with the Web directory and user subscription requests. Finally, we present the Pocket Directory Browser for Palm powered computers that we have developed. The pocket directory can be used to define, view and manipulate the capsules that are stored on the Palm. We provide several usage examples of our system on the Open Directory Project, one of the largest and most popular Web directories.
Keywords: hierarchical browsers, mobile devices, mobile search, personalization
A web middleware architecture for dynamic customization of content for wireless clients BIBAKFull-Text 639-650
  Jesse Steinberg; Joseph Pasquale
We present a new Web middleware architecture that allows users to customize their view of the Web for optimal interaction and system operation when using non-traditional resource-limited client machines such as wireless PDAs (personal digital assistants). Web Stream Customizers (WSC) are dynamically deployable software modules and can be strategically located between client and server to achieve improvements in performance, reliability, or security. An important design feature is that Customizers provide two points of control in the communication path between client and server, supporting adaptive system-based and content-based customization. Our architecture exploits HTTP's proxy capabilities, allowing Customizers to be seamlessly integrated with the basic Web transaction model. We describe the WSC architecture and implementation, and illustrate its use with three non-trivial, adaptive Customizer applications that we have built. We show that the overhead in our implementation is small and tolerable, and is outweighed by the benefits that Customizers provide.
Keywords: HTTP, middleware, mobile code, proxy, wireless
Mobile streaming media CDN enabled by dynamic SMIL BIBAKFull-Text 651-661
  Takeshi Yoshimura; Yoshifumi Yonemoto; Tomoyuki Ohya; Minoru Etoh; Susie Wee
In this paper, we present a mobile streaming media CDN (Content Delivery Network) architecture in which content segmentation, request routing, pre-fetch scheduling, and session handoff are controlled by SMIL (Synchronized Multimedia Integration Language) modification. In this architecture, mobile clients simply follow modified SMIL files downloaded from a streaming portal server; these modifications enable multimedia content to be delivered to the mobile clients from the best surrogates in the CDN. The key components of this architecture are 1) content segmentation with SMIL modification, 2) on-demand rewriting of URLs in SMIL, 3) pre-fetch scheduling based on timing information derived from SMIL, and 4) SMIL updates by SOAP (Simple Object Access Protocol) messaging for session handoffs due to client mobility. We also introduce QoS control with a network agent called an "RTP monitoring agent" to enable appropriate control of media quality based on both network congestion and radio link conditions. The current status of our prototyping on a mobile QoS testbed "MOBIQ" is reported in this paper. We are currently designing the SOAP-based APIs (Application Programming Interfaces) needed for the mobile streaming media CDN and building the CDN over the current testbed.
Keywords: CDN, SMIL, mobile network, streaming media

Ontologies

Learning to map between ontologies on the semantic web BIBAKFull-Text 662-673
  AnHai Doan; Jayant Madhavan; Pedro Domingos; Alon Halevy
Ontologies play a prominent role on the Semantic Web. They make possible the widespread publication of machine understandable data, opening myriad opportunities for automated information processing. However, because of the Semantic Web's distributed nature, data on it will inevitably come from many different ontologies. Information processing across ontologies is not possible without knowing the semantic mappings between their elements. Manually finding such mappings is tedious, error-prone, and clearly not possible at the Web scale. Hence, the development of tools to assist in the ontology mapping process is crucial to the success of the Semantic Web.
   We describe GLUE, a system that employs machine learning techniques to find such mappings. Given two ontologies, for each concept in one ontology GLUE finds the most similar concept in the other ontology. We give well-founded probabilistic definitions to several practical similarity measures, and show that GLUE can work with all of them. This is in contrast to most existing approaches, which deal with a single similarity measure. Another key feature of GLUE is that it uses multiple learning strategies, each of which exploits a different type of information either in the data instances or in the taxonomic structure of the ontologies. To further improve matching accuracy, we extend GLUE to incorporate commonsense knowledge and domain constraints into the matching process. For this purpose, we show that relaxation labeling, a well-known constraint optimization technique used in computer vision and other fields, can be adapted to work efficiently in our context. Our approach is thus distinguished in that it works with a variety of well-defined similarity notions and that it efficiently incorporates multiple types of knowledge. We describe a set of experiments on several real-world domains, and show that GLUE proposes highly accurate semantic mappings.
Keywords: machine learning, ontology mapping, relaxation labeling, semantic web
Vocabulary development for markup languages: a case study with maritime information BIBAKFull-Text 674-685
  Raphael Malyankar
This paper describes the process of constructing a markup language for maritime information from the starting point of ontology building. Ontology construction from source materials in the maritime information domain is outlined. The structure of the markup language is described in terms of XML schemas and DTDs. A prototype application that uses the markup language is also described.
Keywords: XML, maritime information, ontologies
A pragmatic application of the semantic web using SemTalk BIBAKFull-Text 686-692
  Christian Fillies; Gay Wood-Albrecht; Frauke Weichardt
The Semantic Web is a new layer of the Internet that enables semantic representation of the contents of existing web pages. Using common ontologies, human users sketch out the most important facts in models that act as intelligent whiteboards. Once models are broadcast to the Internet, new and intelligent search engines, "ambient" intelligent devices and agents would be able to exploit this knowledge network [1].
   The main idea of SemTalk is to empower end users to contribute to the Semantic Web by offering an easy-to-use MS Visio-based graphical editor to create RDF-like schemas and workflows. Since the modeled data is found by Microsoft's SmartTags, users can benefit from these Semantic Webs as part of their daily work with other Microsoft Office products such as Word, Excel or Outlook.
   SemTalk's graphically configurable meta model also extends the functionality of the Visio modeling tool because it makes it easy to configure Visio to different modeling worlds such as Business Engineering and CASE methodologies; these features can also be applied to any other Visio drawings.
   This paper presents two applied uses of this technology:
   Ontology Project: Department-wide information modeling at the Credit Suisse Bank. The main emphasis was on linguistic standardization of terms. Based on a common central glossary, local knowledge management teams were able to develop specialized models for their decentralized departments. As part of the knowledge management process, local glossaries were continually carried over into a common shared model.
   Business Process Management Project: Distributed process modeling of the Bausparkasse Deutscher Ring, a German financial institution. Several groups of students from the Technical University FH Brandenburg explored how to develop and apply an industry-specific Semantic Web to Business Process Modeling.
Keywords: business process modeling, glossary and ontologies, semantic web

Browsing

Visualizing web site comparisons BIBAKFull-Text 693-703
  Bing Liu; Kaidi Zhao; Lan Yi
The Web is increasingly becoming an important channel for conducting business, disseminating information, and communicating with people on a global scale. More and more companies, organizations, and individuals are publishing their information on the Web. With all this information publicly available, companies and individuals naturally want to find useful information in these Web pages. As an example, companies always want to know what their competitors are doing and what products and services they are offering. Knowing such information, the companies can learn from their competitors and/or design countermeasures to improve their own competitiveness. The ability to effectively find such business intelligence information is increasingly becoming crucial to the survival and growth of any company. Despite its importance, little work has been done in this area. In this paper, we propose a novel visualization technique to help the user find useful information on a competitor's Web site easily and quickly. It involves visualizing (with the help of a clustering system) the comparison of the user's Web site and the competitor's Web site to find similarities and differences between the sites. The visualization is such that with a single glance, the user is able to see the key similarities and differences of the two sites. He/she can then quickly focus on those interesting clusters and pages to browse the details. Experimental results and practical applications show that the technique is effective.
Keywords: browsing, business intelligence, user-interface, visualization, web site comparison
Web montage: a dynamic personalized start page BIBAKFull-Text 704-712
  Corin R. Anderson; Eric Horvitz
Despite the connotation of the words "browsing" and "surfing," web usage often follows routine patterns of access. However, few mechanisms exist to assist users with these routine tasks; bookmarks or portal sites must be maintained manually and are insensitive to the user's browsing context. To fill this void, we designed and implemented the montage system. A web montage is an ensemble of links and content fused into a single view. Such a coalesced view can be presented to the user whenever he or she opens the browser or returns to the start page. We pose a number of hypotheses about how users would interact with such a system, and test these hypotheses with a fielded user study. Our findings support some design decisions, such as using browsing context to tailor the montage, raise questions about others, and point the way toward future work.
Keywords: adaptive user interfaces, adaptive web sites, personalization, user modeling
Building voiceXML browsers with openVXI BIBAKFull-Text 713-717
  Brian Eberman; Jerry Carter; Darren Meyer; David Goddeau
OpenVXI is a portable, open-source toolkit that interprets the VoiceXML dialog markup language. It is designed to serve as a framework for system integrators and platform vendors who want to incorporate VoiceXML into their platform. A first version of the toolkit was released in the winter of 2001, with a second version released in September of 2001. A number of companies and individuals have adopted the toolkit for their platforms. In this paper we discuss the architecture of the toolkit, the architectural issues involved with implementing a framework for VoiceXML, performance results with OpenVXI, and future directions for the toolkit.
Keywords: openVXI, voiceXML

UI and Applications

A graphical user interface toolkit approach to thin-client computing BIBAKFull-Text 718-725
  Simon Lok; Steven K. Feiner; William M. Chiong; Yoav J. Hirsch
Network- and server-centric computing paradigms are quickly returning to being the dominant methods by which we use computers. Web applications are so prevalent that the role of a PC today has been largely reduced to a terminal for running a client or viewer such as a Web browser. Implementers of network-centric applications typically rely on the limited capabilities of HTML, employing proprietary "plug-ins" or transmitting the binary image of an entire application that will be executed on the client. Alternatively, implementers can develop without regard for remote use, requiring users who wish to run such applications on a remote server to rely on a system that creates a virtual frame buffer on the server, and transmits a copy of its raster image to the local client.
   We review some of the problems that these current approaches pose, and show how they can be solved by developing a distributed user interface toolkit. A distributed user interface toolkit applies techniques to the high level components of a toolkit that are similar to those used at a low level in the X Window System. As an example of this approach, we present RemoteJFC, a working distributed user interface toolkit that makes it possible to develop thin-client applications using a distributed version of the Java Foundation Classes.
Keywords: client-server systems, network computing, remote method invocation, user interface toolkit
Clustering for opportunistic communication BIBAKFull-Text 726-735
  Jay Budzik; Shannon Bradshaw; Xiaobin Fu; Kristian J. Hammond
We describe ongoing work on I2I, a system aimed at fostering opportunistic communication among users viewing or manipulating content on the Web and in productivity applications. Unlike previous work in which the URLs of Web resources are used to group users visiting the same resource, we present a more general framework for clustering work contexts to group users together that accounts for dynamic content and distributional properties of Web accesses, which can limit the utility of URL-based systems. In addition, we describe a method for scaffolding asynchronous communication in the context of an ongoing task that takes into account the ephemeral nature of the location of content on the Web. The techniques we describe also cover local files in progress, in addition to publicly available Web content. We present the results of several evaluations which indicate that systems using the techniques we employ may be more useful than systems that are strictly URL-based.
Keywords: agents, awareness, clustering, collaboration, context, critical mass, opportunistic communication
Featuring web communities based on word co-occurrence structure of communications BIBAKFull-Text 736-742
  Yukio Ohsawa; Hirotaka Soma; Yutaka Matsuo; Naohiro Matsumura; Masaki Usui
Textual communication in message boards is analyzed for classifying Web communities. We present a communication-content based generalization of an existing business-oriented classification of Web communities, using KeyGraph, a method for visualizing the co-occurrence relations between words and word clusters in text. Here, the text in a message board is analyzed with KeyGraph, and the structure obtained is shown to reflect the essence of the content-flow. The relation of this content-flow with participants' interests is then formalized. Three structure-features of relations between participants and words, which determine the type of the community, are shown to be computable and visualizable: (1) centralization, (2) context coherence, and (3) creative decisions. This helps in surveying the essence of a community, e.g. whether the community creates useful knowledge, how easy it is to join the community, and whether/why the community is good for commercial advertising.
Keywords: context, creativity, text mining, web community