- Web data management (crawling, scalable indexing, link analysis, mining, and query processing over large heterogeneous Web collections)
- Database systems (query processing and optimization, database and IR systems integration)
- Information retrieval systems (Web-scalable index structures, distributed indexing techniques)
- Digital libraries (interoperability, search protocols)
The Web is quickly becoming the default one-stop source for satisfying every possible information need - from obtaining the location of the nearest grocery store to gathering data for market research, conducting sociological studies, or identifying terrorist organizations. As the Web continues to grow in size and diversity, sophisticated tools and systems for searching, analyzing, and mining Web content are becoming increasingly important. A key prerequisite for the successful deployment of such "information/knowledge discovery" applications is a Web Repository: a system to gather, organize, and provide access to large heterogeneous collections of Web pages.
This thesis addresses several key issues in the architecture, design, and implementation of Web repositories. The underlying theme of this thesis is that the unique combination of features that characterize the Web - uncontrolled and distributed data creation, massive size, rapid growth, hyperlinked connections, and heterogeneity in content and structure - poses special challenges to traditional data management problems. This thesis explores some of these problems, specifically data extraction, storage organization, indexing, query modeling, and query optimization, in the context of Web repositories. The algorithms and techniques developed here have been implemented as part of the Stanford WebBase testbed. Some of the key contributions of this thesis are:
- Data Extraction and Storage Management
[WWW9] [VLDB2001]
- A model of a crawler for extracting content from text databases in the "hidden/deep" Web, and Hidden Web Extractor (HiWE), a prototype crawler based on this model
- Layout-based Information ExTraction (LITE) - an information extraction technique (used in HiWE) based on the idea of treating a Web page as a two-dimensional rendered object rather than as a uni-dimensional stream of characters
- A scalable distributed storage manager for staging, integrating, and disseminating large Web data sets and a comprehensive performance study of several alternate designs for distributing and organizing Web pages
- Indexing
[WWW10] [TOIS 2001]
[ICDE 2003]
- A software pipelining technique (to exploit parallelism) and an embedded database-backed storage format for building inverted indexes for text-based retrieval over Web-scale collections
- S-Node, a compact two-level graph representation that exploits certain empirically observed properties of Web graphs to achieve significant compression and high retrieval efficiency when executing complex queries
- Query Model, Execution, and Optimization
[VLDB 2003] [Data Engg. Bulletin 2001]
- A model of Web repositories based on a simple relational schema and the notion of Web relations
- An algebraic query language for expressing complex Web queries featuring (i) well-defined semantics when combining navigation, relational, and text-search operators, and (ii) support for user-defined ranking and ordering functions
- A cluster-based optimization technique to efficiently execute complex Web queries over large repositories
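For readers unfamiliar with the inverted indexes mentioned under Indexing above, the following is a minimal in-memory sketch of the basic data structure only. It is not the pipelined, database-backed technique developed in the thesis; the function names (`build_inverted_index`, `search_all`), the tokenizer, and the sample pages are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to a sorted list of (page_id, term_frequency) postings."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term][page_id] = index[term].get(page_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def search_all(index, query):
    """Return the ids of pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = None
    for term in terms:
        ids = {page_id for page_id, _ in index.get(term, [])}
        result = ids if result is None else result & ids
    return sorted(result)


# Hypothetical sample pages, keyed by page id.
pages = {
    1: "Web crawling and indexing",
    2: "Indexing the hidden Web",
    3: "Query optimization",
}
index = build_inverted_index(pages)
print(search_all(index, "web indexing"))  # prints [1, 2]
```

At Web scale the postings would not fit in memory, which is why index construction there has to deal with parallelism and on-disk storage formats rather than a single dictionary as in this sketch.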