Sriram Raghavan's Research
Research Interests
Ph.D. Thesis Research: Architecting a Web repository

The Web is quickly becoming the default one-stop source for satisfying every possible information need - from obtaining the location of the nearest grocery store to gathering data for market research, conducting sociological studies, or identifying terrorist organizations. As the Web continues to grow in size and diversity, sophisticated tools and systems for searching, analyzing, and mining Web content are becoming increasingly important. A key prerequisite for the successful deployment of such "information/knowledge discovery" applications is a Web Repository: a system to gather, organize, and provide access to large heterogeneous collections of Web pages.

This thesis addresses several key issues in the architecture, design, and implementation of Web repositories. The underlying theme of this thesis is that the unique combination of features that characterize the Web - uncontrolled and distributed data creation, massive size, rapid growth, hyperlinked connections, and heterogeneity in content and structure - poses special challenges for traditional data management problems. This thesis explores some of these problems, specifically data extraction, storage organization, indexing, query model, and query optimization, in the context of Web repositories. The algorithms and techniques developed here have been implemented as part of the Stanford WebBase testbed. Some of the key contributions of this thesis are: