Another focus of our group is in the area of emerging database and data service technologies for the Internet. In this area, our work on rank aggregation methods for the Web was selected as the best paper in the area of Searching, Querying, and Indexing at the 2001 World Wide Web Conference (WWW10). We have also made contributions to foundational questions on sampling algorithms for massive data sets, and to obtaining better characterizations of the Web for scalable data services.
Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade that tells how red it is, and a shape grade that tells how round it is. For each attribute there is a sorted list, which lists every object and its grade under that attribute, sorted by grade (highest grade first). There is some monotone aggregation function, or combining rule, such as min or average, that combines the individual grades into an overall grade. To determine the top k objects (those with the best overall grades), the naive algorithm must access every object in the database to find its grade under each attribute. We analyze an elegant and remarkably simple algorithm that is essentially optimal, not just in the traditional worst-case sense, but over every database. This paper won the Best Paper Award at the 2001 ACM Symposium on Principles of Database Systems (PODS 01).
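To make the setup concrete, here is a minimal Python sketch of a threshold-style top-k procedure in this model. The function name, data layout, and stopping test are illustrative assumptions rather than the paper's exact definitions: each round performs one sorted access per list, resolves newly seen objects by random access, and stops once k objects score at least the threshold obtained by aggregating the grades last seen under sorted access.

```python
from heapq import nlargest

def threshold_algorithm(sorted_lists, agg, k):
    """Illustrative threshold-style top-k search (a sketch, not
    the paper's exact algorithm).

    sorted_lists: one list per attribute, each a list of
                  (object, grade) pairs sorted by grade, highest
                  first; every object is assumed to appear in
                  every list.
    agg:          a monotone aggregation function on a list of
                  grades, e.g. min.
    k:            number of top objects to return.
    """
    # Random-access index per attribute: object -> grade.
    grade_of = [dict(lst) for lst in sorted_lists]
    seen = {}  # object -> overall grade
    for depth in range(len(sorted_lists[0])):
        last = []  # grades seen at this depth under sorted access
        for lst in sorted_lists:
            obj, g = lst[depth]
            last.append(g)
            if obj not in seen:
                # Random access: fetch this object's grade in
                # every attribute list and combine.
                seen[obj] = agg([d[obj] for d in grade_of])
        # By monotonicity, no unseen object can have an overall
        # grade above the aggregate of the last grades seen.
        threshold = agg(last)
        top = nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) >= k and top[-1][1] >= threshold:
            return top
    return nlargest(k, seen.items(), key=lambda kv: kv[1])
```

With agg = min and a few attribute lists, the procedure typically halts after scanning only a prefix of each sorted list, which is the source of its advantage over the naive algorithm.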
The rank aggregation problem is to combine many different rank orderings of the same set of candidates, or alternatives, in order to obtain a better ("consensus") ordering. Members of our group have studied this problem in the context of WWW search applications: combating "spam", meta-search, and multi-word/multi-criteria searches. Our results include NP-hardness results for certain optimal aggregation methods, efficiently computable relaxations of such aggregations that nevertheless have excellent properties for web applications, and several aggregation heuristics based on Markov chains. This paper appeared in the 2001 World Wide Web Conference (WWW10), and was selected as the best paper in the area of Searching, Querying, and Indexing.
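To illustrate the Markov-chain flavor of heuristics, the sketch below builds a chain whose states are the candidates and whose transitions favor candidates preferred by a majority of the input orderings; the candidates are then ranked by stationary probability. The function name and the exact transition rule are simplifying assumptions for exposition, not the paper's precise definitions.

```python
def mc_aggregate(rankings, rounds=100):
    """Illustrative Markov-chain rank aggregation (a sketch).

    rankings: full rank orderings (lists of candidates, best
              first) over the same candidate set.
    Returns the candidates ordered by the stationary probability
    of a chain that moves from candidate i to a uniformly chosen
    candidate j whenever a majority of the orderings prefer j.
    """
    cands = rankings[0]
    n = len(cands)
    # Position of each candidate in each ordering (0 = best).
    pos = [{c: r.index(c) for c in cands} for r in rankings]
    P = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(cands):
        for b, j in enumerate(cands):
            if a != b and sum(p[j] < p[i] for p in pos) > len(rankings) / 2:
                P[a][b] = 1.0 / n  # move toward majority-preferred j
        P[a][a] = 1.0 - sum(P[a])  # otherwise stay put
    # Approximate the stationary distribution by power iteration.
    pi = [1.0 / n] * n
    for _ in range(rounds):
        pi = [sum(pi[a] * P[a][b] for a in range(n)) for b in range(n)]
    return [c for _, c in sorted(zip(pi, cands), reverse=True)]
```

The appeal of such heuristics for web applications is that they need only pairwise majority comparisons, which degrade gracefully when the input orderings are partial or noisy.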
Probabilistic sampling is a very useful algorithmic paradigm in the context of computing with massive data sets. We have developed a framework to study probabilistic sampling algorithms that approximate general functions from D^n to R, where D and R are arbitrary sets. Our goal is to understand the fundamental limitations of such algorithms. We have invented two techniques to prove lower bounds on the number of samples needed to compute several basic quantities, including various statistical and frequency moments. We also propose in this work a computational definition of the notion of lossy data compression, and point out connections between sampling algorithms, streaming algorithms, and lossy compression schemes. This paper will appear in the 2001 ACM Symposium on Theory of Computing (STOC 01).
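For a taste of the model, the sketch below estimates a simple statistical quantity, the average of n values in [0, 1], using a number of samples that depends only on the accuracy parameters eps and delta and not on n; the sample count comes from the standard Hoeffding bound. The function name and interface are illustrative assumptions, not from the paper.

```python
import math
import random

def sample_average(data, eps, delta, rng=random):
    """Illustrative sampling estimator for the average.

    Estimates the average of `data` (values in [0, 1]) to within
    additive error eps, with probability at least 1 - delta, from
    uniform random samples. Hoeffding's inequality gives the
    required sample count, independent of len(data):
        t >= ln(2/delta) / (2 * eps**2).
    """
    t = math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))
    return sum(data[rng.randrange(len(data))] for _ in range(t)) / t
```

The lower-bound techniques in this work address the converse question: how many samples any such algorithm must draw, for the average and for harder statistical and frequency moments.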
Algorithmic tools for searching and mining the web are becoming increasingly sophisticated and vital. In this context, algorithms that exploit structural information about the web outperform generic methods in both efficiency and reliability. Building on much earlier work by group members, we provide a characterization showing that the web emerges as the outcome of a number of essentially independent stochastic processes that evolve at various scales. A striking consequence of this scale invariance is that the structure of the web is "fractal": cohesive sub-regions display the same characteristics as the web at large. An understanding of this underlying fractal nature is therefore applicable to designing data services across multiple domains and scales. We describe potential applications of this line of research to optimized algorithm design for web-scale data analysis. This paper will appear in the 2001 International Conference on Very Large Data Bases (VLDB 01).