Activity Report, July - December 2002
- Highlights
- Awards and Honors
- Publications and Conferences
- Patents
- Invited Talks
- Presentations and Seminars
- Committee Work
- Editorships
- External Positions
- Miscellaneous
- Personnel
- All activity reports
Searching the Workplace Web
The social impact from the World Wide Web cannot be underestimated, but technologies used to build the Web are also revolutionizing the sharing of business and government information within intranets. In many ways the lessons learned from the Internet carry over directly to intranets, but others do not apply. In particular, the social forces that guide the development of intranets are quite different, and the determination of a "good answer" for intranet search is quite different than on the Internet.
Department members have proposed a novel approach to the problem of intranet search, based on the use of rank aggregation. The approach allows the search-engine builder to examine the effects of different heuristics on ranking of web search results; this leads to an architecture that can be easily customized for deployment in a variety of scenarios.
The paper also reports experiments with several ranking functions, conducted on the IBM intranet (comprising about 50 million unique URLs and about 20 million non-unique documents).It was accepted to the 2003 International World-Wide Web Conference (WWW 2003).
Two Applications of Information Complexity
Last year, researchers in our department had invented a methodology for lower bounds in communication complexity based on information complexity. Using this notion, T.S. Jayram, Ravi Kumar, and D. Sivakumar have obtained the following new lower bounds, which provide a sharp contrast between randomized and nondeterministic complexity.
- A linear lower bound on the randomized communication complexity (with two-sided error) for the tribes function introduced by Ben-Or and Linial in 1990. This settles an open problem posed by Beame and Lawry (1992) and by Kushilevitz and Nisan (1997).
- A lower bound of (7/3)h on the randomized decision tree complexity (with two-sided error) for the recursive majority function of height h introduced by Boppana in 1986. This improves all known lower bounds for this problem.
A paper will appear in the 2003 Symposium on Theory of Computing (STOC 2003).
Efficient Similarity Search and Classification via Rank Aggregation
The problems of similarity search and classification are ubiquitous in various fields - information retrieval, pattern recognition, etc. Ron Fagin, Ravi Kumar, and D. Sivakumar have proposed a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. The methodology leads both to techniques for computing approximate nearest neighbors and a conceptually rich alternative to nearest neighbors.
One instantiation of the methodology yields a (1+e)-factor approximation to the Euclidean nearest neighbor problem, and also turns out to be extremely efficient in practice, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database friendly in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage.
A paper will appear in the 2003 ACM SIGMOD Conference.
Data Exchange: Semantics and Query Answering
Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data. Since the source and target schemas each have their own constraints (such as keys), there may not exist a solution, that is, a target database that is consistent with the specifications. Or, instead of no solutions, there may be many solutions. Ron Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa gave an algebraic specification that selects, among all possible solutions for a given source instance, a special class of solutions that they call "universal". They showed that a universal solution has "good" properties that justify its choice for the semantics of the data exchange problem. They identified fairly general, yet practical, sufficient conditions that guarantee the existence of a universal solution. They gave a polynomial-time algorithm that determines whether a solution exists and, if so, produces a universal solution. This paper was accepted to the 2003 International Conference on Database Theory (ICDT 2003).
Lower Bounds for the Partial Match Problem
Given a database of 0-1 vectors, the partial match problem is: in response to a query consisting of 0, 1, and wildcard '?', find a database point such that the query matches the database point, where matching is interpreted in the usual regular expression sense. This is a classical problem and many prominent researchers have worked on it: Rivest (1974),... ,Borodin, Ostrovsky, and Rabani (1998), Charikar, Indyk, and Panigrahy (2002). T.S. Jayram, Subhash Khot, Ravi Kumar, and Yuval Rabani show a strong data structure lower bound for this problem, improving the existing lower bounds by an exponential amount. Their lower bound suggests that no amount of preprocessing the database can lead to a compact data structure that can answer queries efficiently. Their lower bound also leads to new and improved lower bounds for many versions of the equally well-known nearest-neighbor problem. The paper was accepted to the 2003 Symposium on Theory of Computing (STOC 2003).
Sampling Lower Bounds via Information Theory
Ziv Bar-Yossef presents a novel technique, based on the Jensen-Shannon divergence from information theory, to prove lower bounds on the query complexity of sampling algorithms that approximate functions over arbitrary domain and range. His technique deviates from the traditional paradigm of proving lower bounds for approximations that uses a reduction from a decision "gap" problem. As a result, he is able to obtain much stronger bounds than previous methods for functions f that possess a large set of inputs x_1,...,x_m, for which there is a "gap" between f(x_i) and f(x_j) for all i different from j. He demonstrates his technique with new query complexity lower bounds for three fundamental problems: a quadratic improvement over previous bounds for the frequency distribution estimation problem; first lower bounds for low rank matrix approximation, showing that the algorithms given for this problem are optimal up to polynomial factors; and lower bounds for the matrix reconstruction problem. The paper was accepted to the 2003 Symposium on Theory of Computing (STOC 2003).
Incremental Computation of PageRank
As the web grows, its link structure, along with content, evolves rapidly. Consequently, large-scale static hyperlink-based ranking computations become too expensive to be performed frequently. Steve Chien, Cynthia Dwork, Ravi Kumar, and D. Sivakumar present a very efficient algorithm to incrementally compute good approximations to PageRank, as links evolve. Preliminary experiments reveal that this algorithm is both fast and yields excellent approximations to PageRank, even in light of large amounts of changes to the link structure. Their algorithm derives intuition and partial justification from a rigorous monotonicity analysis of Markov chains, where they prove two monotonicity theorems - adding a link always causes the PageRank (both the value and the ordinal position) of the target to only increase. The paper was presented at the Workshop on Algorithms and Models for the Web-Graph (WAW 2002).
A Fair, Truthful Mechanism for Bandwidth Allocation
In the bandwidth allocation problem, there is a fixed network with real-valued capacities on the edges. There are n users, and each user holds a path between some source node and some sink node. In addition, each user has a utility function that is a measure of the inherent value the user attains as a result of being allocated various values of bandwidth along her path. For the bandwidth allocation problem, D. Sivakumar and S. Gollapudi (SUNY/Buffalo and Oracle Corp.) have invented a mechanism that is provably fair and truthful. The fairness property informally means that the mechanism ensures that the allocation of no user can be increased without decreasing the "welfare" of any other user who is more "socially responsible". A highlight of this work is the fact that fairness and truthfulness are simultaneously addressed. The new notion of fairness, which extends the classical max-min fairness, appears to be well-suited for the class of allocation problems with resource and welfare constraints.
Hierarchical Opinion Pools: OLAP for Unstructured Data
OLAP reporting is a standard tool in many organizations to study business intelligence (BI) from structured data. Unstructured data on the Internet and intranets is believed to provide valuable information for BI. However, unlike structured data which consists primarily of facts, this information consists mainly of opinions which are inherently uncertain. T.S. Jayram, Shivakumar Vaithyanathan, and Huaiyu Zhu propose a probabilistic framework for BI reporting from text data. Information extracted from text can be broadly classified into two categories, facts and opinions. Opinions are akin to measures in conventional OLAP, while facts are differentiated as extrinsic (meta data associated with the text) and intrinsic (the dimensions about which opinions are expressed). Hierarchies over both types of facts can be defined. Opinions can be reported at different levels of aggregation by casting it as a problem of opinion pooling. Several approaches to achieve this are described.
Set Systems Used for Broadcast Encryption
An exclusive set system is a family of subsets of a universe with the property that every large subset of the universe may be written as the union of subsets from the family. Such set systems form the combinatorial foundation of many broadcast encryption schemes. Ravi Kumar and Alexander Russel obtain new existential upper bounds on the size of such families. This improves the bounds given by standard random constructions. The paper was presented at the 2003 Symposium on Discrete Algorithms (SODA 2003).
Communities and Reputation on the Web
Reputation measures the importance of a page to a particular community, as derived from the local link structure around the page. David Gibson, in his Ph.D. dissertation at UC Berkeley, shows that hub-authority algorithms provide a robust measure of reputation, and can be adapted to deal with variations in link density which normally cause a topic drift problem.
Awards and Honors
M. Ajtai, R. Fagin, and L. Stockmeyer won IBM Outstanding Innovation Awards for Differential Backup in Tivoli Storage Manager.
The paper "Rank aggregation methods for the Web" by C. Dwork (Microsoft), R. Kumar, M. Naor (Weizmann, K53 visitor 1999-2001) and D. Sivakumar received an IBM Research Division Best Paper award in Computer Science for the year 2001.
The paper "Optimal aggregation algorithms in the middleware" by R. Fagin, A. Lotem (Maryland), and M. Naor (Weizmann, K53 visitor 1999-2001) received an IBM Research Division Best Paper award in Computer Science for the year 2001.
D. de Farias won the INFORMS George B. Dantzig Dissertation Award.
R. Fagin won an Invention Achievement Award, 2nd plateau, Dec. 2002.
K. McCurley, S. Rajagopalan, and A. Tomkins received IBM Outstanding Technical Achievement Awards for WebFountain.
Publications and Conferences
M. Ajtai, Random lattices and a conjectured 0-1 law about their polynomial time computable properties, Proc. 43rd IEEE Symposium on Foundations of Computer Science, November 2002, pp. 733-742.
P. Andritsos, R. Fagin, A. Fuxman, L. Haas, M. Hernandez, H. Ho, A. Kementsietsidis, R. J. Miller, F. Nauman, L. Popa, Y. Velegrakis, C. Vilarem, and L. Yan, Schema management, IEEE Data Engineering Bulletin, No. 25, Vol. 3, September 2002, pp. 33-39.
Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar, An information statistics approach to data stream and communication complexity, Proc. 43rd IEEE Symposium on Foundations of Computer Science, November 2002, pp. 209-218.
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan, Counting distinct elements in a data stream,Proc. 6th International Workshop on Randomization and Approximation Techniques, September 2002, pp. 1-10.
M. Charikar, R. Fagin, V. Guruswami, J. Kleinberg, P. Raghavan, and A. Sahai, Query strategies for priced information, Journal of Computer and System Sciences, Vol. 64, No. 4, June 2002, pp. 785-819. (Special issue for selected papers from the 2000 ACM Symposium on Theory of Computing).
S. Chien, C. Dwork, R. Kumar, D. Simon, and D. Sivakumar, Link evolution: Analysis and algorithms, Proc. Workshop on Algorithms and Models for the Web-Graph, November 2002.
S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins, Self-similarity in the web, ACM Transactions on Internet Technology, Vol.2, No. 3, August 2002, pp. 205--223.
M. Hernandez, R. J. Miller, L. Popa, Y. Velegrakis, and R. Fagin, Translating web data, Proc. 28th International Conference on Very Large Data Bases, August 2002, pp. 598-609.
K. Jain, I. Mandoiu, V. Vazirani, and D. P. Williamson, A primal-dual schema based approximation algorithm for the element connectivity problem, Journal of Algorithms, Vol. 45, October 2002, pp. 1-15.
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, The Web and social networks, IEEE Computer, Vol.35, No. 11, November 2002, pp. 32-36.
Patents
D. P. Williamson, B. M. Schieber, G. M. Preston, G. B. Sorkin, M. O. Carey Iii, and M. A. Fus, Method and Apparatus for Scheduling Mobile Agents Utilizing Rapid Two-Way Communication, Issued as Patent 6484036 in US, 11/19/2002.
- D. de Farias
-
"The Linear Programming approach to approximate dynamic programming",
Department of Industrial and Operations Engineering, University of Michigan at Ann Arbor, Ann Arbor, MI, September 2002.
- R. Fagin
-
"Optimal aggregation algorithms for middleware",
Distinguished Lecture, University of Toronto, Toronto, Canada, October 2002.
- D. Williamson
-
"Finding provably near-optimal solutions to discrete optimization problems",
SIAM 50th Anniversary Meeting, Philadelphia, PA, July 2002 (invited semi-plenary talk.
SIAM Conference on Discrete Mathematics, San Diego, CA, August 2002 (invited plenary talk).
- Z. Bar-Yossef
-
"An information statistics approach to data stream and communication complexity",
Cryptography and Complexity Seminar, Weizmann Institute of Science, Rehovot, Israel, September 2002.
- D. Gibson
-
"Communities and Reputation on the Web",
Dissertation talk, UC Berkeley, Berkeley, CA, December 2002.
- T. S. Jayram
-
"An information statistics approach to data stream and communication complexity",
University of Washington, Seattle, WA, November 2002.
- D. Williamson
-
"An application of complex semidefinite programming to approximation algorithms",
Semidefinite Programming and Applications, Mathematical Sciences Research Institute, Berkeley, CA, October 2002.
- D. de Farias Session Chair, Approximate Dynamic Programming I, II, III and IV, INFORMS Annual Meeting, 2002.
- R. Fagin Program committee, 9th International Conference on Database Theory, 2003.
- S. Rajagopalan Program committee, 12th International World Wide Web Conference, 2003.
- D. Williamson
-
Organizer, Bay Area Theory Symposium (BATS).
Member, INFORMS Lanchester Prize Committee.
- R. Fagin
Chicago Journal of Theoretical Computer Science
Journal of Computer and System Sciences
Methods of Logic in Computer Science - N. Megiddo
Area editor, Journal of the
ACM
Area editor, Mathematics of Operations Research
Area editor, Operations Research Letters
Editorial Board Member, Algorithmica
Editorial Board Member, Discrete Applied Mathematics
Editorial Board Member, Games and Economic Behavior
Editorial Board Member, Journal of Combinatorial Optimization
Associate Editor, Mathematical Programming - D. Williamson
Associate Editor, Journal of Algorithms
Associate Editor, Mathematics of Operations Research
Associate Editor, SIAM Journal on Computing
Associate Editor, SIAM Journal on Discrete Mathematics
Editorial Board member, MPS/SIAM Book Series on Optimization
- M. Ajtai Member of the Hungarian Academy of Sciences
- N. Megiddo Consulting (Full) Professor Department of Management Science and Engineering, Stanford University Member, The First Council of the Game Theory Society Member of the Scientific Board, The Institute for Medical BioMathematics, Israel
- D. Williamson Member-at-large, Mathematical Programming Society Council
Miscellaneous
D. Gibson completed a Ph.D. dissertation at UC Berkeley, whose title is: "Communities and Reputation on the Web".
D. Sivakumar Taught a class on computational complexity theory at Stanford University during Fall 2002.
Ziv Bar-Yossef joined the
department, July 2002. He is a recent graduate of UC Berkeley.
Daniela de Farias joined the department as a postdoc, September 2002. She is a recent graduate of Stanford University and is about to join MIT next fall.
Jim Hafner is on a two-year leave to work with the Storage Attachment project
at ARC.
Sridhar Rajagopalan is on leave to work with the Trevi project in the Text Search for Content Management group at ARC.
Larry Stockmeyer became RSM emeritus, November 2002.
Organization
David Williamson (Manager) CS Principles and Methodologies
Daniela de Farias (Postdoc)
Sally Floyd (Administrative Assistant)
Jim Hafner (On Leave)
Sridhar Rajagopalan (On Leave)
Larry Stockmeyer (Emeritus)
Andrew Tomkins (Manager) Information Management Principles
Ziv Bar-Yossef
David Gibson
Ravi Kumar
Kevin McCurley
Ron Fagin (Manager) Foundations of Computer Science
Miki Ajtai
T. S. Jayram
Nimrod Megiddo
D. Sivakumar

