Skip to main content

Activity Report

Activity Report, January - June 2005


Highlights

The predictive power of online chatter
An increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. The migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of the ebb and flow of chatter around these topics. Based on an analysis of around half a million sales rank values for 2,340 books over a period of four months, and correlating postings in blogs, media, and web pages, Dan Gruhl, R. Guha, Ravi Kumar, Jasmine Novak and Andrew Tomkins were able to draw several interesting conclusions. First, carefully hand-crafted queries produce matching postings whose volume predicts sales ranks. Second, these queries can be automatically generated in many cases. And third, even though sales rank motion might be difficult to predict in general, algorithmic predictors can use online postings to successfully predict spikes in sales rank. A paper describing this system was accepted to the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005).

Efficient Implementation of Large-Scale Multi-Structural Databases
Earlier work defined ``multi-structural databases," a data model to support efficient analysis of large, complex data sets over multiple numerical and hierarchical dimensions. Three types of queries over this data model were defined, each of which required solving an optimization problem. An example is to find the ten most significant non-overlapping regions of geography crossed with time in which coverage of the Olympics was much stronger in newspapers than online sources. Ron Fagin, Phokion Kolaitis, Ravi Kumar, Jasmine Novak, D. Sivakumar, and Andrew Tomkins present a general query framework capturing the previous three queries as part of a much broader family. They then give efficient algorithms for particular subclasses of this family. Finally, they describe an implementation of these algorithms that operates on a collection of several billion web documents. Using the algorithms in conjunction with random sampling techniques, the system can solve these queries in real time. A paper describing this system was accepted to the Proceedings of the 31st International Conference on Very Large Databases (VLDB 2005).

Discovering Large Dense Subgraphs in Massive Graphs
David Gibson, Ravi Kumar and Andrew Tomkins designed an algorithm that is based on a recursive application of fingerprinting via shingles, and is extremely efficient, capable of handling graphs with tens of billions of edges on a single machine with modest resources. They apply their algorithm to characterize the large, dense subgraphs of a graph showing connections between hosts on the World Wide Web. Upon examination, many of the dense subgraphs output by the algorithm are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content. They therefore propose dense subgraph extraction as a useful primitive for spam detection. A paper describing this system was accepted to the Proceedings of the 31st International Conference on Very Large Databases (VLDB 2005).

The volume and evolution of web page templates
Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation, and branding. David Gibson, Andrew Tokmkins and Kunal Punera study the nature, evolution, and prevalence of these templates on the web. As part of this work, they develop new randomized algorithms for template extraction that perform approximately twenty times faster than existing approaches with similar quality. Their results show that 40-50% of the content on the web is template content. Templates are growing as a fraction of total content at a rate of between 6 and 8% per year. A paper describing this system was accepted to the Proceedings of the 14th International World Wide Web Conference (WWW2005).

Peer Data Exchange
A. Fuxman, P. Kolaitis, R.J. Miller, and W.-C. Tan introduced and studied a framework, called peer data exchange, for sharing and exchanging data between peers. This framework is a special case of a full-fledged peer data management system and a generalization of data exchange between a source schema and a target schema. The motivation behind peer data exchange is to model authority relationships between peers, where a source peer may contribute data to a target peer, specified using source-to-target constraints, and a target peer may use target-to-source constraints to restrict the data it is willing to receive, but cannot modify the data of the source peer. Results about several key algorithmic and complexity-theoretic issues in peer data exchange were presented in a paper on peer data exchange that appeared in the proceedings of the 2005 ACM Symposium on Principles of Database Systems (PODS 2005); the full version of this paper was subsequently invited to the ACM Transactions on Database Systems.

Publications and Conferences

M.Ajtai, Representing Hard Lattices with O(n log n) Bits., Proc. of the 37th ACM Symposium on Theory of Computing (STOC 2005), pp 94-103, Baltimore, May 2005.

Z. Bar-Yossef, T. Kanungo, R. Krauthgamer, Focused Sampling: Computing Topical Web Statistics, IBM Research Report RJ 10339, February 2005.

S. Chawla, R. Krauthgamer, R. Kumar, Y. Rabani, and D. Sivakumar, On the hardness of approximating multicut and sparsest-cut, Proc. of 20th Annual IEEE Conference on Computational Complexity (CCC 2005),, pp. 144-153, June 2005.

R. Fagin, R. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins, Multi-structural databases. Proc. 24th ACM Symposium on Principles of Database Systems (PODS 2005), pp. 184-195, Baltimore, June 2005.

R. Fagin, P. Kolaitis, R. J. Miller, and L. Popa, Data exchange: semantics and query answering. Theoretical Computer Science 336 (2005), pp. 89-124. (Special issue for selected papers from ICDT 2003).

R. Fagin, P. Kolaitis, L. Popa: Data exchange: getting to the core. ACM Trans. Database Syst. 30(1): 174-210 (2005) (Invited paper from PODS 2003)

A. Fuxman, P. Kolaitis, R. J. Miller, W. C. Tan: Peer Data Exchange. Proc. of the 24th Annual ACM Symposium on Principles of Database Systems (PODS 2005), pp. 160-171, June 2005.

D. Gibson, R. Kumar, K. McCurley and A. Tomkins, Dense Subgraph Extraction, in Graph Data Mining, L. Holder (ed), Wiley, 2005.

D. Gibson, K. Punera, and A. Tomkins, The volume and evolution of web page templates", Proc. of the 14th International World Wide Web Conference (WWW2005), pp. 830-839, May 2005.

P. Kolaitis: Schema Mappings, Data Exchange, and Metadata Management, Proc. of the 24th Annual ACM Symposium on Principles of Database Systems (PODS 2005), pp. 61-75, June 2005 (Invited paper).

Patents

N. Megiddo and X. Zhu, System and Method for Improving the Effectiveness of Web Advertising, Issued as Patent ZL01132594.1 in China, 1/5/2005, as Patent 6892181 in US, 5/10/2005, and as Patent 102634 in Singapore, 5/31/2005.

N. Megiddo, System and Method for Profiling Access to Disk Drive Commands Based on Dual Servo Mode Model, Issued as Patent 6898665 in US, 5/24/2005.

Invited Talks

  • P. Kolaitis
    "Schema Mappings, Data Exchange, and Metadata Management", Invited Talk, 24th Annual ACM Symposium on Principles of Database Systems (PODS 2005).

Presentations and Seminars

  • R. Fagin
    "Finite model theory - a personal perspective", Berkeley logic colloquium, March 2005.
  • P. Kolaitis
    "On Preservation under Homomorphisms in the Finite", UCLA Logic Colloquium, May 2005.
    "Schema Mappings and Data Exchange", Workshop on Data Integration, IBM Rome, April 2005.
  • R. Krauthgamer
    "A Metric Notion of Dimension and its Algorithmic Applications", UCLA theory seminar, April 2005.

Committee Work

  • R. Fagin
    Program committee chair, ACM Symposium on Theory of Computing (STOC 2005)
  • R. Kumar
    Program committee member, ACM Symposium on Theory of Computing (STOC 2005) Organizing committee, IEEE Conference on Computational Complexity (CCC 2005) Program committee member, Workshop on Knowledge Discovery in the Web (Webkdd 2005)
  • N. Meggido
    Program Committee Co-chair, 1st International Conference on Algorithmic Applications in Management (AAIM 2005), Xian, China, June 22-25, 2005.

Editorships


External Positions


Personnel

Short-Term Visitors

  • Daniela Pucci de Farias (Jan 18-Jan 24, 2005)
  • Albert Atserias (Jan 31-Feb 11, 2005)
  • Yuval Rabani (Feb 8-Feb 23, 2005)
  • Eli Ben-Sasson (Apr 18-May 5, 2005)


Research Interests

  • M. Ajtai: Complexity theory, combinatorics, and logic, and the connections between them
  • R. Fagin: Applications of logic to computer science, reasoning about knowledge, database theory, finite-model theory
  • D. Gibson: Hypermedia analysis and algorithms, indexing, web search, visualization
  • T. S. Jayram: Computational complexity, approximation/online algorithms, optimal control
  • P. Kolaitis: Logic in computer science, database theory, computational complexity
  • R. Krauthgamer: Analysis and design of algorithms, embeddings of metric spaces, computational complexity
  • N. Megiddo: Mathematical programming, algorithms and complexity, game theory
  • S. Rajagopalan: Algorithm analysis and design, learning, information and coding theory, randomization in computing
  • E. Vee: Complexity theory, database theory, optimization


Organization

  • Phokion Kolaitis (Senior Manager) CS Principles and Methodologies
  • René Henthorn (Administrative Assistant)
  • David Gibson
  • Erik Vee
  • Miki Ajtai
  • Nimrod Megiddo
  • Robert Krauthgamer
  • Ron Fagin (Manager) Foundations of Computer Science
  • Sridhar Rajagopalan (On Leave at the Content & Knowledge Management Department)
  • T. S. Jayram


Transitions (in chronological order)

  • Sridhar Rajagopalan is on leave to work in the Content and Knowledge Management Department at ARC.
  • Cheryl Rothrock left the department in January 2005, transferring to Almaden Services Research as an administrative assistant to Jim Spohrer and Paul Maglio.
  • René Henthorn became the department's administrative assistant in March 2005.
  • R. Guha moved to Google in April 2005.
  • Ziv Bar-Yossef left IBM and stayed at the Technion, Israel, following the end of his leave of absence in June 2005.