
Database Principles

Overview

There is a long history of research at IBM on the theory of databases. The relational model was invented, developed, and studied extensively at the IBM San Jose Research Lab, which evolved into IBM Research - Almaden. In recent years, one focus area of our database research in the Theory Group has been data exchange, the process of converting data from one representation to another. The source representation might be relational, and the target might be another relational representation or XML. We have also done recent work in other areas of database theory, including querying, rank aggregation, and privacy.

Papers on Data Exchange

Data Exchange: Semantics and Query Answering. Since the source and target schemas in data exchange each have their own constraints (such as keys), there may not exist a solution, that is, a target database that is consistent with the specifications. Alternatively, there may be many solutions. Ron Fagin, Phokion Kolaitis, Renée Miller (U. Toronto), and Lucian Popa gave an algebraic specification that selects, among all possible solutions for a given source instance, a special class of solutions that they call "universal". They showed that a universal solution has "good" properties that justify its choice for the semantics of the data exchange problem. They identified fairly general, yet practical, sufficient conditions that guarantee the existence of a universal solution. They gave a polynomial-time algorithm that determines whether a solution exists and, if so, produces a universal solution. The paper appeared in the 2003 International Conference on Database Theory (ICDT 03); the full version of this paper subsequently appeared in a special issue of Theoretical Computer Science.
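
In the simplest setting, a universal solution can be built by "chasing" the source instance with the source-to-target dependencies, inventing a fresh labeled null for each existential variable. Below is a minimal sketch of that idea in Python, assuming tgds with a single source atom and no target constraints (the paper handles the general case); all relation and variable names are invented for illustration.

```python
import itertools

_null_counter = itertools.count()

def fresh_null():
    # Labeled nulls stand for unknown values in the target.
    return "N{}".format(next(_null_counter))

def chase(source, tgds):
    """source: {relation: set of tuples}.  Each tgd is
    (src_rel, src_vars, [(tgt_rel, tgt_vars), ...]); a target variable
    not among src_vars is existential and gets a fresh labeled null."""
    target = {}
    for src_rel, src_vars, conclusions in tgds:
        for tup in source.get(src_rel, set()):
            env = dict(zip(src_vars, tup))   # bind the universal variables
            nulls = {}                       # one null per existential variable
            for tgt_rel, tgt_vars in conclusions:
                row = []
                for v in tgt_vars:
                    if v in env:
                        row.append(env[v])
                    else:
                        nulls.setdefault(v, fresh_null())
                        row.append(nulls[v])
                target.setdefault(tgt_rel, set()).add(tuple(row))
    return target

# Emp(e) -> exists m. Mgr(e, m)
source = {"Emp": {("alice",), ("bob",)}}
tgds = [("Emp", ("e",), [("Mgr", ("e", "m"))])]
print(chase(source, tgds))   # e.g. {'Mgr': {('alice', 'N0'), ('bob', 'N1')}}
```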

Data Exchange: Getting to the Core. There may be many universal solutions of the type just described. This naturally raises the question of whether there is a "best" universal solution, and hence a best solution for data exchange. Ron Fagin, Phokion Kolaitis, and Lucian Popa have answered this question by considering the notion of the "core" of a structure, a notion that was first studied in graph theory. The core of a structure is the smallest substructure that is also a homomorphic image of the structure. They showed that all universal solutions have the same core (up to isomorphism) and that this core is also a universal solution, and hence the smallest universal solution. The uniqueness of the core of a universal solution together with its minimality make the core an ideal solution for data exchange. Furthermore, the core can be proven to be the best among all universal solutions for answering a broad range of queries. This paper appeared in the 2003 ACM Symposium on Principles of Database Systems (PODS 03); the full version of this paper subsequently appeared in a special issue of ACM TODS.
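
For tiny instances, the core can be found by exhaustive search: among all homomorphisms of the target instance into itself that fix constants and remap only nulls, take one whose image is smallest; for finite instances, that image is the core. The sketch below (invented relation names) performs exactly this exponential search; the paper gives far more efficient, polynomial-time algorithms for the data exchange setting.

```python
from itertools import product

def apply_hom(instance, h):
    # Apply a mapping h defined on nulls; constants map to themselves.
    return {rel: {tuple(h.get(x, x) for x in t) for t in tuples}
            for rel, tuples in instance.items()}

def is_endomorphism(instance, h):
    image = apply_hom(instance, h)
    return all(image[rel] <= instance[rel] for rel in instance)

def core(instance, nulls):
    """Exhaustive search: minimize the image over all endomorphisms."""
    domain = sorted({x for ts in instance.values() for t in ts for x in t})
    names = sorted(nulls)
    best, best_size = instance, sum(len(ts) for ts in instance.values())
    for values in product(domain, repeat=len(names)):
        h = dict(zip(names, values))
        if is_endomorphism(instance, h):
            image = apply_hom(instance, h)
            size = sum(len(ts) for ts in image.values())
            if size < best_size:
                best, best_size = image, size
    return best

# A universal solution with a redundant null: N0 folds onto "alice".
target = {"Mgr": {("alice", "N0"), ("alice", "alice")}}
print(core(target, {"N0"}))   # {'Mgr': {('alice', 'alice')}}
```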

Composing Schema Mappings: Second-Order Dependencies to the Rescue. In a data exchange scenario as above, a schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Schema mappings play a key role in numerous areas of database systems, where data is converted from one format to another. Ron Fagin, Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan (UC Santa Cruz) gave a rigorous semantics to the composition of schema mappings. They investigated the definability and computational complexity of the composition of two schema mappings, and obtained some surprises along the way. This paper appeared in the 2004 ACM Symposium on Principles of Database Systems (PODS 04); the full version of this paper subsequently appeared in a special issue of ACM TODS.
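
A toy example in the spirit of the paper's results (relation names here are illustrative, not the paper's): suppose Σ12 says Emp(e) → ∃m Mgr(e, m), and Σ23 says Mgr(e, m) → Rep(e, m) and Mgr(e, e) → SelfMgr(e). The composed mapping cannot be written with ordinary first-order dependencies; it needs a second-order tgd with a Skolem function f, roughly ∃f [∀e (Emp(e) → Rep(e, f(e))) ∧ ∀e (Emp(e) ∧ e = f(e) → SelfMgr(e))]. The Python sketch below just materializes the composed mapping for concrete choices of f.

```python
def apply_composition(emps, f):
    """Materialize the composed (second-order) mapping for one concrete
    choice of the Skolem function f: each employee e gets Rep(e, f(e)),
    and SelfMgr(e) holds exactly when f picks e itself."""
    rep = {(e, f(e)) for e in emps}
    selfmgr = {e for e in emps if f(e) == e}
    return rep, selfmgr

emps = {"alice", "bob"}
print(apply_composition(emps, lambda e: e))        # everyone self-managed
print(apply_composition(emps, lambda e: "carol"))  # nobody self-managed
```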

Locally Consistent Transformations and Query Answering in Data Exchange. Given a source instance in a data exchange scenario as above, there may be many solutions -- target instances that satisfy the constraints of the data exchange problem. The "certain answers" to a query over the target schema are those tuples that appear in every possible solution. Marcelo Arenas (U. Toronto), Pablo Barcelo (U. Toronto), Ron Fagin, and Leonid Libkin (U. Toronto) studied the question of when a single solution, that is, a single target database, can be used to find the certain answers. This paper appeared in the 2004 ACM Symposium on Principles of Database Systems (PODS 04).
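
By definition, the certain answers could be computed by intersecting the query's answers over every solution, but the set of solutions is generally infinite; the point of this line of work is to get the same answers from a single, well-chosen target instance. The toy below (hypothetical relation names) merely illustrates the definition over an explicitly given finite list of solutions.

```python
def certain_answers(solutions, query):
    """Intersect the query's answers across an explicit list of solutions.
    (Only a toy: in general there are infinitely many solutions.)"""
    answer_sets = [query(inst) for inst in solutions]
    return set.intersection(*answer_sets) if answer_sets else set()

# Two hypothetical solutions for the same source instance:
sol1 = {"Mgr": {("alice", "bob"), ("carol", "dan")}}
sol2 = {"Mgr": {("alice", "erin")}}
who_is_managed = lambda inst: {e for (e, m) in inst["Mgr"]}
print(certain_answers([sol1, sol2], who_is_managed))   # {'alice'}
```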

Peer data exchange. Ariel Fuxman (U. Toronto), Phokion Kolaitis, Renée Miller (U. Toronto), and Wang-Chiew Tan (UC Santa Cruz) introduced and studied a framework, called peer data exchange, for sharing and exchanging data between peers. This framework is a special case of a full-fledged peer data management system and a generalization of data exchange between a source schema and a target schema. The motivation behind peer data exchange is to model authority relationships between peers: a source peer may contribute data to a target peer, as specified by source-to-target constraints, while a target peer may use target-to-source constraints to restrict the data it is willing to receive, but cannot modify the data of the source peer. Results about several key algorithmic and complexity-theoretic issues in peer data exchange were presented in a paper that appeared in the proceedings of the 2005 ACM Symposium on Principles of Database Systems (PODS 05).

Inverting schema mappings. Although the notion of an inverse of a schema mapping is important, the exact definition of an inverse mapping is somewhat elusive. This is because a schema mapping may associate many target instances with each source instance, and many source instances with each target instance. Based on the notion that the composition of a mapping and its inverse is the identity, Ron Fagin gave a formal definition for what it means for a schema mapping M' to be an inverse of a schema mapping M for a class S of source instances. If S is the class of all source instances, then M' is a "global inverse" of M. He gave a construction that produces a global inverse if one exists. This work appeared in the 2006 ACM Symposium on Principles of Database Systems (PODS 06); the full version of this paper subsequently appeared in a special issue of ACM TODS.
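
The formal definition treats the composition of the two mappings as a mapping in its own right (a set of instance pairs), which is subtler than a pointwise round trip; still, a pointwise toy conveys the flavor. Below, a hypothetical mapping M swaps the columns of R into S, the candidate inverse M' swaps them back, and chasing through both returns the original source.

```python
def m(source):          # M: R(x, y) -> S(y, x)
    return {"S": {(y, x) for (x, y) in source["R"]}}

def m_inv(target):      # candidate inverse M': S(y, x) -> R(x, y)
    return {"R": {(x, y) for (y, x) in target["S"]}}

src = {"R": {(1, 2), (3, 4)}}
assert m_inv(m(src)) == src   # the round trip is the identity here
```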

Quasi-inverses of schema mappings. Two operators on schema mappings, namely the composition operator and the inverse operator, are regarded as especially important. Progress on the study of the inverse operator was made only recently, as even pinning down the exact semantics of this operator turned out to be a fairly delicate task. Furthermore, the notion of an inverse is rather restrictive, since it is rare for a schema mapping to possess one. Ron Fagin, Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan (UC Santa Cruz) introduced and studied the notion of a quasi-inverse of a schema mapping. This notion is a principled relaxation of the notion of an inverse; intuitively, it is obtained from the notion of an inverse by not differentiating between instances that are equivalent for data-exchange purposes. They derived a number of properties, including a necessary and sufficient combinatorial condition for the existence of a quasi-inverse. A paper describing these results was presented at the 26th ACM Symposium on Principles of Database Systems (PODS 2007); the full version of this paper subsequently appeared in a special issue of ACM TODS.

Towards a theory of schema mapping optimization. Even though several different aspects of schema mappings have been explored in considerable depth, the study of schema-mapping optimization remains largely uncharted territory to date. In this work, Ron Fagin, Phokion Kolaitis, Alan Nash, and Lucian Popa lay the foundation for the development of a theory of schema-mapping optimization. They introduce several useful notions of "equivalence" between schema mappings. They show under what conditions a more complex schema mapping is equivalent to a much simpler schema mapping. A paper describing these results was presented at the 27th ACM Symposium on Principles of Database Systems (PODS 2008).


Other Recent Database Theory Papers

Bag-containment of conjunctive queries. The containment problem for conjunctive queries (sometimes known as SELECT-PROJECT-JOIN queries) is a fundamental problem in query optimization. In many query processing languages, e.g., SQL, queries return multisets, or bags, of answers. Despite this, much of the work on query containment treats queries as returning sets. T.S. Jayram, Phokion Kolaitis, and Erik Vee studied the bag-containment question, where queries are treated as returning bags of tuples. They developed several major insights into the problem and showed that determining whether one conjunctive query with inequalities is bag-contained in another such query is an undecidable problem. The results were presented at the 2006 ACM Symposium on Principles of Database Systems (PODS 06).
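
Set and bag semantics can genuinely disagree on containment. A standard illustration (not taken from the paper itself): Q1(x) :- R(x, y) and Q2(x) :- R(x, y), R(x, z) return the same set of answers, yet under bag semantics Q2's multiplicities are the squares of Q1's, so Q2 is not bag-contained in Q1. The Python sketch below evaluates both queries on a small bag.

```python
from collections import Counter

R = Counter({("a", 1): 1, ("b", 1): 1, ("b", 2): 1})   # a bag of tuples

def q1(r):
    """Q1(x) :- R(x, y), under bag semantics."""
    out = Counter()
    for (x, _y), m in r.items():
        out[x] += m
    return out

def q2(r):
    """Q2(x) :- R(x, y), R(x, z): a self-join, so multiplicities square."""
    out = Counter()
    for (x1, _y), m1 in r.items():
        for (x2, _z), m2 in r.items():
            if x1 == x2:
                out[x1] += m1 * m2
    return out

print(q1(R))   # b -> 2, a -> 1
print(q2(R))   # b -> 4, a -> 1: same support, larger multiplicities
```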

OLAP Over Uncertain and Imprecise Data. The OLAP paradigm, which is very useful in analyzing structured/unstructured information in large-scale business applications, does not provide support for uncertainty and imprecision in the data. Doug Burdick (U. Wisconsin, Madison), Prasad M. Deshpande, T.S. Jayram, Raghu Ramakrishnan (U. Wisconsin, Madison), and Shivakumar Vaithyanathan addressed two significant challenges in this context: (a) the uncertainty associated with measures derived from text and (b) the imprecision associated with dimension attributes. Two key requirements were postulated for any OLAP-based solution to this problem: (i) the answers must be consistent at different levels of the dimension hierarchy, and (ii) the answers must be faithful to the underlying input data. These requirements were formalized and an allocation-based framework was proposed to satisfy consistency and faithfulness. A key aspect is the identification of correlation-preserving criteria that guide the choice of appropriate allocation policies. The resulting solution can be mapped to an extended database that generalizes the standard star schema used in OLAP. Efficient algorithms that operate on the extended database were designed for the standard aggregation operators of OLAP; the complexity analysis showed that they are comparable to those used in conventional OLAP. A paper describing these results appeared in the 2005 International Conference on Very Large Databases (VLDB 05).
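
To make "allocation" concrete, here is a toy sketch with invented data and a simple count-proportional policy (one possibility among those the paper analyzes): a fact recorded only at the imprecise node "East" is split across its leaf cells in proportion to the number of precise facts in each. After allocation, roll-ups stay consistent: the total at "East" equals the sum over its children.

```python
# Precise facts carry a leaf-level city; one fact only says "East".
precise = {"NY": [100.0, 120.0], "MA": [80.0]}   # sales measures per city
imprecise = [("East", 60.0)]                     # city unknown
children = {"East": ["NY", "MA"]}

# Count-proportional allocation of each imprecise fact to leaf cells.
alloc = {city: list(vals) for city, vals in precise.items()}
for node, measure in imprecise:
    total = sum(len(precise[c]) for c in children[node])
    for c in children[node]:
        alloc[c].append(measure * len(precise[c]) / total)

for city, vals in alloc.items():
    print(city, sum(vals))   # NY 260.0, MA 100.0; grand total 360.0
```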

Multi-structural databases. We introduced the "Multi-Structural Database", a rich new data framework to support efficient analysis of large, complex data sets. An instance of the model consists of a set of data objects, together with a schema that specifies segmentations of the set of data objects according to multiple distinct criteria (e.g., into a taxonomy based on a hierarchical attribute). Within this model, we developed a rich set of analytical operations and designed highly efficient algorithms for these operations. Our operations were formulated as optimization problems that allow the user to analyze the underlying data in terms of the allowed segmentations. An example is to find the ten most significant non-overlapping regions of geography crossed with time in which coverage of the Olympics was much stronger in newspapers than in online sources. This work led to papers in two conferences: the first paper (by Ron Fagin, R. Guha, Ravi Kumar, Jasmine Novak, D. Sivakumar, and Andrew Tomkins) appeared in the 2005 ACM Symposium on Principles of Database Systems (PODS 05), and introduced the framework and presented some efficient algorithms for answering queries. The second paper (by Ron Fagin, Phokion Kolaitis, Ravi Kumar, Jasmine Novak, D. Sivakumar, and Andrew Tomkins) appeared in the 2005 International Conference on Very Large Databases (VLDB 05), and gave a much broader class of queries, and efficient algorithms for answering these queries. We tested our algorithms on collections of several billion web documents, and showed that we could answer these very complex queries in seconds.
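
A deliberately flat caricature of such a "differential" query, with invented counts (in the actual model the segmentations are hierarchical, candidate regions can overlap, and the optimization is correspondingly harder): score each (geography, time) cell by how much newspaper coverage exceeds online coverage, then take the top cells.

```python
from heapq import nlargest

# Invented document counts per (geography, week) cell.
newspaper = {("US", "Aug-w2"): 900, ("US", "Aug-w3"): 700, ("EU", "Aug-w2"): 400}
online    = {("US", "Aug-w2"): 300, ("US", "Aug-w3"): 650, ("EU", "Aug-w2"): 100}

cells = set(newspaper) | set(online)
score = {c: newspaper.get(c, 0) - online.get(c, 0) for c in cells}
print(nlargest(10, score.items(), key=lambda kv: kv[1]))
# [(('US', 'Aug-w2'), 600), (('EU', 'Aug-w2'), 300), (('US', 'Aug-w3'), 50)]
```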

Comparing and aggregating rankings with ties. Rank aggregation is a useful abstraction that has several applications, including meta-search, synthesizing rank functions from multiple indices, similarity search, and classification. In database applications (catalog searches, fielded searches, parametric searches, etc.), the rankings are produced by sorting an underlying database according to various fields. Typically, there are a number of fields that each have very few distinct values, and hence the corresponding rankings have many ties in them. Known methods for rank aggregation are poorly suited to this context, and the difficulties can be traced back to the fact that we do not have sound mathematical principles to compare two "partial rankings", that is, rankings that allow ties. Ron Fagin, Ravi Kumar, Mohammad Mahdian (MIT), D. Sivakumar, and Erik Vee have provided a comprehensive picture of how to compare partial rankings. They proposed several metrics to compare partial rankings, presented algorithms that efficiently compute them, and proved that they are within constant multiples of each other. Based on these concepts, they formulated aggregation problems for partial rankings, and developed a highly efficient algorithm to compute the top few elements of a near-optimal aggregation of multiple partial rankings. This paper appeared in the 2004 ACM Symposium on Principles of Database Systems (PODS 04).
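
One natural metric of this kind generalizes the Kendall tau distance with a tie penalty: a pair ordered oppositely in the two partial rankings costs 1, and a pair tied in exactly one of them costs p. The sketch below implements a metric of that flavor; the paper's precise definitions, parameter ranges, and equivalence constants should be taken from the paper itself.

```python
from itertools import combinations

def kendall_with_ties(r1, r2, p=0.5):
    """Kendall-style distance between two partial rankings, each given as
    a map from item to bucket position (equal position = tied).  Assumes
    both rankings cover the same items."""
    dist = 0.0
    for i, j in combinations(r1, 2):
        a, b = r1[i] - r1[j], r2[i] - r2[j]
        if a * b < 0:                 # ordered oppositely
            dist += 1
        elif (a == 0) != (b == 0):    # tied in exactly one ranking
            dist += p
    return dist

r1 = {"x": 1, "y": 1, "z": 2}   # x and y tied, both ahead of z
r2 = {"x": 1, "y": 2, "z": 3}
print(kendall_with_ties(r1, r2))   # 0.5: only the (x, y) pair disagrees
```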

Epistemic privacy. Most known definitions of privacy focus either on theoretical soundness or on practical usefulness, but not on both. In an attempt to reduce the gap between these two aspects, Alexandre Evfimievski, Ronald Fagin, and David Woodruff introduced and studied a novel privacy definition, set in the framework of offline (retroactive) database query auditing. According to this definition, an audited property A is private, given the disclosure of a fact B, if no user can gain confidence in A by learning B, subject to prior knowledge constraints. Database users are modeled either as possibilistic agents, whose knowledge is a set of possible databases, or as probabilistic agents, whose knowledge is a probability distribution on possible databases. The new privacy notion is shown to be significantly more flexible than earlier approaches, and criteria are derived that allow an auditor to test privacy efficiently in some important cases. A paper describing these results was presented at the 27th ACM Symposium on Principles of Database Systems (PODS 2008).
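
One simplified reading of the possibilistic case (the paper's actual definition quantifies over the admissible prior-knowledge states): a user whose knowledge is a set of possible databases gains confidence in A from B if A becomes certain only after the possible worlds are narrowed to those satisfying B. A toy check, with the worlds, A, and B invented for illustration:

```python
def discloses(worlds, holds_a, holds_b):
    """Does learning that B holds make the user certain of A when they
    were not certain before?  worlds: the user's possible databases."""
    certain_before = all(holds_a(w) for w in worlds)
    remaining = [w for w in worlds if holds_b(w)]
    certain_after = bool(remaining) and all(holds_a(w) for w in remaining)
    return certain_after and not certain_before

# Toy worlds: a single salary field.  A = "salary > 50k", B varies.
worlds = [30_000, 45_000, 60_000]
print(discloses(worlds, lambda s: s > 50_000, lambda s: s > 40_000))  # False
print(discloses(worlds, lambda s: s > 50_000, lambda s: s > 55_000))  # True
```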
