## Overview

There is a long history of research at IBM on the theory of databases. The relational model was invented, developed, and studied extensively at the IBM San Jose Research Lab, which evolved into IBM Research - Almaden. In recent years, one of the focus areas of our database research in the Theory Group has been data exchange, the process of converting data from one representation to another. The source representation might be relational, and the target might be another relational representation, or XML. We have also done recent work in other areas of database theory, including querying, rank aggregation, and privacy.

## Papers on Data Exchange

**Data Exchange: Semantics and Query Answering.**
Since the source and target schemas in data exchange
each have their own constraints (such as keys), there may not
exist a solution, that is, a target database that is
consistent with the specifications. Or, instead of no
solutions, there may be many solutions. Ron Fagin,
Phokion Kolaitis, Renee Miller (U. Toronto), and
Lucian Popa gave an algebraic
specification that selects, among all possible solutions for a
given source instance, a special class of solutions that they
call "universal". They showed that a universal solution has
"good" properties that justify its choice for the semantics of
the data exchange problem. They identified fairly general, yet
practical, sufficient conditions that guarantee the existence
of a universal solution. They gave a polynomial-time algorithm
that determines whether a solution exists and, if so, produces
a universal solution.
The paper appeared in the 2003 International Conference on Database Theory (ICDT 03); the full version subsequently appeared in a special issue of Theoretical Computer Science.
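The algorithm at the heart of this result is based on the chase procedure. A minimal sketch of the idea (our own illustration, not code from the paper; the `Emp`/`Dept` schema is hypothetical): chasing a single source-to-target dependency Emp(name, dept) -> EXISTS m. Dept(dept, m) over a source instance, inventing a fresh labeled null for each existentially quantified value.

```python
def chase(source_emp):
    """Chase Emp(name, dept) -> EXISTS m. Dept(dept, m) over a source
    instance, returning a target Dept relation that is a universal
    solution.  Existential values become labeled nulls N1, N2, ..."""
    target_dept = []
    null_counter = 0
    seen_depts = {}
    for name, dept in source_emp:
        if dept not in seen_depts:
            null_counter += 1
            # fresh labeled null for the unknown manager of this department
            seen_depts[dept] = f"N{null_counter}"
            target_dept.append((dept, seen_depts[dept]))
    return target_dept

emp = [("alice", "sales"), ("bob", "sales"), ("carol", "hr")]
print(chase(emp))  # [('sales', 'N1'), ('hr', 'N2')]
```

The labeled nulls are what make the result universal: any other solution can be obtained from it by a homomorphism that maps each null to a concrete value.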

**Data Exchange: Getting to the Core.**
There
may be many universal solutions of the type just described.
This naturally raises the question of whether there is a
"best" universal solution, and hence a best solution for data
exchange. Ron Fagin, Phokion Kolaitis, and
Lucian Popa have answered this question by considering the
notion of the "core" of a structure, a notion that was first
studied in graph theory. The core of a structure is the
smallest substructure that is also a homomorphic image of the
structure. They showed that all universal solutions have the
same core (up to isomorphism) and that this core is also a
universal solution, and hence the smallest universal solution.
The uniqueness of the core of a universal solution together
with its minimality make the core an ideal solution for data
exchange. Furthermore, the core can be proven to be the best
among all universal solutions for answering a broad range of
queries. This paper appeared in the 2003 ACM Symposium on Principles of Database Systems (PODS 03); the full version subsequently appeared in a special issue of ACM TODS.
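For intuition, the core of a small directed graph can be computed by brute force: search, in increasing size, for an induced subgraph that the whole graph maps into homomorphically. A toy sketch (ours, practical only for tiny graphs, not the paper's algorithm):

```python
from itertools import combinations, product

def core(vertices, edges):
    """Return the vertex set of a core: a smallest induced subgraph that
    is a homomorphic image of the whole graph (brute force)."""
    edges = set(edges)
    for size in range(1, len(vertices) + 1):
        for sub in combinations(vertices, size):
            sub_edges = {(u, v) for (u, v) in edges if u in sub and v in sub}
            # look for a homomorphism from the whole graph into the subgraph
            for images in product(sub, repeat=len(vertices)):
                h = dict(zip(vertices, images))
                if all((h[u], h[v]) in sub_edges for (u, v) in edges):
                    return set(sub)
    return set(vertices)

# A 6-cycle with edges in both directions (an undirected hexagon) is
# bipartite, so it retracts onto a single edge: its core has 2 vertices.
cyc = list(range(6))
edges = [(i, (i + 1) % 6) for i in cyc] + [((i + 1) % 6, i) for i in cyc]
print(core(cyc, edges))  # {0, 1}
```

The homomorphism found here is just a proper 2-coloring, which is exactly why the bipartite cycle collapses to one edge.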

**Composing Schema Mappings: Second-Order Dependencies to the Rescue.**
In a data exchange scenario as above, a schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). Schema mappings play a key role in numerous areas of database systems, where data is converted from one format to another.
Ron Fagin, Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan (UC Santa
Cruz) gave a rigorous semantics to the composition
of schema mappings. They investigated the definability and
computational complexity of the composition of two schema mappings,
and obtained some surprises along the way.
This paper appeared in the 2004 ACM Symposium on Principles of Database Systems (PODS 04); the full version subsequently appeared in a special issue of ACM TODS.
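Semantically, a schema mapping can be viewed as a binary relation between source instances and target instances, and composing two mappings is then ordinary relational composition. A toy sketch of this semantics (our illustration, with instances reduced to opaque labels; it is not the paper's second-order construction):

```python
def compose(m12, m23):
    """Relational composition of two schema mappings, each given
    semantically as a set of (source instance, target instance) pairs."""
    return {(i, k) for (i, j) in m12 for (j2, k) in m23 if j == j2}

# Toy mappings over labeled instances.
m12 = {("I1", "J1"), ("I2", "J2")}
m23 = {("J1", "K1"), ("J2", "K2"), ("J2", "K3")}
print(sorted(compose(m12, m23)))
# [('I1', 'K1'), ('I2', 'K2'), ('I2', 'K3')]
```

One of the surprises in the paper is that even when both mappings are specified by first-order dependencies, this composition need not be first-order definable, which is what motivates second-order dependencies.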

**Locally Consistent Transformations and Query Answering in Data Exchange.**
Given a source instance in a data exchange scenario as above, there may be many solutions -- target instances that satisfy the constraints of the data exchange problem. The "certain answers" to a query over the target schema are those tuples that appear in every possible solution.
Marcelo Arenas (U. Toronto), Pablo Barcelo (U. Toronto), Ron Fagin, and Leonid
Libkin (U. Toronto) studied the
question of when a single solution, that is, a single target database,
can be used to find the certain answers.
This paper appeared in the 2004 ACM Symposium on Principles of Database Systems (PODS 04).
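The definition of certain answers can be sketched as a simple intersection. In this toy illustration (ours) the query's answer set over each solution is given explicitly; the whole point of the paper is, of course, to compute certain answers from a single materialized solution rather than by enumerating all of them.

```python
def certain_answers(answer_sets):
    """Intersect the query answers over every solution: a tuple is a
    certain answer iff it appears in the answer over each solution."""
    result = set(answer_sets[0])
    for answers in answer_sets[1:]:
        result &= set(answers)
    return result

# The query returns {a, b} over one solution and {a, c} over another,
# so only 'a' is certain.
sols = [{("a",), ("b",)}, {("a",), ("c",)}]
print(certain_answers(sols))  # {('a',)}
```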

**Peer data exchange.**
Ariel Fuxman (U. Toronto), Phokion Kolaitis, Renee Miller (U. Toronto), and Wang-Chiew Tan (UC Santa Cruz)
introduced and studied a framework,
called peer data exchange, for sharing and exchanging data between peers.
This framework is a special case of a full-fledged peer data management system
and a generalization of data exchange between a source schema and a target schema.
The motivation behind peer data exchange is to model authority relationships between peers,
where a source peer may contribute data to a target peer, specified using source-to-target
constraints, and a target peer may use target-to-source constraints to restrict the data
it is willing to receive, but cannot modify the data of the source peer.
Results about several key algorithmic and complexity-theoretic issues in this framework were presented in a paper that appeared in the proceedings of the 2005 ACM Symposium on Principles of Database Systems (PODS 05).

**Inverting schema mappings.**
Although the notion of an inverse of a schema mapping is important,
the exact definition of an inverse mapping is somewhat elusive.
This is because a schema mapping may associate many target instances
with each source instance, and many source instances with each target instance.
Based on the notion that the composition of a mapping and its inverse is the identity,
Ron Fagin gave a formal definition for what it means for a schema mapping M'
to be an inverse of a schema mapping M for a class S of source instances.
If S is the class of all source instances, then M' is a "global inverse" of M.
He gave a construction that produces a global inverse if one exists.
This work appeared in the 2006 ACM Symposium on Principles of Database Systems (PODS 06); the full version subsequently appeared in a special issue of ACM TODS.

**Quasi-inverses of schema mappings.**
Two operators on schema mappings, namely the composition operator and the inverse operator, are regarded
as especially important. Progress on the inverse operator came only recently, as even pinning down the exact semantics of this operator turned out to be a fairly delicate task. Furthermore, the notion of an inverse is rather restrictive, since it is rare for a schema mapping to possess an inverse.
Ron Fagin, Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan (UC Santa Cruz) introduced and studied
the notion of a quasi-inverse of a schema mapping. This notion is a principled relaxation of the
notion of an inverse of a schema mapping; intuitively, it is obtained from the notion of an inverse by
not differentiating between instances that are equivalent for data-exchange purposes.
They derived a number of properties, including a necessary and sufficient combinatorial
condition for the existence of a quasi-inverse.
A paper describing these results was presented at the 26th ACM Symposium on Principles of Database Systems (PODS 2007); the full version subsequently appeared in a special issue of ACM TODS.

**Towards a theory of schema mapping optimization.**
Even though several different aspects of schema mappings
have been explored in considerable depth, the study of schema-mapping optimization remains
largely uncharted territory to date. In this work, Ron Fagin, Phokion Kolaitis,
Alan Nash, and Lucian Popa lay the foundation for the development of a theory of
schema-mapping optimization. They introduce several useful notions of "equivalence"
between schema mappings. They show under what conditions a more complex schema mapping
is equivalent to a much simpler schema mapping. A paper describing these results was presented at the 27th ACM Symposium on Principles of Database Systems (PODS 2008).

## Other Recent Database Theory Papers

**Bag-containment of conjunctive queries.**
The containment problem for conjunctive queries (sometimes known as SELECT-PROJECT-JOIN queries)
is a fundamental problem in query optimization. In many query languages, e.g., SQL, queries return multisets, or bags, of answers. Despite this, much of the work on query containment treats queries as returning sets. T.S. Jayram, Phokion Kolaitis, and Erik Vee studied the bag-containment question, where queries are treated as returning bags of tuples. They developed several insights into the problem and showed that determining whether one conjunctive query with inequalities is bag-contained in another such query is undecidable. The results were presented at the 2006 ACM Symposium on Principles of Database Systems (PODS 06).
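A toy example (ours) of why set-containment and bag-containment diverge: the self-join Q1(x) :- R(x), R(x) and the identity query Q2(x) :- R(x) are equivalent under set semantics, but under bag semantics Q1 squares every multiplicity, so Q2 is bag-contained in Q1 while Q1 is not bag-contained in Q2.

```python
from collections import Counter

# A bag relation: tuple -> multiplicity.
R = Counter({"a": 2, "b": 1})

Q2 = R                                        # Q2(x) :- R(x)
Q1 = Counter({x: m * m for x, m in R.items()})  # Q1(x) :- R(x), R(x)

print(Q1)                                     # Counter({'a': 4, 'b': 1})
print(all(Q2[x] <= Q1[x] for x in Q2))        # True:  Q2 bag-contained in Q1
print(all(Q1[x] <= Q2[x] for x in Q1))        # False: Q1 not bag-contained in Q2
```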

**OLAP Over Uncertain and Imprecise Data.**
The OLAP paradigm, which is very useful in analyzing structured/unstructured
information in large-scale business applications, does not provide support
for uncertainty and imprecision in the data. Doug Burdick (U. Wisconsin, Madison), Prasad M. Deshpande, T.S. Jayram, Raghu Ramakrishnan (U. Wisconsin, Madison), and Shivakumar Vaithyanathan
addressed two significant challenges in this context: (a) the uncertainty associated
with measures derived from text and (b) the imprecision associated with dimension attributes.
Two key requirements were postulated for any OLAP-based solution to this problem:
(i) the answers must be consistent at different levels of the dimension hierarchy,
and (ii) the answers must be faithful to the underlying input data.
These requirements were formalized and an allocation-based framework
was proposed to satisfy consistency and faithfulness. A key aspect is
the identification of correlation-preserving criteria that guide the
choice of appropriate allocation policies. The resulting solution can
be mapped to an extended database that
generalizes the standard star schema used in OLAP. Efficient algorithms
that operate on the extended database were designed for standard aggregation
operators of OLAP; the complexity analysis showed that they are comparable
to those used in conventional OLAP. A paper describing these results appeared in the 2005 International Conference on Very Large Databases (VLDB 05).

**Multi-structural databases.**
We introduced the "Multi-Structural Database", a rich new data framework to support
efficient analysis of large, complex data sets. An instance of the model consists
of a set of data objects, together with a schema that specifies segmentations of
the set of data objects according to multiple distinct criteria (e.g., into a
taxonomy based on a hierarchical attribute). Within this model, we developed
a rich set of analytical operations and designed highly efficient algorithms for these operations.
Our operations were formulated as optimization problems that allow the user to
analyze the underlying data in terms of the allowed segmentations.
An example is to find the ten most significant non-overlapping regions of geography crossed with
time in which coverage of the Olympics was much stronger in newspapers than online sources.
This work led to papers in two conferences. The first paper (by Ron Fagin, R. Guha, Ravi Kumar, Jasmine Novak, D. Sivakumar, and Andrew Tomkins) appeared in the 2005 ACM Symposium on Principles of Database Systems (PODS 05); it introduced the framework and presented some efficient algorithms for answering queries. The second paper (by Ron Fagin, Phokion Kolaitis, Ravi Kumar, Jasmine Novak, D. Sivakumar, and Andrew Tomkins) appeared in the 2005 International Conference on Very Large Databases (VLDB 05); it covered a much broader class of queries, with efficient algorithms for answering them. We tested our algorithms on collections of several billion web documents, and showed that we could answer these very complex queries in seconds.

**Comparing and aggregating rankings with ties.**
Rank aggregation is a useful abstraction
that has several applications, including meta-search, synthesizing
rank functions from multiple indices, similarity search, and
classification. In database applications (catalog searches, fielded
searches, parametric searches, etc.), the rankings are produced by
sorting an underlying database according to various fields. Typically,
there are a number of fields that each have very few distinct values,
and hence the corresponding rankings have many ties in them. Known
methods for rank aggregation are poorly suited to this context, and
the difficulties can be traced back to the fact that we do not have
sound mathematical principles to compare two "partial rankings", that
is, rankings that allow ties.
Ron Fagin, Ravi Kumar, Mohammad Mahdian (MIT), D. Sivakumar, and Erik Vee have provided a comprehensive picture of how to compare partial rankings.
They proposed several metrics to compare
partial rankings, presented algorithms that
efficiently compute them, and proved that
they are within constant multiples of each other.
Based on these concepts, they formulated aggregation
problems for partial rankings, and developed a highly
efficient algorithm to compute the top few elements
of a near-optimal aggregation of multiple partial rankings.
This paper appeared in the 2004 ACM Symposium on Principles of Database Systems (PODS 04).

**Epistemic privacy.**
Most known definitions of privacy are focused either on theoretical soundness,
or on practical usefulness, but not on both. In an attempt to reduce the gap
between these two aspects, Alexandre Evfimievski, Ronald Fagin, and David Woodruff
introduced and studied a novel privacy definition, set in the framework of offline
(retroactive) database query auditing. According to this definition, an audited
property *A* is private, given the disclosure of a fact *B*,
if no user can gain confidence in *A* by learning *B*, subject
to prior knowledge constraints. Database users are modeled as either possibilistic agents
whose knowledge is a set of possible databases, or as probabilistic agents whose knowledge
is a probability distribution on possible databases. The new privacy notion is shown
to be significantly more flexible than earlier approaches, and criteria are derived
that allow an auditor to test privacy efficiently in some important cases.
A paper describing these results was presented at the 27th ACM Symposium on Principles of Database Systems (PODS 2008).