IBM Research

IBM Research - Almaden

Tesla: On-Demand Information Systems

 Metawrapper for Dynamic Federation of Distributed Data Sources

Many applications involve integrating data from multiple data sources and their replicas distributed across multiple nodes. In a distributed (grid) environment, these data sources can be fairly dynamic, in two respects:

  • Replication: Data sources can be replicated in whole or in part, depending on the query workloads and the system loads. Replicas are created and destroyed by separate entities such as the Data Placement Manager, independently of the applications accessing the data.
  • Source Failure/Addition: Distributed data sources are often difficult to maintain under centralized administrative control. New data sources can register with a grid, and existing ones can leave it, independently. Data sources can also fail independently.

Today, applications are programmed to access specific data sources, so any dynamism at the data sources can be handled only by reprogramming the application. For example, SQL applications hardcode the data sources through the nicknames used in their FROM clauses. Our goal is to enhance a federated DBMS such as DB2 Information Integrator to shield applications from this dynamism and present a single system image. This means that application queries must not name data sources directly; instead, they are bound dynamically to the right set of data sources and replicas at run time. The data sources are chosen based on the query predicates, and the replicas are chosen based on the application's staleness tolerance (specified as part of its QOS requirement). Depending on the application's QOS needs, the DBMS should also mask data source failures during query execution by gracefully degrading to partial query results.

There are many challenges in doing such dynamic binding:

  • Cost-Based Replica Selection: When choosing among alternative replicas, we need to involve the cost-based optimizer rather than use heuristics. E.g., different replicas could contain joins of different subsets of tables, and be located on machines of varying speeds. Choosing a replica requires understanding what portion of the overall query plan can be "pushed down" into the replica.
  • Runtime Binding: In some applications the information needed to choose the source is not fully available even at compile (optimization) time. One reason for this is that the source selection may depend on dynamic quantities like parameter markers, or even outer variables from bind joins. Another reason is that the replica chosen at compile time may be overloaded or unavailable at runtime. So we want some flexibility in choosing sources just before query execution begins.
  • Adaptation during query execution: In some cases we want to adapt the set of sources even after the query has started executing. The simplest example is graceful degradation under failures: when running a query over 15 data sources, if we lose the connection to one source 90% of the way through the query, it is often preferable to finish the query with partial results from the available sources. More generally, if a source fails at run time, it might be possible to complete the query by switching to a replica.
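
The cost-based replica selection described in the first challenge can be illustrated with a small Python sketch. Everything in it is hypothetical: the `Replica` class, the `plan_cost` formula, and the unit costs are invented for illustration and are not the DB2 optimizer's cost model. The sketch only shows why the choice depends on both how much of the query can be pushed down and how fast the replica's machine is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    tables: frozenset  # tables whose join this replica can evaluate
    speed: float       # relative machine speed (higher is faster)


def plan_cost(replica, query_tables, remote_unit=1.0, local_unit=3.0):
    """Toy cost model (an assumption, not a real optimizer formula).

    Tables covered by the replica are pushed down and cost less on a
    faster machine; the remaining tables are handled locally at a
    higher per-table cost.
    """
    pushed = query_tables & replica.tables
    if not pushed:
        return float("inf")  # this replica cannot contribute to the query
    remaining = query_tables - pushed
    return remote_unit * len(pushed) / replica.speed + local_unit * len(remaining)


def choose_replica(replicas, query_tables):
    """Pick the replica whose plan has the lowest estimated cost."""
    return min(replicas, key=lambda r: plan_cost(r, query_tables))


query = frozenset({"orders", "customers", "items"})
replicas = [
    Replica("fast_partial", frozenset({"orders"}), speed=4.0),
    Replica("slow_full", frozenset({"orders", "customers", "items"}), speed=1.0),
]
best = choose_replica(replicas, query)
print(best.name)  # slow_full: full pushdown outweighs the faster machine
```

Under these made-up unit costs, the slower replica wins because it can evaluate the whole join remotely. A heuristic that always prefers the fastest machine would pick the other one.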

Logical Domains and MetaWrapper

We introduce the notion of a logical domain as the union of all data sources and replicas providing similar content. Application queries are then written against this logical domain. The logical domain is registered as a nickname backed by a new federated wrapper called a MetaWrapper, which abstracts all data sources and replicas under the domain. The actual sources and replicas are registered in a separate metadata repository, which sits on a separate node from the DBMS. The application is shielded from source dynamism because it sees the entire domain as a single table. The MetaWrapper dynamically contacts the metadata repository with the query predicates and QOS requirements to fetch the right set of replicas and sources.
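
As a rough illustration of this lookup, the following Python sketch models a metadata repository that resolves a logical domain to concrete sources and replicas. The class names, the key-range representation of a query predicate, and the staleness field are all assumptions made for the example; the actual repository schema and interface are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    location: str
    staleness_s: int  # how far this copy may lag the primary, in seconds


@dataclass
class Source:
    name: str
    key_range: tuple  # (low, high), inclusive range of keys held by the source
    replicas: list = field(default_factory=list)


class MetadataRepository:
    """Hypothetical registry mapping a logical domain to its sources."""

    def __init__(self):
        self._domains = {}

    def register(self, domain, source):
        self._domains.setdefault(domain, []).append(source)

    def unregister(self, domain, source_name):
        kept = [s for s in self._domains.get(domain, []) if s.name != source_name]
        self._domains[domain] = kept

    def resolve(self, domain, key_low, key_high, max_staleness_s):
        """Return {source name: [replica locations]} for one query.

        Sources are filtered by the query predicate (a key range here);
        replicas are filtered by the application's staleness tolerance.
        """
        bound = {}
        for src in self._domains.get(domain, []):
            low, high = src.key_range
            if high < key_low or low > key_high:
                continue  # the predicate excludes this source
            fresh = [r.location for r in src.replicas
                     if r.staleness_s <= max_staleness_s]
            if fresh:
                bound[src.name] = fresh
        return bound


repo = MetadataRepository()
repo.register("sales", Source("east", (0, 499),
                              [Replica("node1", 0), Replica("node2", 600)]))
repo.register("sales", Source("west", (500, 999), [Replica("node3", 0)]))

# The query names only the domain "sales"; binding happens at lookup time.
bound = repo.resolve("sales", 100, 200, max_staleness_s=60)
print(bound)  # {'east': ['node1']}
```

Because the application refers only to the domain, registering or unregistering a source changes the result of `resolve` without any change to the query.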

[Figure: MetaWrapper]

Since wrappers have both compile-time and run-time components, this source/replica selection can be changed even at run-time. At compile-time, the MetaWrapper mediates the Request/Reply/Compensate protocol between the DBMS and the actual source wrappers. The MetaWrapper has no optimization logic, but exposes all plans involving the remote sources and replicas so that the DBMS optimizer can choose the best one. Even though the overall plan is fixed after compile-time, the MetaWrapper still has flexibility to change sources/replicas at run-time as long as the new source can perform the same query fragment accepted at compile-time.
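
The run-time substitution rule can be sketched as follows. This is a simplified Python model, not the Request/Reply/Compensate protocol itself: a query fragment is represented as a set of operation names, and the `rebind` function and its arguments are hypothetical names introduced for illustration.

```python
def rebind(fragment, chosen, capabilities, available):
    """Return the source that should run `fragment` at run time.

    fragment     -- set of operations accepted at compile time
    chosen       -- name of the source selected at compile time
    capabilities -- dict: source name -> set of operations it can perform
    available    -- set of source names that are reachable right now

    The compile-time choice is kept if it is still available. Otherwise
    the first available source that can perform the same fragment is
    used. None means no valid substitute exists.
    """
    if chosen in available:
        return chosen
    for name, caps in capabilities.items():
        if name in available and fragment <= caps:
            return name
    return None


fragment = {"scan", "filter", "join"}
capabilities = {
    "primary": {"scan", "filter", "join"},
    "replica_a": {"scan", "filter"},                  # cannot run the join
    "replica_b": {"scan", "filter", "join", "sort"},  # superset: acceptable
}

# The primary is down at run time, so the fragment moves to replica_b.
target = rebind(fragment, "primary", capabilities, {"replica_a", "replica_b"})
print(target)  # replica_b
```

The subset test is what keeps the overall plan valid: a substitute that cannot perform the accepted fragment would require re-optimization, which this sketch does not attempt.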

The run-time component of the MetaWrapper redirects fetch requests to the wrappers of each of the matching sources. These fetches are done asynchronously, both for parallelism and so that the query can degrade gracefully when a source fails.
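
A minimal Python sketch of asynchronous fetching with graceful degradation is shown below, using the standard `asyncio` module. The `fetch_all` function, the per-source coroutines, and the timeout value are assumptions for the example and do not reflect the MetaWrapper's actual wrapper interface.

```python
import asyncio


async def fetch_all(fetchers, timeout_s=1.0):
    """Fetch from all sources concurrently and tolerate failures.

    fetchers -- dict: source name -> zero-argument coroutine function
                returning a list of rows
    Returns (rows, failed), where `failed` lists the sources that raised
    an error or timed out. Rows from healthy sources are still returned.
    """
    names = list(fetchers)
    tasks = [asyncio.wait_for(fetchers[n](), timeout_s) for n in names]
    # return_exceptions=True keeps one failure from cancelling the rest.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    rows, failed = [], []
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failed.append(name)
        else:
            rows.extend(outcome)
    return rows, failed


async def ok_source():
    return [("a", 1), ("b", 2)]


async def broken_source():
    raise ConnectionError("lost connection to source")


rows, failed = asyncio.run(fetch_all({"s1": ok_source, "s2": broken_source}))
print(rows, failed)  # [('a', 1), ('b', 2)] ['s2']
```

Reporting the failed sources alongside the partial rows lets the caller decide, based on the application's QOS needs, whether a partial answer is acceptable or whether a replica should be tried instead.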

For Further Details

  • Dynamic and Selective data source binding through a metawrapper. Patent application ARC9-2004-0041, 2004.
  • A metawrapper for dynamic binding of distributed data sources.
  • Data Access and Management Services on Grid. V. Raman, I. Narang, C. Crone, L. Haas, S. Malaika, T. Mukai, D. Wolfson, and C. Baru. Database Access and Integration Services Working Group, Global Grid Forum (GGF) 5, 2002.