|
 |
IBM Almaden Research Center
Scalable Database Architectures

We are investigating new architectures for database systems,
along two dimensions:
- Large-scale parallelism over non-dedicated computers, driven
by rapid increases in interconnect bandwidth.
- Embedding data processing functionality in storage
devices, so as to push down data reduction operations.
Embedded Query Processing
There is an increasing trend for IT configurations to include very
powerful storage servers. At the same time, the traditional
shared-nothing style of parallel query processing has
serious scalability limitations (see VLDB'05 paper below).
So, we are investigating an alternate componentization of the
database system archtitecture where some data intensive operations are
embedded into processors close to the disk: SAN controllers,
RAID controllers, or even lower, and other operations are pulled out into
separate compute blades. |
Parallel Querying with Non-Dedicated Computers |
We are investigating Data In The Network (DITN),
a new method of parallel
querying based on dynamic outsourcing of
join processing tasks to non-dedicated, heterogeneous
computers. In DITN, partitioning is
not the means of parallelism. Data layout decisions
are taken outside the scope of the DBMS,
and handled within the storage software; query
processors see a Data-In-The-Network image.
This allows gradual scaleout as the workload
grows, by using non-dedicated computers.
A typical operator in a parallel query plan
is the Exchange operator that migrates tuples between nodes. One of our main arguments is that Exchange
is unsuitable for non-dedicated machines because
it poorly addresses node heterogeneity,
and is vulnerable to failures or load spikes
during query execution. DITN uses an alternate
intra-fragment parallelism where each
node executes an independent select-project-join-
aggregate-group by block, with no tuple
exchange between nodes. This method cleanly
handles heterogeneous nodes, and well adapts
during execution to node failures or load spikes.
Initial experiments suggest that DITN performs
competitively with a traditional configuration of
dedicated machines and well-partitioned data
for up to 10 processors at least. At the same
time, DITN gives significant flexibility in terms
of gradual scaleout and handling of heterogeneity,
load bursts, and failures.
|
Blue Gene Middleware |
Blue Gene/L, IBM's high-performance multiprocessor computer, is traditionally
used within scientific computing for the physical sciences such as
chemistry and biology. Our goal in this project is to design, implement,
and evaluate middleware components that will allow financial applications
to be run on Blue Gene/L. This work was motivated by the requirements
of a specific proprietary application involving an asset portfolio
Value-at-Risk (VaR) computation using Monte Carlo simulation.
Financial applications present technical and usage requirements
beyond that of scientific computing, including dependence on external data
sources such as SQL databases, interaction within larger business workflows,
and separation of high-level service specifications from low-level service
provisioning. In order to support such applications on Blue Gene/L, our
middleware, written in Java and running on a Blue Gene front-end node,
provides a number of core features such as an automated data extraction and
staging gateway, a standardized high-level job specification schema, a
well-defined web services (SOAP) API for interoperability with other
applications, and a secure HTML/JSP web-based interface suitable for general
users (i.e., non-developers). |
XG: a computation grid for enterprise-scale mining |
We designed and implemented a novel architecture for data processing based on
a functional fusion between a data layer and a computation layer. We show how
such an architecture can be leveraged to offer significant speedups for data
processing jobs such as data analysis and mining over large data sets.
One novel contribution of our solution is its data-driven approach. The
computation infrastructure is controlled from within the data layer. Grid
computational job submission events are based within the query processor on
the DBMS side and in effect controlled by the data processing job to be
performed. This approach allows the early deployment of on-the-fly data
aggregation techniques, minimizing the amount of data to be transfered to/from
compute nodes and is in stark contrast to existing Grid solutions that
interact with data layers mainly as external "storage".
|
For further information
- Evolving Toward the Perfect
Schedule: Co-scheduling Job Assignments and Data Replication in Wide-Area
Systems Using a Genetic Algorithm. (T. Phan, K. Ranganathan, and R. Sion)
Proceedings of the Workshop on Job Scheduling Strategies for Parallel
Processing (JSSPP), 2005.
-
Parallel Querying with Non-Dedicated Computers. (V. Raman, W. Han, and I. Narang). International Conference on Very Large Data Bases (VLDB), 2005.
- XG: A Data-driven
Computation Grid for Enterprise-Scale Mining. (R. Sion, R. Natarajan, I. Narang, W-S. Li, and T. Phan) Proceedings of DEXA,
2005.
- Spreading the Load Using Consistent Hashing: A Preliminary Report.
(Garret Swart)
ISPDC/HeteroPar 2004
|
|