Last updated on 19 March 2009
With each entry in the following list, information is included about where the algorithms described in the corresponding paper/patent have been implemented. IBM's patent licensing policies are described elsewhere. For most of the papers and patents listed here, links to online versions are provided. But to access some of the online papers, an ACM account might be needed. Other papers are available from
C. Mohan at:
|
Another bibliography of almost all of Mohan's papers is maintained at the University of Trier by Michael Ley. It includes some papers that are not listed here and it excludes patents that are listed here.
This paper describes two efficient distributed transaction commit protocols, the Presumed Abort (PA) and Presumed Commit (PC) protocols, which have been implemented in the distributed data base system R*. PA and PC are extensions of the well-known two-phase (2P) commit protocol. PA is optimized for read-only transactions and a class of multi-site update transactions, and PC is optimized for other classes of multi-site update transactions. The optimizations result in reduced inter-site message traffic and log writes, and, consequently, a better response time for such transactions. We derive the new protocols in a step-wise fashion by modifying the 2P protocol.
PA is now part of the ISO-OSI, X/Open XA, Transaction Internet Protocol (TIP) and OMG OTS standards for distributed transaction processing. It is also part of the IBM SNA LU6.2 and DRDA standards. It has been implemented in IBM's R*, DB2, OS/400 and QuickSilver, Tandem's TMF, DEC's VAX/VMS, Transarc's Encina Product Suite, CMU's Camelot, Unix System Laboratories' TUXEDO, Microsoft's DTC, Informix and University of Wisconsin's Shore.
This article presents and discusses the computation and communication model used by R*, a prototype distributed database management system. An R* computation consists of a tree of processes connected by virtual circuit communication paths. The process management and communication protocols used by R* enable the system to provide reliable, distributed transactions while maintaining adequate levels of performance. Of particular interest is the use of processes in R* to retain user context from one transaction to another, in order to improve the system performance and recovery characteristics.
This chapter describes how statements in the SQL language are processed by the R* distributed relational database management system. R* is an experimental adaptation of System R to the distributed environment. The R* prototype is currently operational on multiple machines running the MVS operating system, and is undergoing evaluation. The R* system is a confederation of autonomous, locally-administrated databases that may be geographically dispersed, yet which appear to the user as a single database. Naming conventions permit R* to access tables at remote sites without resorting to a centralized or replicated catalog, and without the user having to specify either the current location of or the communication commands required to access that table. SQL data definition statements affecting remote sites are interpreted through a distributed recursive call mechanism. Tables may be moved physically to other databases without affecting existing SQL statements. SQL data manipulation statements are compiled at each site having a table referenced in the statement, coordinated by the site at which the statement originated. As part of compilation, the distributed optimization process chooses the best place and the best way to access tables and join them together. Optimization uses dynamic programming and careful pruning to minimize total estimated execution cost at all sites, which is a liner combination of CPU, I/O, and communications (both per-message and per-byte) costs.
This article presents an algorithm to refresh the contents of data base snapshots. A data base snapshot is a read-only table whose contents are extracted from other tables in the data base. The snapshot contents can be periodically refreshed to reflect the current state of the data base. Snapshots are useful in many applications as a cost effective substitute for replicated data in a distributed data base system.
When the snapshot contents are a simple restriction and projection of a single base table, differential refresh techniques can reduce the message and update costs of the snapshot refresh operation. The algorithm presented annotates the base table to detect the changes which must be applied to the snapshot table during snapshot refresh. The cost of maintaining the base table annotations is minimal and the amount of data transmitted during snapshot refresh is close to optimal in most circumstances.
This algorithm was implemented in the R* distributed data base management system.
A method is disclosed for reserving space needed to perform "rollback" of actions which free space. Freed space must not be consumed by other transactions if the transaction which freed the space may need the space to undo the effect of the action which freed the space.
This method is implemented in DB2/2 and DB2/6000.
This paper deals with the transaction management aspects of the R* distributed data base system. It concentrates primarily on the description of the R* commit protocols, Presumed Abort (PA) and Presumed Commit (PC). PA and PC are extensions of the well-known two-phase commit protocol. PA is optimized for read-only transactions and a class of multi-site update transactions, and PC is optimized for other classes of multi-site update transactions. The optimizations result in reduced inter-site message traffic and log writes, and, consequently, a better response time. The paper also discusses R*'s approach towards distributed deadlock detection and resolution.
PA is now part of the ISO-OSI and X/Open standards for distributed transaction processing. It is also part of the IBM SNA LU6.2 and DRDA standards. It has been implemented in IBM's R*, DB2 V3 and QuickSilver, Tandem's TMF, DEC's VAX/VMS, Transarc's Encina Product Suite, CMU's Camelot and Unix System Laboratories' TUXEDO.
In this paper, we present a simple and efficient recovery method for nested transactions, called ARIES/NT (Algorithm for Recovery and Isolation Exploiting Semantics for Nested Transactions), that uses write-ahead logging and supports semantically-rich modes of locking and operation logging. ARIES/NT applies to a very general model of nested transactions, which includes partial rollbacks of subtransactions, upward and downward inheritance of locks, and concurrent execution of ancestor and descendent subtransactions. The adopted system architecture encompasses aspects of distributed data base management also. ARIES/NT is an extension of the ARIES recovery and concurrency control method which was originally developed for the single-level transaction model by Mohan et al. and which has been implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2, Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina Product Suite, and in the University of Wisconsin's EXODUS and Gamma data base machine.
Many data base management systems' query optimizers choose at most one index for accessing the records of a table in a given query, even though many indexes may exist on the table. In spite of the fact that there are some systems which use multiple indexes, very little has been published about the concurrency control or query optimization implications (e.g., deciding how many indexes to use) of using multiple indexes. This paper addresses these issues and presents solutions to the associated problems. Techniques are presented for the efficient handling of record ID lists, elimination of some locking, and determination of how many and which indexes to use. The techniques are adaptive in the sense that the execution strategies may be modified at run-time (e.g., not use some indexes which were to have been used), if the assumptions made at optimization-time (e.g., about selectivities) turn out to be wrong. Opportunities for exploiting parallelism are also identified.
A subset of our ideas have been implemented in DB2 V2R2.
Migrating to a distributed computing environment from an existing single system environment generally poses some special problems. This paper presents a case study of problems relating to migrating an existing data base management system (DBMS) to an environment in which all the disks containing the data bases are shared amongst multiple instances of the DBMS. Any DBMS instance in the complex may update any data and each instance has its own log and buffer pool. We describe a simple technique to perform data base recovery correctly in such an environment. In a single system DBMS like DB2, the log component assigns a monotonically increasing value called the log sequence number (LSN) for each log record that is written. The LSN is typically the logical address of the log record in the sequential log file. The DBMS stores in the header of each DB page the LSN of the log record describing the most recent update to that page. This is required for proper recovery after a system failure. In the shared disks environment, we illustrate the problems that would be caused if each system assigned LSNs independently of the other systems. We describe a technique to solve this problem without requiring migration of existing data, a realtime merged log, or communication amongst the systems.
A method for fetching key record data in a group of record keys according to at least a portion of a key record through an index tree is provided. The index tree provides concurrent accesses of record keys by different transactions. The index tree includes a root node connected to at least one level of nodes, each node having a key record reference to one or more nodes in a next successive level and having bottom nodes that provide access to the key data. The method consists of the steps of (1) traversing across said nodes from said root node by using said key record portion until a bottom node is reached; (2) limiting all but read accesses to the node being traversed and a previously accessed node, to other concurrent transactions; (3) identifying said key record in said bottom node; (4) limiting all but read accesses to said key record; (5) removing all access limitations to traversed nodes; (6) fetching key record data; and (7) removing the access limitation to the key record after the record data has been fetched. Further, methods for inserting and deleting record keys are provided. Additionally, a method for changing the index tree structure while allowing concurrent accesses to take place is provided.
This method has been implemented in DB2/2 and DB2/6000.
With current systems, some important complex queries may take days to complete because of: (1) the volume of data to be processed, (2) limited aggregate resources. Introducing parallelism addresses the first problem. Cheaper, but powerful computing resources solve the second problem. According to a survey by Brodie (presented at the ACM-SIGMOD International Conference on Management of Data, Chicago, May 1988), only 10% of computerized data is in data bases. This is an argument for both more variety and volume of data to be moved into data base systems. We conjecture that the primary reasons for this low percentage are that data base management systems (DBMSs) still need to provide far greater functionality and improved performance compared to a combination of application programs and file systems. This paper addresses the issues and solutions relating to intra-query parallelism in a relational DBMS supporting SQL. Instead of focusing only on a few algorithms for a subset of the problems, we provide a broad framework for the study of the numerous issues that need to be addressed in supporting parallelism efficiently and flexibly. We also discuss the impact that parallelization of complex queries has on short transactions which have stringent response time constraints. The pros and cons of the shared nothing, shared disks and shared everything architectures for parallelism are enumerated. The impact of parallelism on a number of components of an industrial-strength DBMS are pointed out. The different stages of query processing during which parallelism may be gainfully employed are identified. The interactions between parallelism and the traditional systems' pipelining technique are analyzed. Finally, the performance implications of parallelizing a specific complex query are studied. This gives us a range of sample points for different parameters of a parallel system architecture, namely, I/O and communication bandwidth as a function of aggregate MIPS.
This paper presents a novel and simple method, called Commit_LSN, for determining if a piece of data is in the committed state in a transaction processing system. This method is a much cheaper alternative to the locking approach used by the prior art for this purpose. The method takes advantage of the concept of a log sequence number (LSN). In many systems, an LSN is recorded in each page of the data base to relate the state of the page to the log of update actions for that page. Our method uses information about the LSN of the first log record (call it Commit_LSN) of the oldest update transaction still executing in the system to infer that all the updates in pages with page_LSN less than Commit_LSN have been committed. This reduces locking and latching. In addition, the method may also increase the level of concurrency that could be supported. The Commit_LSN method makes it possible to use fine-granularity locking without unduly penalizing transactions which read numerous records. It also benefits update transactions by reducing the cost of fine-granularity locking when contention is not present for data on a page. We discuss in detail many applications of this method and illustrate its potential benefits for various environments. In order to apply the Commit_LSN method, extensions are also proposed for those systems in which (1) LSNs are not associated with pages (AS/400, SQL/DS, System R), (2) LSNs are used only partially (IMS), and/or (3) not all objects' changes are logged (AS/400, SQL/DS, System R).
Commit_LSN has been implemented in DB2 V3 and MQSeries for MVS/ESA (Message Queue Manager/ESA).
This paper presents a method, called ARIES/KVL (Algorithm for Recovery and Isolation Exploiting Semantics using Key-Value Locking), for concurrency control in B+ tree indexes. A transaction may perform any number of nonindex and index operations, including range scans. Both serializable (repeatable read) and, optionally, nonserializable (cursor stability) executions of transactions are supported. The concurrent executions permitted by the locking protocols are such that correct logging and recovery are made possible. ARIES/KVL supports very high concurrency during tree traversals, structure modifications, and other operations. Unlike in System R, when one transaction is waiting for a lock on a key value in a page, reads and modifications of that page by other transactions are allowed. Further, transactions that are rolling back will never get into deadlocks. ARIES/KVL, by also using, for key value locking, the IX and SIX lock modes that were intended originally for table level locking, is able to better exploit the semantics of the operations to improve concurrency, compared to the System R index protocols. These techniques are also applicable to the concurrency control of the classical links-based storage and access structures which are beginning to appear in modern systems also.
Some of the ARIES/KVL techniques are implemented in SQL/DS and the VM Shared File System.
Object-oriented data base management systems (OODBMSs) have been the focus of intense research and commercial development activities in the last few years. Yet, many problems remain to be solved. In this paper, we discuss some aspects of supporting the object-oriented paradigm. In particular, we deal with type and collection hierarchies, complex objects, constraints and field-level types. We argue in favor of modifying a relational data base management system in order to support OO features rather than building an OODBMS from scratch. In so doing, we have tried to benefit from the implementation and user experiences gained from the different implementations of the relational model. We also discuss some issues that are not currently being given as much importance by the OODBMS developers as they deserve.
This paper presents a method, called ARIES-RRH (Algorithm for Recovery and Isolation Exploiting Semantics with Restricted Repeating of History), which is a modified version of the ARIES transaction recovery and concurrency control method introduced by Mohan, et al. in the IBM Research Report RJ6649 and implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2 V2, Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina Product Suite, and in the University of Wisconsin's EXODUS and Gamma data base machine. ARIES redoes, during restart after a system failure, all updates which had been logged to stable storage but whose effects on the data base pages had not yet been reflected in nonvolatile storage before the failure. This repeating history paradigm of ARIES includes redoing the updates of even the transactions which are to be rolled back later in the undo pass of restart. The latter may lead to some wasted work being done. It was pointed out in the ARIES paper that repeating history was required to support fine-granularity (i.e., less than page-granularity) locking. This paper further analyzes this paradigm and proposes more efficient handling of redos, especially when the smallest granularity of locking is not less than a page, by combining the paradigm of selective redo from DB2 V1. Even with fine-granularity locking, it is not always the case that all the unapplied but logged changes need to be redone. ARIES-RRH, which incorporates these changes, still retains all the good properties of ARIES - avoiding undo of undos, single pass media recovery, nested top actions, etc. In this paper, we also explain the fundamentals behind why DB2 V1's selective redo works, in spite of failures during restart recovery.
Many commercial relational data base systems provide two join methods: 1) Nested Loop join and 2) Sort Merge join. Nested Loop join exploits indexes on the inner table's join column. Sort Merge join benefits from the efficiency of bulk sequential disk accesses. Both provide good performance when selected correctly by the query optimizer. However, if incorrectly selected, Nested Loop join's cost could be prohibitive because of the large number of synchronous I/Os against the inner table's data and index pages, especially when the index is non-clustered. Sort Merge join suffers from the inability to fully apply the join predicate before sorting both the relations being joined. The time to spin the disk to access all rows of both the relations and to sort them after applying local predicates could be quite large. Our new method, called Hybrid join, first sorts the outer table on the join column. Then, the outer is joined with the index on the join column of the inner. The inner tuple is represented by its surrogate, equivalent of its physical disk address, which is carried in the index. The partial join result is sorted on the surrogate and then the inner table is accessed sequentially to complete the join result. Local predicate filtering can also be applied before the access of the inner relation through index AND/ORing. Hybrid join takes advantage of join predicate filtering through the index, efficient disk accesses by employing sequential access, and local predicate filtering through index AND/ORing. We discuss details of the Hybrid join algorithm, its prototype implementation in DB2, modeling and validation via measurements. We also discuss the parallel execution of the Hybrid Join algorithm and an efficient algorithm for the inner table index scan.
Hybrid join has been implemented in DB2 V2R3.
This paper proposes schemes for fast page transfer between transaction system instances in a shared disks (SD) environment where all the sharing instances can read and modify the same data. Fast page transfer improves transaction response time and concurrency because one or more disk I/Os are avoided while transferring a page from a system which modified it to another system which needs it. The proposed methods work with the steal and no-force buffer management policies, and fine-granularity (e.g., record) locking. For each of the page-transfer schemes, unlike most of the papers in the literature, we present both recovery and coherency-control protocols in a comprehensive fashion. Updates can be made to a page by several systems before the page is written to disk. Many subtleties involved in correctly recovering such a page in the face of single system or complex-wide failures are also discussed. Assuming that each system maintains its own log, some methods require a merged log for restart recovery while others don't. Techniques for enhancing data availability when one or more systems have failed are also presented. Our proposals should also apply to distributed, recoverable file systems and distributed virtual memory in the SD environment, and to the currently popular client-server object-oriented DBMS environments where the clients cache data.
Alert is an extension architecture designed for transforming a passive SQL DBMS into an active DBMS. The salient features of the design of Alert are reusing, to the extent possible, the passive DBMS technology, and making minimal changes to the language and implementation of the passive DBMS. Alert provides a layered architecture that allows the semantics of a variety of production rule languages to be supported on top. Rules may be specified on user-defined as well as built-in operations. Both synchronous and asynchronous event monitoring are possible. This paper presents the design of Alert and its implementation in the Starburst extensible DBMS.
In this paper, we consider the problems arising from very frequent update accesses to some data (hot spots) in the shared disks transaction environment. We present many techniques for increasing concurrency. These techniques take advantage of the semantics of the operations being performed on such data. A few of the features of our techniques include storing some user data in the global lock manager, centralized updating of data on disk by using log records generated by different systems and avoiding locking completely for certain types of operations. We also address the implications of different types of failures and recovery from them.
In this paper, we argue the importance of and need for taking into consideration concurrency control related issues in making query optimization and query processing decisions. Such considerations are very important not only for attaining good performance, but also for assuring the correctness of the results returned to the users under certain circumstances. Some of the topics that we deal with include degrees of consistency or isolation levels (repeatable read, cursor stability, ...), lock escalation, blocking of results and use of multiple indexes for a single table access (i.e., index AND/ORing). We identify some of the pieces of information relating to locking that must be available to the optimizer for it to make intelligent decisions. We also identify some situations in which locking can be avoided by taking advantage of the isolation level of the query being executed.
This paper attempts to document some of the shortcomings of the optimistic concurrency control (OCC) approach in supporting all the features expected in a full-function DBMS. Surprisingly, in spite of OCC having been around for a long time and its performance having been studied in various contexts, no complete system design, let alone a full-blown implementation, exists, as far as the author knows. The problems with OCC relate to support for access paths (indexes, hash-based storage), partial rollbacks, nested transactions, fine-granularity (e.g., record-level) conflict checking, different isolation levels, distributed transactions, and so on. The goal of this paper is to increase awareness of the implementation aspects of OCC amongst researchers and to initiate a debate about the practical utility of OCC.
In this paper, we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an industrial-strength transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and which require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports varying length objects efficiently. By enabling parallelism during restart, page-oriented redo and logical undo, it enhances concurrency and performance. We show why some of the System R paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL-based recovery methods of DB2, IMS and Tandem systems. ARIES is applicable not only to data base management systems but also to persistent object-oriented languages, recoverable file systems and transaction-based operating systems.
ARIES has been implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2/2, DB2/6000, DB2, AdStar Distributed Storage Manager (ADSM), MQSeries for MVS/ESA (Message Queue Manager/ESA), Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina, and in the University of Wisconsin's EXODUS, Gamma and SHORE.
This paper describes a technique for use when multiple instances of a data base management system (DBMS), each with its own cache (buffer pool), can directly read and modify any data stored on a set of shared disks. Global locking and coherency control protocols are necessary in this context for assuring transaction consistency and for maintaining coherency of the data cached in the multiple caches. The coordination amongst the systems is performed by a set of local lock managers (LLMs) and a global lock manager (GLM). This typically involves sending messages. We describe a technique, called LP locking, which saves locking calls when the granularity of locking by transactions is the same as the granularity of caching by the cache manager. The savings are gained by making the LLMs hide from the GLM the distinction between a transaction lock, called the L lock, and a cache-ownership lock, called the P lock, for the same object. The L and P locks for an object, though distinct at an LLM, are known as a single lock at the GLM. An LLM can grant an L or P lock request on an object locally if the combined lock mode of the L and P locks already held on that object by that LLM is equal to or higher than the requested mode. Such optimizations save messages between the LLMs and the GLM. Our ideas apply also to the client-server environment which has become very popular in the OODBMS area and to the distributed shared memory environment.
There are many real-life data base management problems which haven't received as much attention as they deserve from the data base research community. A significant number of these problems relate to supporting the efficient and flexible storage, maintenance and manipulation of large volumes of data (e.g., >100 gigabytes of data in a single table). High availability is also an important consideration. While a classical DBMS like IMS Fast Path exhibits some of these desired features, the currently-popular relational DBMSs have been very slow in providing such support. To make it possible for relational DBMSs to be deployed for managing many large enterprises' operational data and to permit ad hoc querying and knowledge mining, these features are very crucial. We discuss some of the issues involved in improving the availability and efficient accessibility of partitioned tables via parallelism, fine-granularity locking, transient versioning and partition independence. We outline some solutions that have been proposed. These solutions relate to algorithms for index building, utilities for fuzzy dumps, recovery and reorganization, buffer management, transient versioning, concurrency control and record management. Algorithms of this nature are extremely important to produce industrial-strength DBMSs.
Even though concurrency in search structures (e.g., B+ tree indexes) has been discussed frequently in the literature, the problem of providing recovery from transaction and system failures when transactions consist of multiple search structure operations has received very little attention. This paper attempts to provide a comprehensive treatment of index management in transaction systems. We present a method, called ARIES/IM (Algorithm for Recovery and Isolation Exploiting Semantics for Index Management), for controlling concurrency and logging changes to index data stored in B+ trees. ARIES/IM is based on the ARIES transaction recovery and concurrency control method which was introduced by Mohan et al. and which has been implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2/2, DB2/6000, DB2, AdStar Distributed Storage Manager (ADSM), MQSeries for MVS/ESA (Message Queue Manager/ESA), Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina Product Suite, and in the University of Wisconsin's EXODUS extensible DBMS and Gamma data base machine. ARIES/IM supports transaction semantics for locking (repeatable read or degree 3 consistency) and uses write-ahead logging (WAL) for recovery. A transaction may consist of any number of index and nonindex operations. ARIES/IM supports very high concurrency by (1) not locking the index data per se (i.e., keys), (2) locking the underlying record data in data pages only (e.g., at the record level), (3) not acquiring commit duration locks on index pages even during index structure modification operations (SMOs) like page splits and page deletions, (4) allowing retrievals, inserts, and deletes to go on concurrently with even an SMO, and (5) optionally, supporting degree 2 consistency of locking (Cursor stability). During restart, any necessary redos of the index changes are always performed in a page-oriented fashion (i.e., without traversing the index tree) and, during normal processing and restart, undos are performed in a page-oriented fashion, whenever possible.
A subset of ARIES/IM has been implemented in the OS/2 Extended Edition Database Manager, DB2/2 and DB2/6000. Since the locking ideas of ARIES/IM have general applicability, some of those ideas have also been incorporated in SQL/DS and the VM Shared File System even though those systems are based on System R which uses the shadow-page technique for recovery.
As relational DBMSs become more and more popular and as organizations grow, the sizes of individual tables are increasing dramatically. Unfortunately, current DBMSs do not allow updates to be performed on a table while an index (e.g., a B+ tree) is being built for that table, thereby decreasing the systems' availability. This paper describes two algorithms in order to relax this restriction. Our emphasis has been to maximize concurrency, minimize overheads and cover all aspects of the problem. Builds of both unique and nonunique indexes are handled correctly. We also describe techniques for making the index-build operation restartable, without loss of all work, in case a system failure were to interrupt the completion of the creation of the index. In this connection, we also present algorithms for making a long sort operation restartable. These include algorithms for the sort and merge phases of sorting.
We present efficient and flexible methods which permit read-only transactions that do not mind reading a possibly slightly old, but still consistent, version of the data base to execute without acquiring locks. This approach avoids the undesirable interferences between such queries and the typically shorter update transactions that cause unnecessary and costly delays. Indexed access by such queries is also supported, unlike by the earlier methods. Old versions of records are maintained only in a transient fashion. Our methods are characterized by their flexibility (number of versions maintained and the timing of version switches, supporting partial rollbacks, and different recovery and buffering methods) and their efficiency (logging, garbage collection, version selection, and incremental, record-level versioning). Distributed data base environments are also supported, including commit protocols with the read-only optimization. We also describe efficient methods for garbage collecting unneeded older versions.
This paper presents solutions for the problem of performing recovery correctly in shared disks (SD) and client-server (CS) architectures. In SD, all the disks containing the data bases are shared amongst multiple instances of the DBMS. Any DBMS instance in the complex may directly access and update any data. Each instance maintains its own log and buffer pool. The local logs are later merged for media recovery purposes and, possibly, for restart recovery purposes. In CS, the server manages the disk version of the data base. The clients, after obtaining data base pages from the server, cache them in their buffer pools. Clients perform their updates on the cached pages and produce log records. The log records are buffered locally in virtual storage and later sent to the single log at the server. In write-ahead logging (WAL) systems, a monotonically increasing value called the log sequence number (LSN) is associated with each log record. Every data base page contains the LSN of the log record describing the most recent update to that page. This is required for proper recovery after a system failure. We describe a technique with some valuable features (e.g., avoiding reading empty pages and supporting the Commit_LSN optimization) for generating monotonically increasing LSNs in SD and CS architectures without using synchronized clocks.
A method and apparatus for concurrent modifications of an index tree in a transaction processing system. The index tree includes at least one root node having a key record reference to one or more nodes in a next lower ordered level and at least one bottom node providing access to key records. Transactions including a structure modification operation are performed by traversing the index tree to the selected node and then setting an indication of the pendency of a structure modification operation. Concurrent key record inserts or deletes are permitted throughout the index tree where no indication of a pending structure modification operation is present and are delayed where a pending structure modification operation is indicated. Similarly, transactions which include a key record delete may require a structure modification operation in the event the transaction does not reach new point of consistency and must be undone. Therefore, an indication of each key record delete which has not yet reached a new point of consistency is set and concurrent key record inserts or deletes are also delayed until the possibility of a structure modification operation is completed.
This method has been implemented in DB2/2 and DB2/6000.
Change processing of a replica database is accomplished by separating redo records obtained from the transaction log of a primary database into respective queues. The redo records are separated such that all transaction records for a unit of transfer (page) of the primary database are placed on the same queue in log sequence. Each queue is linked exclusively to one of a plurality of parallel queue servers. Each queue server applies to the replica database the redo records in the queues which it exclusively serves. The replica database is thereby made consistent with the primary data by a lock-free updating mechanism which services the pages of the replica database in parallel.
This method has been implemented as part of the Remote Site Recovery (RSR) feature of IMS/ESA V5.
For high availability in disaster recovery situations, it is desirable to reflect data base changes made by a transaction processing system continuously on a (replicated) data base maintained at a remote site. To be able to make the backup system take over as soon as possible and to keep the resource consumption on the backup system low, we describe a general method for exploiting parallelism in the processing of the log records received at the backup system. As much as possible, this method tries to take advantage of the existing code for performing restart recovery after a failure of the transaction system. The techniques described here are very general in that they could also be used in the context of restart and media recovery of the primary system to improve the time it takes to complete such processing. We also propose techniques for (1) checkpointing the state of the backup system so that recovery can be performed quickly in case the backup system fails and (2) allowing new transaction activity to begin even as the backup is taking over the role of the primary when the old primary fails. Our approach is general enough to accommodate even the ARIES-type recovery and concurrency control methods which support high concurrency and high efficiency via write-ahead logging, nested transactions, operation logging and semantically-rich modes of locking. We also discuss some problems relating to distributed transactions, the shared disks (data sharing) transaction environment, and combining executions of 1-Safe and 2-Safe transactions in a single system. We propose some possible solutions for dealing with these problems.
The parallel redo method for the backup has been implemented as part of the Remote Site Recovery (RSR) feature of IMS/ESA V5.
Larson has proposed a dynamic hashing algorithm called Linear Hashing with Separators (LHS) that, given a unique primary key value, uses a table in memory to allow the retrieval of the corresponding record in the file in one page access to secondary storage. Larson considers LHS to be the first practical method offering one-access retrieval for large dynamic files. He did not discuss the impact of concurrent operations by different users, some of whom are reading the file while others are performing operations like inserts, deletes, updates, file expansions or file contractions which can cause relocations of records. We present a method, called ARIES/LHS (Algorithm for Recovery and Isolation Exploiting Semantics for Linear Hashing with Separators), for controlling such concurrent operations with fine-granularity (e.g., record) locking, while guaranteeing serializability. ARIES/LHS prevents rolling back transactions from getting involved in deadlocks. It also includes recovery techniques for handling transaction and system failures, while allowing multiple operations in each transaction.
An atomic commit protocol can ensure that all participants in a distributed transaction reach consistent states, whether or not system or network failures occur. The atomic commit protocol used in industry and academia is the well-known two-phase commit (2PC) protocol, which has been the subject of considerable work and technical literature for some years. Much of the literature focuses on improving performance in failure cases by providing a nonblocking 2PC that streamlines recovery processing at the expense of extra processing in the normal case. We focus on improving performance in the normal case based on two assumptions: first that the networks and systems are becoming increasingly reliable, and second that the need to support high-volume transactions requires a streamlined protocol for the normal case. In this paper, various optimizations are presented and analyzed in terms of reliability, savings in log-writes, network traffic and reduction in resource lock time. Its unique contributions include the disclosure of a number of new optimizations, the analysis of the optimizations, and a thorough comparison of the different approaches.
We describe an efficient method for supporting incremental and full archiving of data bases (e.g., individual files). Customers archive their data bases quite frequently to minimize the duration of data outage. Because of the growing sizes of data bases and the ever increasing need for high availability of data, the efficiency of the archive copy utility is very important. The method presented here minimizes interferences with concurrent transactions by not acquiring any locks on the data being copied. It significantly reduces disk I/Os by not keeping on data pages any extra tracking information in connection with archiving. These features make the archive copy operation be more efficient in terms of resource consumption compared to other methods. The method is also flexible in that it optionally supports direct copying of data from disks, bypassing the DBMS's buffer pool. This reduces buffer pool pollution and processing overheads, and allows the utility to take advantage of device geometries for efficiently retrieving data. We also describe extensions to the method to accommodate the multisystem shared disks transaction environment. The method tolerates gracefully system failures during the archive copy operation.
This paper very briefly summarizes the features and technologies implemented in the IBM relational DBMS products. The topics covered include record and index management, concurrency control and recovery methods, commit protocols, query optimization and execution techniques, high availability and support for parallelism and distributed data. Some indications of likely future product directions are also given.
The outer join operation is being introduced in major relational DBMSs, and is already proposed as part of the emerging SQL2 standard. The outer join operation plays an important role in the handling of complex object queries, path expressions in object-oriented query languages, and universal and existential subqueries in relational languages. Thus, outer join is an important primitive. Similar to regular joins, good performance of outer join is critical in query processing. In this paper, we introduce a unified algorithm for handling of outer join and subqueries. We first introduce a general execution algorithm for an outer join of any complexity, as required by the SQL2 standard, where the outer join predicate can be a full WHERE clause of SQL, including conjuncts, disjuncts, subqueries, etc. Typically, outer join queries are much simpler, e.g., involving equijoins. For this category of queries we introduce, in considerable detail, a much more efficient algorithm, called the specialized algorithm. One important property of these algorithms is that they allow several existing efficient join algorithms to be extended to support outer join. We also present methods for parallelizing the execution of an outer join. We then refine these methods for handling SQL subqueries.
We present a cost-effective method for improving data availability during restart recovery of a data base management system (DBMS) after a failure. The method achieves its objective by enabling the processing of new transactions to begin even before restart recovery is completed by exploiting the Commit_LSN concept. It supports fine-granularity (e.g., record) locking with semantically-rich lock modes and operation logging, partial rollbacks, write-ahead logging, and the steal and no-force buffer management policies. The overhead imposed by this method during normal transaction processing is insignificant. We require very few changes to an existing DBMS in order to support our method. Our method can be implemented with different degrees of sophistication depending on the existing features of a DBMS.
Results of a relational data base management system are joined in a process requiring, first, existence of an index on the join column of the inner table, and, second, ordering on the join columns of the first table. First, the index on the inner table's join column is scanned for rows of the inner table having join column values matching such values of rows in the outer table. This is done in a single pass through the outer table. Next, a temporary work table containing the identifiers of inner table rows having join column values matching those of the outer table is produced by concatenating the row identifiers to their matching outer table rows. Following this, the temporary work table is ordered by the identifiers. Last, the identifier list of inner table rows is used to retrieve the corresponding rows of the inner table. All predicates local to the inner table are applied to the retrieved rows, and those satisfying these local predicates are combined with their matching outer table rows and returned to the user.
This join method has been implemented in DB2 V2R3.
Apparatus and method for reading data pages in a transaction processing system without locking the pages are disclosed. The system maintains a Global_Committed_LSN identifying the oldest uncommitted transaction accessing any of the data, and Object_Committed_LSNs identifying the oldest uncommitted transactions accessing particular files, tables and indexes. Each data page includes a Page_LSN identifying the last transaction to have updated the page. To read a page, a transaction first latches the page, and compares the page's Page_LSN with the Global-Committed_LSN, or with the page's respective Object_Committed_LSN. If the Page_LSN is older than the Committed_LSN with which it was compared, then the transaction reads the page without locking it, since there can be no uncommitted transaction in process which might have updated the page's data. However, if the Page_LSN is younger than the Committed_LSN, the page is locked before being read.
This method has been implemented in DB2 V3 and MQSeries for MVS/ESA (Message Queue Manager/ESA).
Disclosed is a new way to determine if data is in the committed state without locking through the use of commit bits. Under certain conditions, transactions can easily determine that the data is committed simply by checking the value of a bit.
This method has been implemented in DB2/390 V3.
A number of interesting problems arise in supporting the efficient and flexible storage, maintenance and manipulation of large volumes of data (e.g., >100 gigabytes of data in a single table). Very large tables are becoming common. Typically, high availability is an important requirement for such data. The currently-popular relational DBMSs have been very slow in providing the needed support. To make it possible for RDBMSs to be deployed for managing many large enterprises' operational data and to support complex queries efficiently, these features are very crucial. We discuss some of the issues involved in improving the availability and efficient accessibility of partitioned tables via parallelism, fine-granularity locking, transient versioning and partition independence. We outline some solutions that have been proposed. These solutions relate to algorithms for index building, utilities for fuzzy backups, incremental recovery and reorganization, buffer management, transient versioning, concurrency control and record management.
A method of reducing the number of messages required for sync point (commit or backout) operations by leaving out nodes that have not participated in the corresponding transaction. A two-phase sync point protocol is used in a distributed transaction processing network to commit or backout transactions in the network. In response to the beginning of sync point operations on a transaction x, each node determines if each of its partner nodes stated on the sync point operation for transaction x-1 that the partner could be left out of sync point operations for transaction x. If a partner node did so state that it could be left out, the present node determines if the partner node was included by the present node during the present transaction x. If the partner node was not included during the present transaction, the present node excludes the partner node from the present sync point operations.
It is useful for an overall understanding of the invention to summarize the general rules first. It is not okay for a node to be left_out of sync_point operations if a partner node is also left_out. This can create a permanent lock out. Therefore, if a child node says to its parent that it is okay for the parent to leave_out the child, then the parent cannot also tell the child that it is okay to leave_out the parent.
This method is now part of the SNA LU6.2 support for the presumed abort commit protocol. It has been implemented in DB2 V3.
A method of controlling entry of a block of data is used with a high-speed cache which is shared by a plurality of independently-operating computer systems in a multi-system data sharing complex. Each computer system has access both to the high-speed cache and to lower-speed, upper-level storage for obtaining and storing data. Management logic in the high-speed cache assures that the block of data entered into the cache will not be overwritten by an earlier version of the block of data obtained from the upper-level storage.
This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2 V4R1.
In a multi-system data sharing complex, a database system can write updated pages to a shared electronic store for a fast write. Other database systems can obtain pages written to the shared store for further modification without the pages first being written to stable storage. However, pages are eventually written to the stable storage in a castout process. Recovery of a database from failure of the shared store is bounded by determination of a recovery boundary which, when applied to the union of database system transaction logs, establishes a point in front of which are found log records of modifications to pages which were in the shared store when it failed. These log records are applied to page versions obtained from stable storage to recover from failure of the shared store.
This method is part of the S/390 Parallel Sysplex Coupling Facility and DB2 V4R1.
An improved concurrency control system for application to a distributed concurrent transaction and query processing system using multi-version database records to overcome delays arising from lock conflicts. Read-only queries are afforded a consistent "stable state" of the database during the life of the query. Updating transactions requiring locks can proceed without waiting for the termination of long queries. At least two database versions are necessary, although availability of more versions permits long read-only queries to phase-out over time without forcing new queries to use aged "stable-state" data and without roll-back. Read-only queries can be terminated and converted to locking transactions to permit an update of the "stable-state" database version before the queries would normally terminate. A novel record key structure having a plurality of substructures corresponding to the several database versions is used to access database records. Rapid selection of proper record version and efficient version tracking and updated is effected using several bit-mapped transaction index tables.
A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and sharing data. Management logic in the high-speed cache assures that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache.
This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2 V4R1.
We present several methods which relate to space management in a transaction system supporting fine-granularity (e.g., record) locking. These methods enable varying length records to be supported efficiently by permitting garbage collection to be performed within a page without the moved records having to be locked or the movements having to be logged. We present methods to do the following: (1) When a transaction releases space, efficiently prevent that space from being consumed by other transactions until that transaction terminates, while allowing the same transaction to reuse the space it freed. (2) Under the correct circumstances, avoid reading a totally empty deallocated page from disk during page reallocation. (3) Updating and logging of free space inventory pages' (FSIPs') changes for correct recovery. (4) Reduce locking during a table scan by a transaction using the isolation level of cursor stability. Our methods improve concurrency and space utilization, and provide I/O and CPU savings. Our space reservation and FSIP logging methods have been implemented in DB2 V3 in preparation for DB2's support of record locking.
This paper briefly summarizes some recent IBM work in the areas of distributed commit protocols, and recoverable messaging and queuing. We discuss the original Presumed Nothing commit protocol of SNA LU 6.2 and the current industry standard Presumed Abort (PA) protocol which we originally developed in IBM's R* project. We also discuss Generalized Presumed Abort (GPA) which resulted from the integration of PA into LU 6.2. GPA has been implemented in DB2 V3. We provide a brief introduction to the Message Queue Interface (MQI), an architected application programming interface, and Message Queue Manager (MQM) MVS/ESA, one of the IBM MQSeries products that implements MQI. Some internal design features of MQM are also described.
In order to provide realtime responses to complex queries involving large volumes of data, it has become necessary to exploit parallelism in query processing. This paper addresses the issues and solutions relating to intra-query parallelism in a relational DBMS. We provide a broad framework for the study of the numerous issues that need to be addressed in supporting parallelism efficiently and flexibly. The different stages of query processing during which parallelism may be gainfully employed are identified. Parallelism can be exploited for both CPU and I/O activities. As a first step in this direction, I/O parallelism has been introduced in DB2 V3R1. We describe many aspects of the DB2 implementation. These include compile time as well as run time decisions, especially regarding the degree of parallelism. At compile time, the best sequential plan produced by the preexisting optimizer logic is made parallel by the new "post-optimizer" logic in DB2 V3. If necessary, the degree of parallelism so determined is adjusted at run time based on host variable values and resource availability.
This invention relates to a method for returning a log-based, transaction-oriented system to a transaction consistent state following a failure. More particularly, this invention relates to methods which permit the effects of a single transaction to be partially or fully annulled during normal system operation. In this regard, transactions are delimited at the user level by BEGIN, COMMIT, or ABORT primitives. With respect to the transaction log, UNDO and REDO information from the log is used to define and control both system recovery and partial or complete transaction rollback. More particularly, the method of the invention relates to transaction-oriented systems of the type which support concurrent execution of multiple transactions, and further of the type permitting fine-grained concurrency control mechanisms and consequent overlapping of transaction execution.
This method (ARIES) has been implemented in OS/2 Extended Edition Data Base Manager, DB2 DB2 V3 and V4, DB2/2, DB2/6000, Workstation Data Save Facility/VM, ADSTAR Distributed Storage Manager (ADSM), MQSeries for MVS/ESA (Message Queue Manager/ESA), Starburst extensible DBMS, QuickSilver distributed operating system, Transarc's Encina Product Suite, and University of Wisconsin's Gamma and EXODUS DBMSs, and SHORE persistent object system.
This paper presents an algorithm, called ARIES/CSA (Algorithm for Recovery and Isolation Exploiting Semantics for Client-Server Architectures), for performing recovery correctly in client-server (CS) architectures. In CS, the server manages the disk version of the database. The clients, after obtaining database pages from the server, cache them in their buffer pools. Clients perform their updates on the cached pages and produce log records. The log records are buffered locally in virtual storage and later sent to the single log at the server. ARIES/CSA supports write-ahead logging (WAL), fine-granularity (e.g., record) locking, partial rollbacks and flexible buffer management policies like steal and no-force. It does not require that the clocks on the clients and the server be synchronized. Checkpointing by the server and the clients allows for flexible and easier recovery.
A fast technique for transferring units of data between transaction systems in a shared disk environment. The owning system, having updated the page, generates a version number for the page which is stored with a lock possessed by the owning system. When a requesting system seeks a record on the page, its request for a lock illicit an an indication that a more recent version of the page is required in the local memory. The buffer management component of a DBMS, with assistance from the lock management, triggers a memory to memory transfer of the page from the owning DBMS to the requesting DBMS using a low overhead communication protocol. The transfer of page is without disk I/O or the log I/O for the updates made to the page.
Enhanced data availability occurs in a write-ahead logging, transaction-oriented database system by permitting new transactions to acquire access to data while restart recovery operations are proceeding. The invention permits new transactions to acquire access to data during restart recovery UNDO processing on the condition that the last update to the data occurred before a commit point measured by the earliest-commencing transaction with uncommitted updates which was still executing when a system failure initiated restart recovery operations. During REDO processing, a transaction is permitted access to data which, in addition to meeting the commit point condition, is not in a data structure subject to the REDO processing.
Workflow management deals with the coordinated execution of business processes. These processes are often of long duration, weeks or even months, are very large, involving many agents and distributed resources, and critical to the enterprise. As organizations adopt workflow technology, they become increasingly dependent on the system to carry on their daily business activities. Hence, a workflow management system, WFMS, must provide availability, reliability and scalability to cope with environments where the range of potential failures is very broad, from system failures to semantic failures. In this paper, in the context of IBM's Exotica project, we propose a new architecture to address the availability and scalability issues, and to deal with system failures such as site and communication failures. We also discuss how features of advanced transaction models can be incorporated into workflow modeling to handle semantic failures. These solutions enhance the capabilities of workflow systems and make them more adequate for large enterprises.
In several commercial database management systems, the shared disks (SD) architecture has been chosen to provide increased processing capacity and availability of data. In SD, any instance of a DBMS executing on a node of a cluster of loosely coupled processors can access all the data in the database. To coordinate accesses to the same data from the different systems, these instances use global locks. In addition, they implement a buffer coherency control protocol so that the current version of data is available locally when needed. It is important to reduce significant overheads associated with global locking and buffer coherency while supporting high concurrency. In this paper, we describe locking and latching techniques which would be used for synchronizing accesses to hot spot shared data structures in a typical relational DBMS in the SD environment. In particular, we describe synchronization techniques for operations involving B+-tree indexes and for updating space map pages. Many of these techniques have been implemented in Version 4 of DB2 for MVS/ESA. Our techniques are also applicable to the client-server context with client caching of data, as is done in many OODBMSs.
The hcC-tree is an index structure for multiple sets (on a common attribute) which are useful in object oriented databases as well as relational databases. This paper focuses on the problem of designing concurrency control and recovery algorithms for the hcC-tree. This work is based on ARIES/KVL and ARIES/IM which provide ACID properties for transactions containing multiple index operations on a B+ tree. The hcC-trees, though based on B+ trees, are significantly different from B+ trees. In this paper, we propose an algorithm that provides high concurrency while still retaining the performance advantages of a hcC-tree.
We discuss several disk read-write optimizations that are implemented in different transaction systems and disk hardware to improve performance. These include: (1) When multiple sectors are written to disk, the sectors may be written out of sequence (SCSI disk interfaces do this). (2) Avoiding initializing pages on disk when a file is extended. (3) Not accessing individual pages during a mass delete operation (e.g., dropping an index from a file which contains multiple indexes). (4) Permitting a previously deallocated page to be reallocated without the need to read the deallocated version of the page from disk during its reallocation. (5) Purging of file pages from the buffer pool during a file erase operation (e.g., a table drop). (6) Avoiding logging for bulk operations like index create.
We consider a system which implements the above optimizations and in which a page consists of multiple disk sectors and recovery is based on write-ahead logging using a log sequence number on every page. For such a system, we present a simple method for guaranteeing the detection of the partial disk write of a page. Detecting partial writes is very important not only to ensure data integrity from the users' viewpoint but also to make the transaction system software work correctly. Once a partial write is detected, it is easy to recover such a page using media recovery techniques. Our method imposes minimal CPU and space overheads. It has been implemented in DB2/6000 and AdStar Distributed Storage Manager (ADSM).
In this paper we present the Exotica research project, currently in progress at the IBM Almaden Research Center. One of the goals of the project is to bring together industrial trends and research issues in the workflow area. It is for this reason that we have focused on a particular commercial product, FlowMark, IBM's workflow product. However, our results are easily generalized to general workflow management systems since FlowMark's model is similar to that proposed by the Workflow Management Coalition. In particular, the paper contains a high-level overview of our research in six specific areas that are not product specific.
A computer-implemented method for minimizing the amount of time to access current data in a database which may be stored wholly in a DASD-oriented external storage subsystem or partly in DASD and partly in a high-speed electronic store while maintaining coherency of the data with respect to multiple user systems.
Current computer and network technologies allow organizations to decentralize resources in ways not foreseeable few years ago. However, cooperative work in a decentralized environment requires tools to hide the complexity generated by heterogeneous and distributed systems. Workflow Management Systems (WFMS) are a first generation of products that attempt to manage the execution of business processes by large numbers of users distributed over a wide area and using heterogeneous resources. They are a very promising venue for collaborative systems but, in most cases, the autonomy of the users is greatly restricted due to architectural and design considerations. This is a severe restriction, especially when considering the emergence of mobile computing, and the increase in use of lap tops and small computers in which connectivity is only occasional. In this paper, we discuss how disconnected workflow clients can be supported while preserving the correctness of the overall execution and allowing coordinated interaction between the different users. Disconnected clients provide a great deal of flexibility to a workflow management system and enhance its resilience to failures.
A method for detecting partial page writes in pages spanning multiple sectors of a sector organized multiple tracked storage facility in a page oriented, log based transaction management system. During a page write to storage from a buffer, a status bit is embedded at the end of each page sector and a status byte in the last page sector, the status byte is complemented, and each status bit is swapped with a counterpart in the status byte as it is being written out to storage. During a page read in the buffer from storage the status bit values of each page are swapped with their byte counterpart and a partial write detected as a mismatch of the bits in the status byte. Page recovery involves recreating a page from said log upon detection of either a partial sector write or a partial page write by redoing all accessing events on the log between a predetermined point to an end of log including unconditionally redoing of all format page events logged in said interval. Partial page write error is also detected where page is allocated to the buffer while avoiding a page read from storage.
This method has been implemented in DB2 Client Server (DB2/6000, DB2/2, ...) and AdStar Distributed Storage Manager (ADSM).
This paper first describes the problems associated with concurrency control and recovery in B+ tree indexes and then presents two methods, called ARIES/KVL (Algorithm for Recovery and Isolation Exploiting Semantics using Key-Value Locking) and ARIES/IM (Algorithm for Recovery and Isolation Exploiting Semantics for Index Management), for solving those problems. These methods allow a transaction to perform any number of nonindex and index operations, including range scans. The concurrent executions permitted by them are such that serializability is guaranteed, and correct logging and recovery based on write-ahead logging are made possible. Compared to the index locking methods of System R and DB2, these methods support higher levels of concurrency during tree traversals, structure modifications, and other operations. By locking individual keys, rather than key values, ARIES/IM is able to support more concurrency than ARIES/KVL. Our methods have been implemented in the DB2 family of products, and in SQL/DS and the VM/ESA Shared File System.
In the past few years there has been an increasing interest in workflow applications as a way of supporting complex business processes in modern corporations. Given the nature of the environment and the technology involved, workflow applications are inherently distributed and pose many interesting challenges to the system designer. In most cases, a client/server architecture is used in which knowledge about the processes being executed is centralized in one node to facilitate monitoring, auditing, and to simplify synchronization. In this paper, we propose a novel distributed architecture, Exotica/FMQM, for workflow systems in which the need for such a centralized database is eliminated. Instead, we use persistent messages as the means to store the information relevant to the execution of a business process. Our approach is to completely distribute the execution of a process so individual nodes are independent. The advantages of this approach are increased resilience to failures and greater scalability and flexibility of the system configuration.
In a partitioned database system of the Shared Nothing type, one or more secondary replicas of each partition are maintained by spooling (i.e., asynchronously sending) modified (usually called dirty) pages from the primary replica to the secondary replica(s) rather than by using a synchronous page update or by sending log entries instead of entire pages. A Write-ahead Log protocol is used so that a dirty page is not forced to non-volatile storage until a log record of the modification is created and written to non-volatile storage. Replica updating does not delay the committing of transactions because replica updating is done asynchronously with respect to transaction processing. Since dirty pages are sent rather than only log entries, disk accesses and processing at the secondary replica(s) arising from the maintaining of the replicas are minimized as well. Only one centrally accessible log is maintained for all replicas of the same partition.
This paper presents an overview of the Exotica project currently in progress at the IBM Almaden Research Center. The project aims at exploring several research areas from advanced transaction management concepts to client/server architectures and mobile computing within the context of business processes and workflow management. The ultimate goal is to incorporate these ideas into IBM's products and prototypes. The project involves IBM groups in Almaden (U.S.A.), Hursley (U.K.), Boeblingen (Germany), and Vienna (Austria). In this paper, we briefly describe two IBM products, FlowMark, a workflow management system, and MQSeries, a messaging system, as the environments in which we are focusing our research. We also discuss some of our results in the areas of availability, replication, distribution, and advanced transaction models, as well as describe our future research directions.
A data processing system for the storage of persistent and non-persistent data in a queue, and a method for the storage of data which is required to survive a system failure (persistent data) and data which is not required to survive a system failure (non-persistent data) on a single queue, are disclosed. The method involves receiving persistent and non-persistent data to be stored in a queue, then marking the data in time sequence order, before storing the persistent data in a first set of data pages and the non-persistent data in a second set of data pages. Upon receiving a request for removal of data from the queue, both the first and second sets of pages are checked and the data is removed in time sequence order. A log is preferably created to enable recovery in the event of failure and restart of the queue. When receiving and removing persistent data to be stored in and to be removed from the queue, log entries are made of changes to the persistent data only. Before the receiving of the data, a table in space map pages is created indicating which pages available in storage are free, which are allocated for persistent data, and which are allocated for non-persistent data. After receiving data and removing data, the table is updated. In the event of a failure and restart of the queue, space map page table is scanned and updated to indicate that all pages containing non-persistent data are free.
This method has been implemented in Message Queue Manager/ESA (MQSeries).
An atomic commit protocol can ensure that all participants in a distributed transaction reach consistent states, whether or not system or network failures occur. The atomic commit protocol used in industry and academia is the well-known two-phase commit (2PC) protocol, which has been the subject of considerable work and technical literature for some years.
Much of the literature focuses on improving performance in failure cases by providing a non-blocking 2PC that streamlines recovery processing at the expense of extra processing in the normal case. We focus on improving performance in the normal case based on two assumptions: first, that networks and systems are becoming increasingly reliable, and second, that the need to support high-volume transactions requires a streamlined protocol for the normal case.
In this paper, various optimizations are presented and analyzed in terms of reliability, savings in log writes and network traffic, and reduction in resource lock time. The paper's unique contributions include the description of some optimizations not described elsewhere in the literature and a systematic comparison of the optimizations and the environments where they cause the most benefit. Furthermore, it analyzes the feasibility and performance of several optimization combinations, identifying situations where they are effective.
Disk check bits refer to bit patterns stored in particular bytes of a page which are used to detect errors in writing the page to storage. Every time a page is obtained from storage, changed from the version retained in storage, and written back to storage, the check bit pattern on the changed page is altered to be different from the bit pattern on the storage page. This is because the changed page overwrites the stored page. The invention provides a method for managing the check bits in a multi-DBMS system employing a high-speed shared electronic store as a store-in cache for all pages obtained from disk storage. When a page is first obtained from disk storage by a DBMS and changed, check bit information for the page is maintained in a directory of the storing cache which indicates what the patterns are for the version of the page in the disk storage. All pages which are modified are stored in the store-in cache and are only returned to disk storage from the cache. Therefore, when a page is to be written to disk storage, the DBMS writing the page to storage processes the check bits on the page itself, changing them as required based on the check bit information stored in the directory for the page.
This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2/MVS V4.
Database files containing records include pages called free space inventory pages (FSIPs) describing field space information relating to data pages. In a transaction processing system, the invention provides correct sequences for logging of updates to FSIPs when the updates are required by updates or UNDOs to data records. If, during operation to insert a data record to a data page, the FSIP containing free space information for the page indicates that the page is empty and there are no uncommitted deletes to the page, page I/O is avoided by formatting the page directly in a data buffer pool without reading the page from disk. During a cursor stability-level table scan with data record-level locking, excessive I/O and some record locking are avoided by using space reservation fields on an FSIP to ensure that there is no space reserved on the data page for a later undo of uncommitted data records deletes from the page.
This method has been implemented in DB2/MVS V3.
A method and means for achieving files of modifiable pages in a log based phased commit transaction management system (TMS) in which those pages which have been modified since the last full or incremental backup do not require during the copy operation any modifications to the page itself but merely to a common status page. This is accomplished by management of a pair of global log sequence numbers. Comparison between a first number (ICBU_LSN) and each data page LSN as the page is modified permits the common status page to be updated to correctly reflect the changed status. Subsequent modifications to the same page do not require amendment of the status page. The status page indicia are reset as part of the backup procedure and for ascertaining the page copy set for incremental copying. The ICBU_LSN assumes one of two values as a function of the copy operation and another value for processing page modifications after the copy operation. A second number (ICRF_LSN) is used in the restoration of a file after the file has been partially restored by a page merge in page number order from full and incremental copies. In this case, the ICRF_LSN defines the point in the log for redo since the most recent copy was made.
This method has been implemented in the ADSTAR Distributed Storage Manager (ADSM) and in DB2/MVS V4.
In transaction processing systems, it is known for resource-updating operations within a transaction to be backed out at the request of an application program following detection of error conditions during processing of the transaction. If the error condition is very likely to recur, it may be undesirable for the operations request to be presented to the application exactly as before. A transaction-oriented data processing system and a method of transaction-oriented data processing are provided in which operation requests or data packets may be marked to be excluded from the effects of application-requested backouts.
This method has been implemented in Message Queue Manager/ESA (MQSeries).
In recent years, numerous transaction models have been proposed to address the problems posed by advanced database applications. A few of these models have been implemented as prototypes but almost none are being used in a commercial product. In this paper, we make the case that such models are too centered around databases to be useful in real environments. Many of the new applications are heterogeneous, both in the supporting platforms and tools involved, and distributed over a wide geographic area. They raise a variety of issues that are not addressed at all by transaction models, which may explain the lack of success of the latter. These same issues, however, are the basis for many existing workflow systems, which are having considerable success as commercial products in spite of not having a solid theoretical foundation. We explore some of these issues and show that, in many aspects, workflow models are a superset of transaction models and have the added advantage of incorporating many ideas that to this date have remained outside the scope of traditional transaction processing.
In this paper we describe the design and implementation of workflow management applications on top of Lotus Notes Release 4. We elaborate on various design issues for Notes workflow applications and introduce Notes Release 4's native workflow concepts like agents, events, macros, LotusScript, OLE2 capabilities, and doclinks, which make Notes a powerful workflow tool. The idea of the paper is the use of the Workflow Reference Model of the Workflow Management Coalition to define structured workflows, and execute these workflows through the exploitation of Notes Release 4's native workflow concepts.
A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data. Management logic and the high-speed cache assures that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache with non-blocking serialization allowing access to a changed version in the cache while castout is being performed. Castout classes are provided to facilitate efficient movement from the shared cache to DASD.
This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2/MVS V4.
A method is disclosed for a database system for off-loading, to disk controller, the extraction of committed data. The system first picks a Commit_LSN value and insures all the data modified prior to the Commit_LSN value is processed following the DBMS policy of reducing some disk I/Os or not for the modified pages cached in the system. If the policy is not to do disk I/Os for such pages, then the system places the identifiers of those pages in an ignore list. Otherwise, the system writes those pages to disk and empties the ignore list. Afterwards, the system forwards the ignore list and the Commit_LSN along with information regarding the data to be processed to the controller. The controller performs the off-load function by reading from disk every page identified by the system except those in the ignore list, and determining, for each page, if the page's Page_LSN value is less than the Commit_LSN. If it is, then the controller processes the page and adds any qualifying data from that page to a defined answer set. Otherwise, the controller adds the Page_ID for that page to a defined exception list. The controller then passes the answer set and the exception list to the system. The system processes the pages identified in the exception list and those in the ignore list. The system consolidates these answers with the answer set returned by the controller for presentation to the user.
Workflow management systems (WFMSs) support the modeling, coordinated execution and monitoring of business processes within an organization. In particular, very large workflow management systems are used in organizations where the number of users may be in the tens of thousands, the number of process instances in the hundreds of thousands, and the number of sites in the thousands, all distributed over wide geographic areas. In these environments, failure of the WFMS or the underlying workflow database which stores the meta-information about the processes is not tolerable and hence continuous availability is a key aspect of the system. This paper addresses the problem of providing high availability in workflow management systems by proposing a backup technique which ensures that execution of a process instance can be resumed at any point in time in the event of failures. An essential characteristic of our backup scheme is that it allows the user to define different availability levels in order to avoid high costs for maintaining backups for all process instances. The backup scheme to support the different availability levels is implemented using the workflow semantics, which we believe will --- (i) make it independent of the underlying workflow database, thus permitting the use of heterogeneous databases as primary and backup, (ii) reduce overheads, especially when compared to backup schemes provided by database systems.
Workflow Management Systems (WFMSs) automate the execution of business processes in environments encompassing large number of users distributed over a wide geographic area and using heterogeneous resources. Current implementations allow the definition and controlled execution of complex and long lived business processes as the basis for an enterprise-wide collaborative system but, in most cases, the autonomy of the users is greatly restricted due to architectural and design considerations. In particular, existing systems are built around a centralized server. As a result, users need to maintain an uninterrupted connection with the server to perform the different tasks assigned to them. This is a severe restriction, especially when considering the emergence of mobile computing, and the increase in use of laptops and small computers which are connected to the network only occasionally and which will, undoubtedly, be the tool of choice for many users. This paper addresses the problem of supporting disconnected workflow clients in large workflow management systems while still preserving the correctness of the overall execution and allowing coordinated interactions between the different users regardless of their location.
A method for controlling coherence of data elements sharable among a plurality of independently-operating CPCs (central processing complexes) in a multi-system complex (called a parallel sysplex) which contains sysplex DASDs (direct access storage devices) and a high-speed SES (shared electronic storage) facility. Sysplex shared data elements are stored in the sysplex DASD under a unique sysplex data element name, which is used for sysplex coherency control. Any CPC may copy any sysplex data element into a local cache buffer (LCB) in the CPC's main storage, where it has an associated sysplex validity bit. The copying CPC executes a sysplex coherence registration command which requests a SES processor to verify that the data element name already exists in the SES cache, and to store the name of the data element in a SES cache entry if found in the SES cache. Importantly, the registration command communicates to SES the CPC location of the validity bit for the LCB containing that data element copy. Each time another copy of the data element is stored in any CPC LCB, a registration command is executed to store the location of that copy's CPC validity bit into a local cache register (LCR) associated with its data element name. In this manner, each LCR accumulates all CPC locations for all LCB validity bits for all valid copies of the associated data element in the sysplex - for maintaining data coherency throughout the sysplex.
This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2/MVS V4.
A multi-tier indexing method is disclosed for a partitioned table in a parallel or distributed database system. A Local index is created and maintained for each partition of the table and a Coarse Global Index is created and maintained. The Coarse Global Index identifies the indexed partition(s) by partition identifiers (PIDs) and associates the individual Index Key Values with their target partitions so that an access request with a highly partition-selective search predicate on the Index Key can be quickly and easily directed to the target partition(s) for processing. An index maintenance locking protocol is also disclosed which handles the insertion and deletion of index entries and assures the consistency between the Local Index entries and the Global Index entries during concurrent index accesses by different transactions. The locking protocol minimizes locking only to those cases involving an inserted or deleted key and to the key following and possibly the key preceding the inserted or deleted key to allow high concurrency between simultaneous Readers, Inserters, and Deleters. This method enhances the efficiency of complex query evaluation and index maintenance and attains a high throughput for transaction processing.
A computer database system utilizes a method for performing a right outer join of database tables without sorting the inner table (T2). The processing of each tuple in the outer table (T1) includes the preservation in the join output of all tuples in T2 which are in its responsibility region. The initialization step of the process preserves in the join output all of the tuples in T2 which have column set values less than the lowest column set value in T1, i.e. the first tuple in T1, since T1 is sorted or accessed using a sorted index. The responsibility region for tuples in T1, other than the last tuple, is defined as those tuples which have column set values less than the column set value for the next tuple in T1 and greater than or equal to the column set value for the current T1 tuple. The last tuple in T1 must preserve all of the tuples in T2 which have not already been preserved in T2, i.e. all tuples greater than or equal to its column set value. If T1 has duplicate values for the column set value, only the last one preserves the associated T2 tuple. Additional methods for parallel execution of the outer join methods and methods for applying the outer join methods to subqueries (i.e., an All (or universal) Right Join (ARJOIN) and an Existential Right Join (ERJOIN)) are described.
In a combination of multiple concurrently-executing database management systems which share data storage resources, efficient lock processing for shared data is implemented by hiding from a global lock manager the distinction between transaction-interest and cache-interest locks that are processed at the DBMS level. The local lock manager of a DBMS, in response to a request for either type of lock, may issue a request to the global lock manager for a system-level lock without disclosing to the global lock manager the type of lock requested at the local lock manager. After receipt of the system level lock, the local lock manager can grant either transaction or cache interest locks locally on a data resource if the combined mode of locally-held locks on that data resource is greater than or equal to the requested mode.
In a parallel or a distributed database management system, a relation is often horizontally partitioned across multiple nodes. To index a partitioned relation, usually either a global index is maintained for the entire relation, or alternatively, individual local indexes are maintained one for each partition of the relation. The former is costly to maintain because of remote updates and is also costly to use for complex queries, whereas the latter wastes computing resources for highly selective database searches such as those done in the case of transaction processing and is therefore not a scalable solution.
In this paper, a two-tier index is proposed, which consists of local indexes and a coarse global index. This index method is suitable for transaction processing as well as for query processing. It not only is more efficient to maintain and use than the conventional methods, but it also exploits parallelism and is more versatile, scalable, and easier to migrate to for a non-partitioning DBMS. To maintain the consistency between the local indexes and the coarse global index, efficient locking protocols are presented, which are designed to allow a high level of concurrent operation.
In this paper, we first describe R*'s Presumed Abort (PA) commit protocol and SNA's LU 6.2 commit protocol (also called Presumed Nothing (PN)). PA has been widely adopted by different vendors and academic researchers. It is now part of the ISO-OSI and X/Open distributed transaction processing standards. We point out the differences between PA and PN in terms of their features and the underlying models of distributed computations. PA was developed in the context of a distributed data base management system with a restricted model of distributed computation in which there are permanent master-slave relationships amongst the processes of the computation. PN was developed in the context of a more general model of distributed computation with peer to peer relationships amongst the processes. We describe how the two protocols may be merged to benefit from the optimizations present in PA and PN, and to support the more general distributed computation model of PN. This merged protocol is called Generalized Presumed Abort (GPA). GPA has been designed so that it can coexist with the existing PN protocol in the sense that the same transaction may execute in some sites using GPA and in other sites using PN.
GPA is now part of the IBM LU6.2 and DRDA architectures. It has been implemented in DB2 V3.
A computer database system utilizes a method for performing a right outer join of database tables without sorting the inner table (T2). The processing of each tuple in the outer table (T1) includes the preservation in the join output of all tuples in T2 which are in its responsibility region. The initialization step of the process preserves in the join output all of the tuples in T2 which have column set values less than the lowest column set value in T1, i.e. the first tuple in T1, since T1 is sorted or accessed using a sorted index. The responsibility region for tuples in T1, other than the last tuple, is defined as those tuples which have column set values less than the column set value for the next tuple in T1 and greater than or equal to the column set value for the current T1 tuple. The last tuple in T1 must preserve all of the tuples in T2 which have not already been preserved in T2, i.e. all tuples greater than or equal to its column set value. If T1 has duplicate values for the column set value, only the last one preserves the associated T2 tuple. Additional methods for parallel execution of the outer join methods and methods for applying the outer join methods to subqueries (i.e., an All (or universal) Right Join (ARJOIN) and an Existential Right Join (ERJOIN)) are described.
This paper presents general algorithms for concurrency control in tree-based access methods as well as a recovery protocol and a mechanism for ensuring repeatable read. The algorithms are developed in the context of the Generalized Search Tree (GiST) data structure, an index structure supporting an extensible set of queries and data types. Although developed in a GiST context, the algorithms are generally applicable to many tree-based access methods. The concurrency control protocol is based on an extension of the link technique originally developed for B-trees, and completely avoids holding node locks during I/Os. Repeatable read isolation is achieved with a novel combination of predicate locks and two-phase locking of data records. To our knowledge, this is the first time that isolation issues have been addressed outside the context of B-trees. A discussion of the fundamental structural differences between B-trees and more general tree structures like GiSTs explains why the algorithms developed here deviate from their B-tree counterparts. An implementation of GiSTs emulating B-trees in DB2/Common Server is underway.
Alonso, G., Agrawal, D., El Abbadi, A., Mohan, C. Functionalities
and Limitations of Current Workflow Management Systems, Research
Report, IBM Almaden Research Center, 1997.
Workflow systems hold the promise of facilitating the everyday operation
of many enterprises and work environments. As a result, many commercial
workflow management systems have been developed. These systems, although
useful, do not scale well, have limited fault-tolerance, and are
inflexible in terms of interoperating with other workflow systems. In this
paper, we discuss the limitations of contemporary workflow management
systems, and then elaborate on various directions for research and
potential future extensions to the design and modeling of workflow
management systems.
We examine the problems encountered in extending DB2/MVS, an industrial-strength relational data base management system originally designed for a single-system environment, to support the multi-system shared data architecture. The multi-system data sharing function was delivered in DB2 V4R1. DB2 data sharing requires an IBM S/390 Parallel Sysplex environment because DB2's use of the coupling facility technology plays a central role in delivering highly efficient and scalable data sharing functions. We call this the shared data architecture since the coupling facility is a unique feature which is employed with this architecture.
We consider a network of computers consisting of file servers and a Database Management System (DBMS) where a linkage is maintained, with referential integrity, between data in the DBMS and files in the file servers which are external to the DBMS. We present algorithms for performing backup and recovery of the DBMS data in a coordinated fashion with the files on the file servers. When a file is associated (linked) with a record in the DBMS, certain constraints are applied to support referential integrity, access control, and coordinated backup and recovery as if the file is stored in the DBMS. Backup of a referenced file is initiated when the file is linked. The file backup is performed asynchronously to the linking process so that the linking transaction is not delayed. In a typical scenario, when a database backup operation starts, all unfinished file backups are ensured to be completed before the database backup is declared successful. When a database is recovered to a state which includes references to files in one or more file servers, the DBMS ensures that the referenced files are also restored to their correct state in those file servers. However, since database backup and recovery are critical for database availability, the presence of an unavailable file server is tolerated during the database backup and recovery operations. Our algorithms for coordinated backup and recovery have been implemented in the IBM DB2/DataLinks product. The DataLinks concept is also part of the ISO SQL/MED standard.
[LKMPW01] Luo, Q., Krishnamurthy, S., Mohan, C., Pirahesh, H., Woo, H.,
Lindsay, B., Naughton, J. Middle-tier Database Caching
for e-Business, Proc.
ACM SIGMOD International Conference on Management of Data, Madison,
June 2002.
Scaling up to the enormous and growing Internet population with
unpredictable usage patterns, E-commerce applications face severe
challenges in cost and manageability, especially for database servers that
are deployed as those applications’ back-ends in a multi-tier
configuration. Middle-tier database caching is one solution to this
problem. In this paper, we present a simple extension to the existing
federated features in DB2 UDB, which enables a "regular" DB2
instance to become a DBCache without any application
modification. On deployment of a DBCache at an application
server, arbitrary SQL statements generated from the unchanged application
hitherto intended for a back-end database server, can be answered: at the
cache, at the back-end database server, or at both locations in a
distributed manner. The factors that determine the distribution of
workload include the SQL statement type, the cache content, the
application requirement on data freshness, and cost-based optimization at
the cache. We have developed a research prototype of DBCache, and
conducted an extensive set of experiments with an E-Commerce benchmark to
show the benefits of this approach and illustrate tradeoffs in caching
considerations.
[BBHM02] Bhattacharya, S., Brannon, K., Hsiao, H.-I, Mohan, C., Narang, I., Subramanian, M. Coordinating Backup/Recovery and Data Consistency Between Database and File Systems, Proc. ACM SIGMOD International Conference on Management of Data, Madison, June 2002.
Managing a combined store consisting of database data and file data in a robust and consistent manner is a challenge of interest for database systems and content management systems. In such a hybrid system, images, videos, engineering drawings, etc. are stored as files on a file server while meta-data referencing/indexing such files is created and stored in a relational database to take advantage of efficient search. In this paper, we describe solutions for two potentially problematic aspects of such a data management system: backup/recovery and data consistency. We present algorithms for performing backup and recovery of the DBMS data in a coordinated fashion with the files on the file servers. Our algorithms for coordinated backup and recovery have been implemented in the IBM DB2/DataLinks product. We also propose an efficient solution to the problem of maintaining consistency between the content of the file and the associated meta-data stored in the DBMS from a reader's point of view without holding long duration locks on meta-data tables. In the model, an object is directly accessed and edited in-place through normal file system APIs using a reference obtained via an SQL query on the database. To relate file modifications to meta-data updates, the user issues an update through the DBMS, and commits both file and meta-data updates together.
We present a method for
efficiently performing deletions and updates of records when the records
to be deleted or updated are chosen by a range scan on an index. The
traditional method involves numerous unnecessary lock calls and traversals
of the index from root to leaves, especially when the qualifying records'
keys span more than one leaf page of the index. Customers have suffered
performance losses from these inefficiencies and have complained about
them. Our goal was to minimize the number of interactions with the lock
manager, and the number of page fixes, comparison operations and,
possibly, I/Os. Some of our improvements come from increased synergy
between the query planning and data manager components of a DBMS. Our
patented method has been implemented in DB2 V7 to address specific
customer requirements. It has also been done to improve performance on the
TPC-H benchmark.
This method has been implemented in DB2 V7.