A new method for fault-tolerant routing in arbitrary dimensional meshes is introduced. The method was motivated by certain routing requirements of an initial design of the Blue Gene supercomputer project currently underway in IBM Research. The machine is planned to be organized as a 3-dimensional mesh containing many thousands of nodes. Among the requirements were to provide deterministic deadlock-free wormhole routing in a 3-dimensional mesh, in the presence of many faults (up to a few percent of the number of nodes in the machine), while using two virtual channels. It was also desired to minimize the number of "turns" in each route, i.e., the number of times that the route changes direction. There has been much work on routing methods for meshes that route messages around faults or regions of faults. The new method is to declare certain good nodes to be "lambs"; a lamb is used for routing but not processing, so a lamb is neither the source nor the destination of a message. The lambs are chosen so that every "survivor node", a node that is neither faulty nor a lamb, can reach every survivor node by at most two rounds of dimension-ordered (such as e-cube) routing. An algorithm for finding a set of lambs is presented. The results of simulations on 2D and 3D meshes of various sizes with various numbers of random node faults are given. For example, on a 32-by-32-by-32 3D mesh with 3% random faults, and using two rounds of e-cube routing for each message, the average number of lambs is less than 68, which is less than 7% of the number 983 of faults. The computational complexity of finding the minimum number of lambs for a given fault set is also explored, and this problem is shown to be NP-hard for 3-dimensional meshes with two rounds of e-cube routing.
We construct 2-round (i.e., 2-message), public-coin, black-box (concurrent) zero-knowledge proof systems and arguments for any language in NP under the assumption that the prover is resource-bounded during the execution of the protocol.
The goal of this paper is to establish links between computational complexity theory and the theory and practice of constrained block coding. In particular, the complexities of several fundamental problems in constrained block coding are precisely classified in terms of the existing complexity-theoretic structure. One type of problem studied is that of designing encoder and decoder circuits using minimum or approximately minimum hardware; for our purposes, an "input" to this problem is (i) a deterministic, irreducible finite state transition diagram (abbreviated DIF) defining a set of constrained binary sequences, and (ii) a desired rate p:q. Several of these minimum-encoder and minimum-decoder problems are shown to be NP-hard, and more interestingly some are shown to be complete in the second and third levels of the polynomial-time hierarchy. Another fundamental problem is that of computing the maximum rate of a block code; that is, given a DIF and a codeword length q, find the maximum p such that a rate p:q block code exists for the constraint defined by the DIF. This problem is shown to be {{NP}^{#P}}-complete. Although it is not known whether {NP}^{#P} contains problems of super-polynomial complexity, it lies "higher" in the complexity-class structure than NP in the sense that it is possible, given current knowledge, that {NP}^{#P} contains problems of super-polynomial complexity even if P = NP. Another question studied is whether maximum rate block codes can always be implemented by encoders and decoders of polynomial size. The answer to this question is shown to be closely related to whether the class #P lies "lower" in the complexity-class structure than currently believed -- a proof of either answer to this question would have major implications in complexity theory.
The purpose of this paper is to report results of simulations of two algorithms for free space collection in log-structured storage systems. The algorithms considered are the age-threshold algorithm of Menon and Stockmeyer and the fitness algorithm of Butterworth. The simulations were done using a trace collected by Ruemmler and Wilkes from a file system over a period of two months. The performance of an algorithm is measured by the amount of disk I/O done as a result of free space collection. The performance of the algorithms and several variations of them are compared.
We prove that three apparently unrelated fundamental problems in distributed computing, cryptography, and complexity theory, are essentially the same problem. These three problems and brief descriptions of them follow.
An investigation of interactive proof systems (IPS's) where the verifier is a 2-way probabilistic finite state automaton (2pfa) is initiated. In this model it is shown:
The zero knowledge properties of interactive proof systems (IPS's) are studied in the case that the verifier is a 2-way probabilistic finite state automaton (2pfa). The following results are proved:
The subject of this paper is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains log n bits), as compared to algorithms that run in O(n) time and O(n) space or in O(n²) time and O(1) space. We introduce two new techniques for differential compression and apply these to give additional algorithms that improve compression and time performance. We experimentally explore the properties of our algorithms by running them on actual versioned data. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
It is a well-known result of Fagin that the complexity class NP coincides with the class of problems expressible in existential second-order logic (Sigma-1-1), which allows sentences consisting of a string of existential second-order quantifiers followed by a first-order formula. Monadic NP is the class of problems expressible in monadic Sigma-1-1, i.e., Sigma-1-1 with the restriction that the second-order quantifiers are all unary, and hence range only over sets (as opposed to ranging over, say, binary relations). For example, the property of a graph being 3-colorable belongs to monadic NP, because 3-colorability can be expressed by saying that there exists three sets of vertices such that each vertex is in exactly one of the sets and no two vertices in the same set are connected by an edge. Unfortunately, monadic NP is not a robust class, in that it is not closed under first-order quantification. We define closed monadic NP to be the closure of monadic NP under first-order quantification and existential unary second-order quantification. Thus, closed monadic NP differs from monadic NP in that we allow the possibility of arbitrary interleavings of first-order quantifiers among the existential unary second-order quantifiers. We show that closed monadic NP is a natural, rich, and robust subclass of NP. As evidence for its richness, we show that not only is it a proper extension of monadic NP, but that it contains properties not in various other extensions of monadic NP. In particular, we show that closed monadic NP contains an undirected graph property not in the closure of monadic NP under first-order quantification and Boolean operations. Our lower-bound proofs require a number of new game-theoretic techniques.
This paper investigates the placement of data and parity on redundant disk arrays. Declustered organizations have been traditionally used to achieve fast reconstruction of a failed disk's contents. In previous work, Holland and Gibson identified six desirable properties for ideal layouts; however, no declustered layout satisfying all properties has been published in the literature. We present a complete, constructive characterization of the collection of ideal declustered layouts possessing all six properties. Given that ideal layouts exist only for a limited set of configurations, we also present two novel layout families. PRIME and RELPR can tolerate multiple failures in a wide variety of configurations with slight deviations from the ideal. Our simulation studies show that the new layouts provide excellent parallel access performance and reduced incremental loads during degraded operation, when compared with previously published layouts. For large accesses and under high loads, response times for the new layouts are typically smaller than those of previously published declustered layouts by a factor of 2.5.
In this paper, we propose and study a new algorithm for choosing segments for garbage collection in Log-Structured File Systems (LFS) and Log-Structured Arrays (LSA). We compare the performance of our new algorithm against previously known algorithms such as greedy and cost-benefit through simulation. The basic idea of our algorithm is that segments which have been recently filled by writes from the system should be forced to wait for a certain amount of time (the age-threshold) before they are allowed to become candidates for garbage collection. The expectation is that if the age-threshold is properly chosen, segments that have reached the age-threshold are unlikely to get significantly emptier due to future rewrites. Among segments that pass the age-threshold and become candidates for garbage collection, we select ones that will yield the most amount of free space. We show, through simulation, that our age-threshold algorithm is more efficient at garbage collection (produces more free space per garbage-collected segment) than greedy or cost-benefit; this means that designs using age-threshold will give better system performance than designs using greedy or cost-benefit. It is also simpler to implement a scalable version of the age-threshold algorithm than to implement a scalable version of the cost-benefit algorithm. The performance of the age-threshold algorithm depends on good choice of an age-threshold; therefore, we also give an analysis which can be used to choose an optimal age-threshold under certain workload assumptions. We also suggest how to choose good age-thresholds when nothing is known about the workload.
Upper and lower bounds are proved for the time complexity of the problem of reaching agreement in a distributed network in the presence of process failures and inexact information about time. It is assumed that the amount of (real) time between any two consecutive steps of any nonfaulty process is at least c_1 and at most c_2; thus, C = c_2/c_1 is a measure of the timing uncertainty. It is also assumed that the time for message delivery is at most d. Processes are assumed to fail by stopping, so that process failures can be detected by timeouts.
A straightforward adaptation of an (f+1)-round round-based agreement algorithm takes time (f+1)Cd if there are f potential faults, while a straightforward modification of the proof that f+1 rounds are required yields a lower bound of time (f+1)d. The first result of this paper is an agreement algorithm in which the uncertainty factor C is only incurred for one round, yielding a running time of approximately 2fd + Cd in the worst case. (It is assumed that c_2 << d.) The second result shows that any agreement algorithm must take time at least (f-1)d + Cd in the worst case.
The new agreement algorithm can also be applied in a model where processors are synchronous (C=1), and where message delay during a particular execution of the algorithm is bounded above by a quantity delta which could be smaller than the worst-case upper bound d. The running time in this case is approximately (2f-1)delta + d.
The concept of partial synchrony in a distributed system is introduced. Partial synchrony lies between the cases of a synchronous system and an asynchronous system. In a synchronous system, there is a known fixed upper bound Delta on the time required for a message to be sent from one processor to another and a known fixed upper bound Phi on the relative speeds of different processors. In an asynchronous system no fixed upper bounds Delta and Phi exist. In one version of partial synchrony, fixed bounds Delta and Phi exist, but they are not known a priori. The problem is to design protocols that work correctly in the partially synchronous system regardless of the actual values of the bounds Delta and Phi. In another version of partial synchrony, the bounds are known, but are only guaranteed to hold starting at some unknown time T, and protocols must be designed to work correctly regardless of when time T occurs. Fault-tolerant consensus protocols are given for various cases of partial synchrony and various fault models. Lower bounds that show in most cases that our protocols are optimal with respect to the number of faults tolerated are also given. Our consensus protocols for partially synchronous processors use new protocols for fault-tolerant "distributed clocks" that allow partially synchronous processors to reach some approximately common notion of time.
Reaching agreement is a primitive of distributed computing. Whereas this poses no problem in an ideal, failure-free environment, it imposes certain constraints on the capabilities of an actual system: A system is viable only if it permits the existence of consensus protocols tolerant to some number of failures. Fischer et al. have shown that in a completely asynchronous model, even one failure cannot be tolerated. In this paper their work is extended: Several critical system parameters, including various synchrony conditions, are identified and how varying these affects the number of faults that can be tolerated is examined. The proofs expose general heuristic principles that explain why consensus is possible in certain models but not possible in others.
It is shown that if a 2-way probabilistic finite state automaton (2pfa) M recognizes a nonregular language L with error probability bounded below 1/2, then there is a positive constant b (depending on M) such that, for infinitely many inputs x, the expected running time of M on input x must exceed 2^{n^b} where n is the length of x. This complements a result of Freivalds showing that 2pfa's can recognize certain nonregular languages in exponential expected time. It also establishes a time complexity gap for 2pfa's, since any regular language can be recognized by some 2pfa in linear time. Other results give roughly exponential upper and lower bounds on the worst-case increase in the number of states when converting a polynomial-time 2pfa to an equivalent 2-way nondeterministic finite-state automaton or to an equivalent 1-way deterministic finite-state automaton.