Recovery Oriented Computing: Does an Autonomic System
Need a Doctor?
Abstract:
It is time to broaden our performance-dominated research agenda. Four
orders of magnitude increase in performance in the last 20 years means
that few outside the CS&E research community believe that speed is
the only problem of computer hardware and software. Current systems
crash and freeze so frequently that people become violent, and fast
but flaky should not be our legacy for the 21st century.
Recovery Oriented Computing (ROC) takes the perspective that hardware faults,
software bugs, and operator errors are facts to be coped with, not
problems to be solved. By concentrating on Mean Time to Repair (MTTR)
rather than Mean Time to Failure (MTTF), ROC reduces time to recover
from these facts and thus offers higher availability. Since a large
portion of system administration is dealing with failures, ROC may
also reduce total cost of ownership. One to two orders of magnitude
reduction in cost over the last 20 years means that the purchase
price of hardware and software is now a small part of the total
cost of ownership.
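The MTTR argument above can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below is only an illustration: the function name and the failure and repair figures are hypothetical, not data from the talk. It shows that cutting repair time tenfold buys the same availability as making failures ten times rarer.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: one failure per 1000 hours, 10 hours to repair.
baseline = availability(1000.0, 10.0)

# Option A: make failures ten times rarer (raise MTTF).
longer_mttf = availability(10000.0, 10.0)

# Option B: recover ten times faster (lower MTTR), the ROC emphasis.
shorter_mttr = availability(1000.0, 1.0)

print(f"baseline:     {baseline:.6f}")
print(f"10x MTTF:     {longer_mttf:.6f}")
print(f"1/10 MTTR:    {shorter_mttr:.6f}")
```

Under these assumed numbers both options give the same availability, so the choice comes down to which is cheaper to achieve in practice.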
In addition to giving the motivation and definition of ROC, we introduce
quantitative failure data for Internet sites and the public telephone
system, which suggest that operator error is a leading cause of
outages. We also present results of testing five ROC techniques:
partitioning, fault insertion, reversible systems, automatic diagnosis
aid, and defense in depth.
Many of these ideas are inspired by work in other fields, from analysis
of disasters like Three Mile Island to observations about diplomacy
in the Mideast Conflict. We also describe the Automation Irony,
which suggests that automation can make things worse for operators.
(Hence, we think Autonomic Computing is an exciting goal, but there
may still be a need for doctors for many years to come.)
If our field embraces availability and maintainability, systems of
the future may compete on recovery performance rather than just
SPEC or TPC performance, and on total cost of ownership rather than
just system price. Such a change may restore our pride in the architectures
and operating systems we craft.
David Patterson joined
the faculty at the University of California at Berkeley in 1977,
where he now holds the Pardee Chair of Computer Science. He is a
member of the National Academy of Engineering and is a fellow of
both the ACM (Association for Computing Machinery) and the IEEE
(Institute of Electrical and Electronics Engineers).
He led the design and implementation of RISC
I, likely the first VLSI Reduced Instruction Set Computer. This
research became the foundation of the SPARC architecture, used
by Sun Microsystems and others. He was a leader, along with Randy
Katz, of the Redundant Arrays of Inexpensive Disks project (or
RAID), which led to reliable storage systems from many companies.
He is co-author of five books, including two with John Hennessy,
who is now President of Stanford University. Patterson has been
chair of the CS division at Berkeley, the ACM SIG in computer
architecture, and the Computing Research Association.
His teaching has been honored by the ACM,
the IEEE, and the University of California. Patterson shared the
1999 IEEE Reynold B. Johnson Information Storage Award with Randy
Katz for the development of RAID and shared the 2000 IEEE John von
Neumann Medal with John Hennessy for “creating a revolution in
computer architecture through their exploration, popularization,
and commercialization of architectural innovations.”