Almaden Institute 2002


Almaden Institute
   Recovery Oriented Computing: Does an Autonomic System Need a Doctor?

Abstract:

It is time to broaden our performance-dominated research agenda. Four orders of magnitude increase in performance in the last 20 years means that few outside the CS&E research community believe that speed is the only problem of computer hardware and software. Current systems crash and freeze so frequently that people become violent, and fast but flaky should not be our legacy for the 21st century.

Recovery Oriented Computing (ROC) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. By concentrating on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF), ROC reduces time to recover from these facts and thus offers higher availability. Since a large portion of system administration is dealing with failures, ROC may also reduce total cost of ownership. One to two orders of magnitude reduction in cost over the last 20 years means that the purchase price of hardware and software is now a small part of the total cost of ownership.
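
The link between MTTR and availability can be illustrated with the standard steady-state definition, availability = MTTF / (MTTF + MTTR). The abstract does not state this formula or any figures, so the sketch below is only an illustration under that assumption; the function name and the hour values are hypothetical and chosen for easy arithmetic.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes the common definition MTTF / (MTTF + MTTR).
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, not data from the talk.
baseline = availability(mttf_hours=1000.0, mttr_hours=10.0)
longer_mttf = availability(mttf_hours=10000.0, mttr_hours=10.0)   # fail 10x less often
shorter_mttr = availability(mttf_hours=1000.0, mttr_hours=1.0)    # recover 10x faster

print(f"baseline:         {baseline:.6f}")
print(f"10x longer MTTF:  {longer_mttf:.6f}")
print(f"10x shorter MTTR: {shorter_mttr:.6f}")
```

Under this definition, cutting MTTR by a factor of ten yields the same availability as stretching MTTF by a factor of ten, which is one way to read the argument for focusing on repair time.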

In addition to giving the motivation and definition of ROC, we introduce quantitative failure data for Internet sites and the public telephone system, which suggest that operator error is a leading cause of outages. We also present results of testing five ROC techniques: partitioning, fault insertion, reversible systems, automatic diagnosis aid, and defense in depth.

Many of these ideas are inspired by work in other fields, from analysis of disasters like Three Mile Island to observations about diplomacy in the Mideast Conflict. We also describe the Automation Irony, which suggests that automation can make things worse for operators. (Hence, we think Autonomic Computing is an exciting goal, but there may still be a need for doctors for many years to come.)

If our field embraces availability and maintainability, systems of the future may compete on recovery performance rather than just SPEC or TPC performance, and on total cost of ownership rather than just system price. Such a change may restore our pride in the architectures and operating systems we craft.

 David A. Patterson - Bio
David A. Patterson:
Professor, Computer Science Division,
University of California at Berkeley

patterson@cs.berkeley.edu

Web sites:
http://www.cs.berkeley.edu/~pattrsn/

David Patterson joined the faculty at the University of California at Berkeley in 1977, where he now holds the Pardee Chair of Computer Science. He is a member of the National Academy of Engineering and is a fellow of both the ACM (Association for Computing Machinery) and the IEEE (Institute of Electrical and Electronics Engineers).

He led the design and implementation of RISC I, likely the first VLSI Reduced Instruction Set Computer. This research became the foundation of the SPARC architecture, used by Sun Microsystems and others. He was a leader, along with Randy Katz, of the Redundant Arrays of Inexpensive Disks project (or RAID), which led to reliable storage systems from many companies. He is co-author of five books, including two with John Hennessy, who is now President of Stanford University. Patterson has been chair of the CS division at Berkeley, the ACM SIG in computer architecture, and the Computing Research Association.

His teaching has been honored by the ACM, the IEEE, and the University of California. Patterson shared the 1999 IEEE Reynold Johnson Information Storage Award with Randy Katz for the development of RAID and shared the 2000 IEEE von Neumann Medal with John Hennessy for "creating a revolution in computer architecture through their exploration, popularization, and commercialization of architectural innovations."

  