|
IBM's Productive, Easy-to-use, Reliable Computing System (PERCS) was selected by the
Defense Advanced Research Projects Agency (DARPA) as one of two system designs to be
developed and demonstrated in Phase III of the High Productivity Computing Systems
(HPCS) program. Such designs must support the eventual scaling of sustained
computation to 10 petaflops and a software environment that enables domain
experts to effectively use that computing power. These capabilities are required
by HPCS to meet the need for commercially successful petascale computing systems
for high-end users in government, science and industry in 2010. HPCS expects such
systems to enable key US advances in weapons analysis, intelligence/surveillance,
reconnaissance, cryptanalysis, airborne contaminant modeling, nuclear stockpile
management and the exploration of new realms in biological and physical sciences.
PERCS will meet these goals with a scalable system based on future POWER series
technologies. The PERCS program will substantially increase research and
development activity in IBM technologies planned for 2010 and beyond. This work
will enable IBM to meet the HPCS goals and enhance the capabilities of IBM's
line of business systems. IBM will make significant investments
in the next generation of the following technologies:
- Power processor technology (POWER7)
- IBM AIX and Linux operating systems
- IBM's Parallel Environment
- HPC software stack
- Software development tools that scale to more than 100,000 processors
- IBM's Interconnect
- IBM's General Parallel File System (GPFS)
- Storage subsystem
Almaden Storage Systems researchers will contribute the GPFS changes and
the storage system design to PERCS. The HPCS requirements for file systems are:
- 1 trillion files in a single file system
- 32,000 file creates per second
- 10,000 metadata operations per second
- Single-node full-duplex streaming I/O bandwidth faster than 30 GB/s
- Support for more than 30K nodes
This is a tall order. The bandwidth requirement for such a system will be several
TB/s. This is more than 10 times the current capability of GPFS, which is the world's fastest
file system. Almaden researchers will enhance GPFS and extend the benchmarking suite
to enable validation. There are further challenges associated with the size of
these systems. There will likely be more than 100,000 disk drives in the storage
system. Given the expected failure rates of disk drives, there will be multiple
failed disks in the system at any given point in time. We will strengthen storage
resiliency through extended RAID algorithms, consolidation of the hardware
and software aspects of storage, and a design focus on full performance in the
presence of failed components. The number of components also exacerbates the
problem of storage management. Manual management of a system of this size would be
impossible. Storage Systems researchers will enhance the scaling, integration and automation
of the storage management tools.
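
To make the claim about concurrently failed disks concrete, the following is a rough
back-of-the-envelope sketch, not a figure from the PERCS design. It applies Little's
law (average number in the system = arrival rate x time in the system) to drive
failures. Only the 100,000-drive count comes from the text above. The 3% annual
failure rate and the 24-hour replace-and-rebuild window are illustrative assumptions,
and the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def expected_failed_disks(num_disks, annual_failure_rate, repair_hours):
    """Estimate the average number of drives that are failed at any instant.

    Little's law: average failed = (failures per hour) * (hours each failure
    stays unrepaired). All inputs other than the drive count are assumptions.
    """
    failures_per_hour = num_disks * annual_failure_rate / HOURS_PER_YEAR
    return failures_per_hour * repair_hours


if __name__ == "__main__":
    # 100,000 drives is the figure from the text; the 3% annual failure rate
    # and 24-hour repair window are assumed values chosen for illustration.
    average = expected_failed_disks(100_000, 0.03, 24.0)
    print(f"Average drives failed at any time: {average:.1f}")
```

Under these assumed values the estimate is roughly eight drives failed at any moment,
which is consistent with the design focus on full performance in the presence of
failed components. Different failure rates or rebuild times scale the result linearly.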
|