Hadoop on GPFS

Project Description

Cloud computing offers a powerful abstraction: a scalable, virtualized infrastructure delivered as a service, in which the complexity of fine-grained resource management is hidden from the end user. Running data analytics applications on extremely large data sets in the cloud is gaining traction because the underlying infrastructure can meet their extreme scalability demands. Typically, these applications, such as business intelligence or surveillance video search, use the MapReduce framework, which decomposes a large computation into a set of smaller parallelizable computations. One of the key infrastructure elements of the cloud stack for data analytics applications is a storage layer designed to support the following features:

The prevailing trend is to build the storage layer using an Internet-scale filesystem such as Google's GFS or one of its numerous clones, including HDFS and Kosmix's KFS. In this project, we revisit the debate on the need for a new non-POSIX storage stack for cloud analytics and argue, based on an initial evaluation, that the storage layer can instead be built on traditional POSIX-based cluster filesystems. Existing deployments of cluster file systems such as Lustre, PVFS, and GPFS show that they can be extremely scalable without being extremely expensive. Commercial cluster file systems can scale to thousands of nodes while supporting 100 GBps sequential throughput. Furthermore, these file systems can be configured using commodity parts for lower cost, without the need for specialized SANs or enterprise-class storage. More importantly, these file systems can support traditional applications that rely on POSIX file APIs, and they provide a rich set of management tools. Since the cloud storage stack may be shared across different classes of applications, it is prudent to rely on standard file interfaces and semantics that can also easily support MapReduce-style applications, instead of being locked in to a particular non-standard interface.
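As a rough illustration of the point about standard file interfaces, the following Python sketch runs a MapReduce-style word count using only ordinary file calls (open, seek, read). It is not the Hadoop API and not this project's implementation: the function names `map_split` and `word_count`, the 64 MB default split size, and the use of local threads as a stand-in for cluster nodes are all assumptions made for the example.

```python
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_split(path, start, end):
    """Map step: count words in the lines that begin inside [start, end)."""
    counts = Counter()
    with open(path, "rb") as f:
        if start > 0:
            # Back up one byte and read to the next newline, so that we
            # begin at the first line starting at or after `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            counts.update(line.decode("utf-8", errors="replace").split())
    return counts


def word_count(path, split_size=64 * 1024 * 1024, workers=4):
    """Split the file into byte ranges, map each range, then reduce."""
    size = os.path.getsize(path)
    splits = [(s, min(s + split_size, size)) for s in range(0, size, split_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads stand in for worker nodes (assumption for this sketch).
        for partial in pool.map(lambda se: map_split(path, *se), splits):
            total.update(partial)  # Reduce step: merge partial counts.
    return total
```

Each split is defined purely by a byte offset range, so any file system that exposes seek-and-read semantics could serve the map tasks in a sketch like this; a line that straddles a split boundary is counted by the split in which it begins.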

Selected Publications

  • Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta, "A Simulation Approach to Evaluating Design Decisions in MapReduce Setups", in Proceedings of the 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems, September 2009.
  • Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari, "Cloud analytics: Do we really need to reinvent the storage stack?", in Proceedings of the 1st USENIX Workshop on Hot Topics in Cloud Computing, June 2009.
  • Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta, "Using realistic simulation for performance analysis of mapreduce setups", in Proceedings of the 1st ACM workshop on Large-Scale system and application performance, June 2009.

People

  • Prasenjit Sarkar
  • Karan Gupta
  • Reshu Jain
  • Himabindu Pucha
  • Dinesh Subhraveti