
Optimizing data management for MapReduce applications on large-scale distributed infrastructures

Diana Maria Moise 1
1 KerData - Scalable Storage for Clouds and Beyond
Inria Rennes – Bretagne Atlantique, IRISA-D1 - SYSTÈMES LARGE ÉCHELLE
Abstract : Data-intensive applications are now widely used in various domains to extract and process information, design complex systems, perform simulations of real models, and so on. These applications exhibit challenging requirements in terms of both storage and computation. Specialized abstractions like Google's MapReduce were developed to efficiently manage the workloads of data-intensive applications. The MapReduce abstraction has revolutionized the data-intensive community and has rapidly spread to various research and production areas. An open-source implementation of Google's abstraction was provided by Yahoo! through the Hadoop project. This framework is considered the reference MapReduce implementation and is currently heavily used for various purposes and on several infrastructures. To achieve high-performance MapReduce processing, we propose a concurrency-optimized file system for MapReduce frameworks. As a starting point, we rely on BlobSeer, a framework designed to address the challenge of efficiently storing data generated by data-intensive applications running at large scales. We have built the BlobSeer File System (BSFS), with the goal of providing high throughput under heavy concurrency to MapReduce applications. We also study several aspects related to intermediate data management in MapReduce frameworks. We investigate the requirements of MapReduce intermediate data at two levels: inside the same job, and during the execution of pipeline applications. Finally, we show how BSFS can enable extensions to the de facto MapReduce implementation, Hadoop, such as support for the append operation. This work also includes an evaluation of these contributions, with results obtained in grid and cloud environments.

Contributor : Abes Star
Submitted on : Thursday, May 10, 2012 - 4:03:00 PM
Last modification on : Tuesday, June 15, 2021 - 4:13:11 PM
Long-term archiving on : Saturday, August 11, 2012 - 2:36:16 AM


Version validated by the jury (STAR)


  • HAL Id : tel-00653622, version 3


Diana Maria Moise. Optimizing data management for MapReduce applications on large-scale distributed infrastructures. Other [cs.OH]. École normale supérieure de Cachan - ENS Cachan, 2011. English. ⟨NNT : 2011DENS0067⟩. ⟨tel-00653622v3⟩


