Optimizing data management for MapReduce applications on large-scale distributed infrastructures

Diana Maria Moise 1
1 KerData - Scalable Storage for Clouds and Beyond
Inria Rennes – Bretagne Atlantique, IRISA-D1 - SYSTÈMES LARGE ÉCHELLE
Abstract : Data-intensive applications are now widely used in many domains to extract and process information, to design complex systems, to perform simulations of real models, etc. These applications place challenging requirements on both storage and computation. Specialized abstractions such as Google's MapReduce were developed to manage the workloads of data-intensive applications efficiently. The MapReduce abstraction has revolutionized the data-intensive community and has rapidly spread to various research and production areas. An open-source implementation of Google's abstraction was provided by Yahoo! through the Hadoop project. This framework is considered the reference MapReduce implementation and is currently heavily used for various purposes and on several infrastructures. To achieve high-performance MapReduce processing, we propose a concurrency-optimized file system for MapReduce frameworks. As a starting point, we rely on BlobSeer, a framework designed to address the challenge of efficiently storing data generated by data-intensive applications running at large scale. We have built the BlobSeer File System (BSFS) with the goal of providing high throughput under heavy concurrency to MapReduce applications. We also study several aspects of intermediate data management in MapReduce frameworks. We investigate the requirements of MapReduce intermediate data at two levels: within a single job, and during the execution of pipeline applications. Finally, we show how BSFS can enable extensions to the de facto MapReduce implementation, Hadoop, such as support for the append operation. This work also includes an evaluation, with results obtained in grid and cloud environments.
Document type :
Theses
Other [cs.OH]. École normale supérieure de Cachan - ENS Cachan, 2011. English. <NNT : 2011DENS0067>


https://tel.archives-ouvertes.fr/tel-00653622
Contributor : Abes Star
Submitted on : Thursday, May 10, 2012 - 4:03:00 PM
Last modification on : Thursday, May 14, 2015 - 1:02:07 AM

File

Moise2011.pdf

Identifiers

  • HAL Id : tel-00653622, version 3

Citation

Diana Maria Moise. Optimizing data management for MapReduce applications on large-scale distributed infrastructures. Other [cs.OH]. École normale supérieure de Cachan - ENS Cachan, 2011. English. <NNT : 2011DENS0067>. <tel-00653622v3>

Metrics

Record views : 639

Document downloads : 403