
Improving MapReduce Performance on Clusters

Sylvain Gault 1, 2
Abstract : More and more scientific fields rely on data mining to produce new results. The raw data are produced at an increasing rate by tools such as DNA sequencers in biology, the Large Hadron Collider (LHC) in physics, which produced 25 petabytes per year as of 2012, or the Large Synoptic Survey Telescope (LSST), which should produce 30 petabytes of data per night. High-resolution scanners in medical imaging and social networks also produce huge amounts of data. This data deluge raises several challenges in terms of storage and computer processing. In 2004, Google proposed using the MapReduce model to distribute the computation across several computers. This thesis focuses mainly on improving the performance of a MapReduce environment. Easily replacing the software parts needed to improve performance requires a modular and adaptable MapReduce environment, which is why a component-based approach is studied to design such a programming environment. Studying the performance of a MapReduce application requires modeling the platform, the application, and their performance. These models should be precise enough for the algorithms using them to produce meaningful results, yet simple enough to be analyzed. The state of the art of existing models is reviewed and a new model adapted to these needs is defined. To optimize a MapReduce environment, the first approach studied is a global optimization, which reduces the computation time by up to 47 %. The second approach focuses on the shuffle phase of MapReduce, in which every node may send data to every other node. Several algorithms are defined and studied for the case where the network is the bottleneck of the data transfers. These algorithms are tested on the Grid'5000 experimental platform and usually show behavior close to the lower bound, while the trivial approach is far from it.

Cited literature [97 references]
Contributor : Abes Star
Submitted on : Tuesday, April 28, 2015 - 11:47:05 AM
Last modification on : Monday, May 4, 2020 - 11:38:44 AM
Long-term archiving on : Monday, September 14, 2015 - 2:31:01 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01146365, version 1


Sylvain Gault. Improving MapReduce Performance on Clusters. Other [cs.OH]. Ecole normale supérieure de lyon - ENS LYON, 2015. English. ⟨NNT : 2015ENSL0985⟩. ⟨tel-01146365⟩


