Improving MapReduce Performance on Clusters

Sylvain Gault
Abstract : Nowadays, more and more scientific fields rely on data mining to produce new results. The raw data are produced at an increasing rate by tools such as DNA sequencers in biology, the Large Hadron Collider (LHC) in physics, which produced 25 petabytes per year as of 2012, or the Large Synoptic Survey Telescope (LSST), which should produce 30 petabytes of data per night. High-resolution scanners in medical imaging and social networks also produce huge amounts of data. This data deluge raises several challenges in terms of storage and computer processing. In 2004, Google proposed using the MapReduce model to distribute the computation across several computers. This thesis focuses mainly on improving the performance of a MapReduce environment. To make it easy to replace the software parts that affect performance, a modular and adaptable MapReduce environment is needed. This is why a component-based approach is studied for designing such a programming environment. Studying the performance of a MapReduce application requires models of the platform, of the application, and of their performance. These models must be precise enough for the algorithms using them to produce meaningful results, yet simple enough to be analyzed. The existing models are surveyed and a new model adapted to these needs is defined. To optimize a MapReduce environment, the first approach studied is a global optimization, which results in a computation time reduced by up to 47 %. The second approach focuses on the shuffle phase of MapReduce, in which every node may send data to every other node. Several algorithms are defined and studied for the case where the network is the bottleneck of the data transfers. These algorithms are tested on the Grid'5000 experimental platform and usually show a behavior close to the lower bound, while the trivial approach is far from it.
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-01146365
Contributor : Abes Star
Submitted on : Tuesday, April 28, 2015 - 11:47:05 AM
Last modification on : Wednesday, November 20, 2019 - 3:27:41 AM
Long-term archiving on: Monday, September 14, 2015 - 2:31:01 PM

File

GAULT_Sylvain_2015_These.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01146365, version 1

Citation

Sylvain Gault. Improving MapReduce Performance on Clusters. Other [cs.OH]. Ecole normale supérieure de lyon - ENS LYON, 2015. English. ⟨NNT : 2015ENSL0985⟩. ⟨tel-01146365⟩
