Partitionnement dans les systèmes de gestion de données parallèles (Partitioning in Parallel Data Management Systems)

Miguel Liroz-Gistau 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : In recent years, the volume of data that is captured and generated has exploded. Advances in computer technologies, which provide cheap storage and increased computing capabilities, have allowed organizations to perform complex analyses on this data and to extract valuable knowledge from it. This trend has been important not only for industry but also for science, where enhanced instruments and more complex simulations call for efficient management of huge quantities of data.

Parallel computing is a fundamental technique for managing large quantities of data, as it leverages the concurrent use of multiple computing resources. To take advantage of parallel computing, we need efficient data partitioning techniques, which are in charge of dividing the whole dataset and assigning the partitions to the processing nodes. Data partitioning is a complex problem, as it has to balance different and often conflicting goals, such as data locality, load balancing, and maximizing parallelism.

In this thesis, we study the problem of data partitioning, particularly in continuously growing scientific parallel databases and in the MapReduce framework.

In the case of scientific databases, we consider data partitioning in very large databases to which new data is appended continuously, e.g. in astronomical applications. Existing approaches are limited, since the complexity of the workload and the continuous appends restrict the applicability of traditional techniques. We propose two partitioning algorithms that dynamically partition new data elements using a technique based on data affinity. Our algorithms obtain very good data partitions with a low execution time compared to traditional approaches.

We also study how to improve the performance of the MapReduce framework using data partitioning techniques. In particular, we are interested in efficiently partitioning the input datasets to reduce the amount of data that has to be transferred in the shuffle phase. We design and implement a strategy that, by capturing the relationships between input tuples and intermediate keys, obtains an efficient partitioning that can significantly reduce MapReduce's communication overhead.
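As a rough illustration of the affinity idea mentioned in the abstract, the following minimal sketch places each newly appended item in the fragment whose existing items it is most often queried with. It is not the thesis's algorithm: the affinity measure (co-access counts over a known query log), the fixed `capacity` per fragment, and the names `affinity` and `place` are all assumptions made for this example.

```python
def affinity(item, fragment, queries):
    """Hypothetical affinity: co-accesses between `item` and items in `fragment`."""
    return sum(len(q & fragment) for q in queries if item in q)


def place(item, fragments, queries, capacity):
    """Greedily add `item` to the non-full fragment with the highest affinity.

    Ties are broken in favour of the smaller fragment, a crude form of
    load balancing. Returns the index of the chosen fragment.
    """
    candidates = [i for i, f in enumerate(fragments) if len(f) < capacity]
    if not candidates:
        raise ValueError("all fragments are full")
    best = max(
        candidates,
        key=lambda i: (affinity(item, fragments[i], queries), -len(fragments[i])),
    )
    fragments[best].add(item)
    return best


# Assumed workload: each query is the set of item ids it reads.
queries = [{"a", "b"}, {"a", "b", "c"}, {"d", "e"}]
fragments = [set(), set()]

# Items arrive one at a time, as in a continuously growing database.
for new_item in ["a", "d", "b", "e", "c"]:
    place(new_item, fragments, queries, capacity=3)

print(fragments)
```

Items that are queried together end up in the same fragment, which favours data locality, while the capacity bound and tie-breaking keep fragment sizes from diverging. A real system would also have to weigh this against parallelism, which this sketch ignores.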
Document type :
Theses
Cited literature : 80 references

https://tel.archives-ouvertes.fr/tel-01023039
Contributor : Abes Star
Submitted on : Friday, July 11, 2014 - 2:07:09 PM
Last modification on : Monday, June 17, 2019 - 6:04:03 PM
Long-term archiving on: Saturday, October 11, 2014 - 12:35:10 PM

File

36639_LIROZ_2013_archivage_cor...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01023039, version 1

Citation

Miguel Liroz-Gistau. Partitionnement dans les systèmes de gestion de données parallèles. Base de données [cs.DB]. Université Montpellier II - Sciences et Techniques du Languedoc, 2013. Français. ⟨NNT : 2013MON20117⟩. ⟨tel-01023039⟩
