Modelling and executing multidimensional data analysis applications over distributed architectures.

Abstract : Along with the development of hardware and software, more and more data is generated at a rate much faster than ever. Processing large volume of data is becoming a challenge for data analysis software. Additionally, short response time requirement is demanded by interactive operational data analysis tools. For addressing these issues, people look for solutions based on parallel computing. Traditional approaches rely on expensive high-performing hardware, like supercomputers. Another approach using commodity hardware has been less investigated. In this thesis, we are aiming to utilize commodity hardware to resolve these issues. We propose to utilize a parallel programming model issued from Cloud Computing, MapReduce, to parallelize multidimensional analytical query processing for benefit its good scalability and fault-tolerance mechanisms. In this work, we first revisit the existing techniques for optimizing multidimensional data analysis query, including pre-computing, indexing, data partitioning, and query processing parallelism. Then, we study the MapReduce model in detail. The basic idea of MapReduce and the extended MapCombineReduce model are presented. Especially, we analyse the communication cost of a MapReduce procedure. After presenting the data storage works with MapReduce, we discuss the features of data management applications suitable for Cloud Computing, and the utilization of MapReduce for data analysis applications in existing work. Next, we focus on the MapReduce-based parallelization for Multiple Group-by query, a typical query used in multidimensional data exploration. We present the MapReduce-based initial implementation and a MapCombineReduce-based optimization. According to the experimental results, our optimized version shows a better speed-up and a better scalability than the other version. We also give formal execution time estimation for both the initial implementation and the optimized one. In order to further optimize the processing of Multiple Group-by query processing, a data restructure phase is proposed to optimize individual job execution. We redesign the organization of data storage. We apply, data partitioning, inverted index and data compressing techniques, during data restructure phase. We redefine the MapReduce job's calculations, and job scheduling relying on the new data structure. Based on a measurement of execution time we give a formal estimation. We find performance impacting factors, including query selectivity, concurrently running mapper number on one node, hitting data distribution, intermediate output size, adopted serialization algorithms, network status, whether using combiner or not as well as the data partitioning methods. We give an estimation model for the query processing's execution time, and specifically estimated the values of various parameters for data horizontal partitioning-based query processing. In order to support more flexible distinct-value-wise job-scheduling, we design a new compressed data structure, which works with vertical partition. It allows the aggregations over one certain distinct value to be performed within one continuous process.
Document type :
Engineering Sciences [physics]. Ecole Centrale Paris, 2010. English. <NNT : 2010ECAP0040>
Contributor : Abes Star <>
Submitted on : Wednesday, March 23, 2011 - 10:26:23 AM
Last modification on : Friday, August 19, 2011 - 2:37:14 PM
Document(s) archivé(s) le : Thursday, November 8, 2012 - 12:21:16 PM




  • HAL Id : tel-00579125, version 1



Jie Pan. Modelling and executing multidimensional data analysis applications over distributed architectures.. Engineering Sciences [physics]. Ecole Centrale Paris, 2010. English. <NNT : 2010ECAP0040>. <tel-00579125>




Notice views


Document downloads