Skip to Main content Skip to Navigation

Efficient techniques for large-scale Web data management

Jesus Camacho Rodriguez 1, 2
1 OAK - Database optimizations and architectures for complex large data
Inria Saclay - Ile de France, LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure, that may be used for data storage and processing.In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the MapReduce model very well known by now, and continuing through other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need for higher-level languages that ease the burden of writing complex queries for these systems arises.This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively.First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users support monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated to this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reducing query evaluation time, but also, importantly, to reducing the monetary costs associated with the exploitation of the cloud-based warehouse.Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data.Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share its result. We bring several extensions to the algorithm to improve its performance. Our experiment results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
Document type :
Complete list of metadatas

Cited literature [140 references]  Display  Hide  Download
Contributor : Abes Star :  Contact
Submitted on : Monday, December 1, 2014 - 4:05:38 PM
Last modification on : Monday, October 19, 2020 - 11:01:14 AM
Long-term archiving on: : Saturday, April 15, 2017 - 12:26:24 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01089388, version 1



Jesus Camacho Rodriguez. Efficient techniques for large-scale Web data management. Databases [cs.DB]. Université Paris Sud - Paris XI, 2014. English. ⟨NNT : 2014PA112229⟩. ⟨tel-01089388⟩



Record views


Files downloads