Optimizing Communication Cost in Distributed Query Processing

Abstract : In this thesis, we take a complementary look at the problem of optimizing the time for communicating query results in distributed query processing, by investigating the relationship between the communication time and the middleware configuration. Indeed, the middleware determines, among other things, how data is divided into batches and messages before being communicated over the network. Concretely, we focus on the following research question: given a query Q and a network environment, what is the middleware configuration that minimizes the time for transferring the query result over the network? To the best of our knowledge, the database research community does not have well-established strategies for middleware tuning. We first present an intensive experimental study that emphasizes the crucial impact of the middleware configuration on the time for communicating query results. We focus on two middleware parameters that we empirically identified as having an important influence on the communication time: (i) the fetch size F (i.e., the number of tuples in a batch that is communicated at once to an application consuming the data) and (ii) the message size M (i.e., the size in bytes of the middleware buffer, which corresponds to the amount of data that can be communicated at once from the middleware to the network layer; a batch of F tuples can be communicated via one or several messages of M bytes). Then, we describe a cost model for estimating the communication time, which is based on how data is communicated between computation nodes. More precisely, our cost model is based on two crucial observations: (i) batches and messages are communicated differently over the network: batches are communicated synchronously, whereas the messages within a batch are communicated in a pipelined fashion (asynchronously), and (ii) due to network latency, it is more expensive to communicate the first message in a batch than any subsequent message in the same batch.
We propose an effective strategy for calibrating the network-dependent parameters of the communication time estimation function, i.e., the cost of the first message in a batch and the cost of each subsequent message in that batch. Finally, we develop an optimization algorithm that effectively computes the values of the middleware parameters F and M that minimize the communication time. The proposed algorithm quickly finds (in a small fraction of a second) values of F and M that achieve a good trade-off between low resource consumption and low communication time. The proposed approach has been evaluated using a dataset from an astronomy application.
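The two observations behind the cost model, and the search for F and M, can be illustrated with a minimal sketch. It is not the estimation function or the algorithm from the thesis: the linear message costs (`latency + M / bandwidth` for the first message of a batch, `M / bandwidth` for each later one), the resource proxy `F * tuple_bytes + M`, the 5% tolerance, and all function and parameter names are assumptions made only for illustration.

```python
import math


def batch_time(n_tuples_in_batch, tuple_bytes, message_size, latency, bandwidth):
    """Estimated time to ship one batch (assumed linear message costs)."""
    n_messages = max(1, math.ceil(n_tuples_in_batch * tuple_bytes / message_size))
    cost_first = latency + message_size / bandwidth  # first message pays the latency
    cost_next = message_size / bandwidth             # pipelined messages do not
    return cost_first + (n_messages - 1) * cost_next


def estimate_time(n_tuples, tuple_bytes, fetch_size, message_size, latency, bandwidth):
    """Batches are synchronous, so their estimated times simply add up."""
    full_batches, remainder = divmod(n_tuples, fetch_size)
    total = full_batches * batch_time(fetch_size, tuple_bytes, message_size,
                                      latency, bandwidth)
    if remainder:
        total += batch_time(remainder, tuple_bytes, message_size, latency, bandwidth)
    return total


def pick_configuration(n_tuples, tuple_bytes, fetch_sizes, message_sizes,
                       latency, bandwidth, tolerance=0.05):
    """Exhaustive search over candidate (F, M) pairs.

    Among pairs whose estimated time is within `tolerance` of the best one,
    return the pair with the smallest buffer footprint (hypothetical proxy
    for resource consumption), as a tuple (F, M, estimated_time).
    """
    scored = [(estimate_time(n_tuples, tuple_bytes, f, m, latency, bandwidth), f, m)
              for f in fetch_sizes for m in message_sizes]
    best = min(t for t, _, _ in scored)
    acceptable = [(f * tuple_bytes + m, f, m, t)
                  for t, f, m in scored if t <= best * (1 + tolerance)]
    _, f, m, t = min(acceptable)
    return f, m, t
```

Under these assumptions, a larger fetch size means fewer batches and therefore fewer first messages paying the latency, while the tolerance keeps the search from choosing very large buffers for a negligible gain in time.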

Cited literature [38 references]
Contributor : Abes Star
Submitted on : Thursday, March 29, 2018 - 10:02:06 AM
Last modification on : Wednesday, March 4, 2020 - 12:28:03 PM
Long-term archiving on : Thursday, September 13, 2018 - 11:43:29 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01746126, version 1



Abdeslem Belghoul. Optimizing Communication Cost in Distributed Query Processing. Databases [cs.DB]. Université Clermont Auvergne, 2017. English. ⟨NNT : 2017CLFAC025⟩. ⟨tel-01746126⟩


