
Querying heterogeneous data in NoSQL document stores

Hamdi Ben Hamadou 1
1 IRIT-SIG - Systèmes d’Informations Généralisées
IRIT - Institut de recherche en informatique de Toulouse
Abstract : This thesis addresses the problems of querying heterogeneous data in document-oriented systems. Document-oriented "not-only SQL" (NoSQL) storage systems have developed significantly in recent years thanks to their ability to manage large numbers of documents flexibly and efficiently. These systems rely on the "schema-less" concept: there is no requirement to define a single schema for a set of data, called a collection of documents. This flexibility in data structures makes query formulation more complex, because users need to know all the different schemas of the data being manipulated when they formulate a query. The work developed in this thesis is part of the neOCampus project. It focuses on issues in manipulating and querying structurally heterogeneous document collections, mainly the problem of variable schemas. We propose the construction of a data dictionary that makes it possible to find all the schemas of the documents. Each key, a dictionary entry, corresponds to an absolute or partial path existing in at least one document of the collection. This key is associated with all the corresponding absolute paths throughout the collection of heterogeneous documents. The dictionary is then exploited to reformulate user queries automatically and transparently. User queries are formulated using the dictionary keys (partial or absolute paths) and are automatically reformulated using the dictionary to consider all the existing paths in all documents of the collection. In this thesis, we conduct a state-of-the-art survey of the work related to querying data with heterogeneous structures, and we propose a classification. Then, we compare these works according to criteria that make it possible to position our contribution. We formally define the classical concepts related to document-oriented systems (document, collection, etc.).
Then, we extend this formalisation with additional concepts: absolute and partial paths, document schemas, and the dictionary. For manipulating and querying heterogeneous documents, we define a closed minimal algebraic kernel composed of five operators: selection, projection, unnest, aggregation and join (left join). We define each operator and explain its classical evaluation by the native document querying engine. Then we establish the reformulation rules for each of these operators based on the use of the dictionary. We define the process of reformulating user queries, which produces a query that can be evaluated by most document querying engines while keeping the logic of the classical operators (misleading paths, null values). We show how the reformulation of a query initially constructed with partial and/or absolute paths makes it possible to solve the problem of structural heterogeneity of documents. Finally, we conduct experiments to validate the formal concepts that we introduce throughout this thesis. We evaluate the construction and maintenance of the dictionary by varying the number of structures per collection and the collection size. Then, we evaluate the query reformulation engine by comparing it to query evaluation in a context without structural heterogeneity, and then to a context in which multiple queries are executed. All our experiments were conducted on synthetic collections with several levels of nesting, different numbers of structures per collection, and varying collection sizes. Recently, we deployed our contributions in the neOCampus project to query data from heterogeneous sensors installed in different classrooms and in the library on the campus of the University of Toulouse III-Paul Sabatier.
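The dictionary and the reformulation of a selection can be illustrated with a minimal Python sketch. It is not the implementation from the thesis. It rests on several simplifying assumptions: documents are nested dicts without arrays, a partial path is taken to be any suffix of an absolute path, and a selection on one path is rewritten as a MongoDB-style `$or` filter over all matching absolute paths. The function names `absolute_paths`, `build_dictionary` and `reformulate_selection` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def absolute_paths(doc, prefix=""):
    """Yield every absolute path (dot notation) in a nested dict.

    Assumption: arrays are not traversed in this sketch.
    """
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from absolute_paths(value, path)


def build_dictionary(collection):
    """Map each absolute or partial path to the absolute paths it denotes.

    Assumption: a partial path is any suffix of an absolute path.
    """
    dictionary = defaultdict(set)
    for doc in collection:
        for abs_path in absolute_paths(doc):
            parts = abs_path.split(".")
            for start in range(len(parts)):
                dictionary[".".join(parts[start:])].add(abs_path)
    return dict(dictionary)


def reformulate_selection(dictionary, path, value):
    """Rewrite an equality selection on `path` as a disjunction over
    every absolute path recorded for that dictionary key."""
    targets = sorted(dictionary.get(path, {path}))
    if len(targets) == 1:
        return {targets[0]: value}
    return {"$or": [{target: value} for target in targets]}


# Three documents holding the same information under different structures.
collection = [
    {"title": "Heat", "year": 1995},
    {"film": {"title": "Heat", "year": 1995}},
    {"description": {"title": "Alien"}},
]

dictionary = build_dictionary(collection)
query = reformulate_selection(dictionary, "title", "Heat")
print(query)
```

In this sketch the user writes the selection once, on the partial path `title`, and the rewritten filter covers `title`, `film.title` and `description.title`, so no knowledge of the individual document structures is needed when formulating the query.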

Cited literature [92 references]
Contributor : Abes Star
Submitted on : Friday, May 29, 2020 - 2:58:07 PM
Last modification on : Wednesday, November 3, 2021 - 6:52:30 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03163663, version 2


Hamdi Ben Hamadou. Querying heterogeneous data in NoSQL document stores. Databases [cs.DB]. Université Paul Sabatier - Toulouse III, 2019. English. ⟨NNT : 2019TOU30146⟩. ⟨tel-03163663v2⟩


