
Clustering-based Approximate Answering of Query Result in Large and Distributed Databases

Abstract : Database systems are increasingly used for interactive and exploratory data retrieval. In such retrievals, users' queries often return too many answers, so users waste significant time and effort sifting and sorting through these answers to find the relevant ones. In this thesis, we first propose an efficient and effective algorithm, coined the Explore-Select-Rearrange Algorithm (ESRA) and based on the SAINTETIQ model, to quickly provide users with hierarchical clustering schemas of their query results. SAINTETIQ is a domain knowledge-based approach that provides multi-resolution summaries of structured data stored in a database. Each node (or summary) of the hierarchy provided by ESRA describes a subset of the result set in a user-friendly form based on domain knowledge. The user then navigates through this hierarchy in a top-down fashion, exploring the summaries of interest while ignoring the rest. Experimental results show that the ESRA algorithm is efficient and provides well-formed (tight and clearly separated) and well-organized clusters of query results. The ESRA algorithm assumes that the summary hierarchy of the queried data has already been built using SAINTETIQ and is available as input. However, SAINTETIQ requires full access to the data to be summarized. This requirement severely limits the applicability of the ESRA algorithm in a distributed environment, where data is spread across many sites and transmitting it to a central site is not feasible or even desirable. The second contribution of this thesis is therefore a solution for summarizing distributed data without a prior “unification” of the data sources. We assume that the sources maintain their own summary hierarchies (local models), and we propose new algorithms for merging them into a single final one (global model). An experimental study shows that our merging algorithms produce high-quality clustering schemas of the entire distributed data and are very efficient in terms of computational time.

Cited literature : 56 references
Contributor : Marc Gelgon
Submitted on : Friday, April 23, 2010 - 12:16:55 PM
Last modification on : Wednesday, April 11, 2018 - 1:57:10 AM
Long-term archiving on : Tuesday, September 28, 2010 - 12:21:59 PM


  • HAL Id : tel-00475917, version 1



Mounir Bechchi. Clustering-based Approximate Answering of Query Result in Large and Distributed Databases. Human-Computer Interaction [cs.HC]. Université de Nantes, 2009. English. ⟨tel-00475917⟩


