Querying semantic web/linked data graphs using summarization

Abstract : The amount of available RDF data is growing fast in both size and complexity, and RDF Knowledge Bases (KBs) with millions or even billions of triples have become common. For example, more than 1000 datasets are now published as part of the Linked Open Data (LOD) cloud, which contains more than 62 billion RDF triples forming large and complex RDF data graphs. This explosion in the size, complexity and number of available RDF KBs, together with the emergence of Linked Datasets, has made querying, exploring, visualizing and understanding the data in these KBs difficult from both a human perspective (when trying to visualize) and a machine perspective (when trying to query or compute). To tackle this problem, we propose a method for summarizing large RDF KBs that represents the RDF graph by its (best) top-k approximate RDF graph patterns. The method, named SemSum+, extracts the meaningful, descriptive information from RDF KBs and produces a succinct overview of them. It extracts from the RDF graph an RDF schema that describes the actual contents of the KB, which has various advantages even over an existing schema, since the data in the KB might use that schema only partially. While computing the approximate RDF graph patterns, we also record the number of instances that each pattern represents. So, when we query the RDF summary graph, we can easily identify whether the necessary information is present and, if so, whether it is present in sufficient numbers to be included in a federated query result. The proposed method does not require the initial schema of the KB and works equally well when there is no schema information at all (a realistic situation with modern KBs, which are constructed either ad hoc or by merging fragments of other existing KBs).
Additionally, the proposed method works equally well with homogeneous RDF graphs (having the same structure) and heterogeneous ones (having different structures, possibly because the data are described under different schemas/ontologies).
Given that RDF graphs can be large and complex, methods that compute the summary by fitting the whole graph in the memory of a single machine, however large, will not scale. To overcome this problem, we propose, as part of this thesis, a parallel framework that yields a scalable parallel version of our method. This allows us to compute the summary of any RDF graph regardless of its size. We also generalized this framework so that it can be used by any approximate pattern mining algorithm that needs parallelization.
Working on this problem led us to the issue of measuring the quality of the produced summaries. Given that various algorithms exist in the literature for summarizing RDF graphs, we need to understand which one is better suited to a specific task or a specific RDF KB. The literature lacks both widely accepted evaluation criteria and an extensive empirical evaluation, which creates the need for a method to compare and evaluate the quality of the produced summaries. So, in this thesis, we provide a comprehensive Quality Framework for RDF Graph Summarization to fill this gap. The framework allows a better, deeper and more complete understanding of the quality of different summaries and facilitates their comparison. It is independent of the way RDF summarization algorithms work and makes no assumptions about the type or structure of either the input or the final results. We provide a set of metrics that help us understand not only whether a summary is valid but also how one summary compares to another in terms of the specified quality characteristic(s).
The framework's ability to capture subtle differences among summaries, and to produce metrics that reflect them, was validated experimentally. The framework was also used to provide an extensive experimental evaluation and comparison of our method.
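
The abstract states that each summary pattern carries the number of instances it represents, so that a query can first be checked against the summary. The sketch below illustrates that idea only; it is not SemSum+ itself. The `Pattern` class, the `can_answer` function, the `min_instances` threshold and the sample property names are all hypothetical, and a pattern is simplified here to a plain set of properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical summary pattern: a set of properties plus an instance count."""
    properties: frozenset
    instances: int


def can_answer(summary, query_properties, min_instances=1):
    """Return the patterns that cover every queried property
    and represent at least `min_instances` instances."""
    needed = frozenset(query_properties)
    return [
        p for p in summary
        if needed <= p.properties and p.instances >= min_instances
    ]


# Illustrative summary (made-up values, not taken from the thesis).
summary = [
    Pattern(frozenset({"foaf:name", "foaf:mbox"}), 1200),
    Pattern(frozenset({"foaf:name", "dbo:birthDate"}), 3),
]

# The second pattern covers both properties but has too few instances,
# so this KB would be skipped under a threshold of 10.
matches = can_answer(summary, ["foaf:name", "dbo:birthDate"], min_instances=10)
```

In a federated setting, a check of this kind could be run against each KB's summary before the query is sent, so that KBs with no matching pattern, or with too few instances, are left out.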
Document type : Theses
Cited literature [122 references]

https://tel.archives-ouvertes.fr/tel-02861761
Contributor : Abes Star
Submitted on : Tuesday, June 9, 2020 - 10:56:16 AM
Last modification on : Wednesday, October 14, 2020 - 4:05:17 AM

File

61663_ZNEIKA_2019_diffusion.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02861761, version 1

Citation

Mussab Zneika. Querying semantic web/linked data graphs using summarization. Technology for Human Learning. Université de Cergy Pontoise, 2019. English. ⟨NNT : 2019CERG1010⟩. ⟨tel-02861761⟩
