Découverte automatique de schéma pour les données irrégulières et massives (Automatic schema discovery for irregular and massive data)

Abstract : The web of data is a huge global data space, built on semantic web technologies, in which a large number of sources are published and interlinked. This data space makes an unprecedented amount of knowledge available to novel applications, but meaningful use of its sources is often difficult because they lack a schema describing their content. Several automatic schema discovery approaches have been proposed; while they produce good-quality schemas, applying them to massive data sources is challenging because they rely on costly algorithms. In our work, we are interested in both the scalability and the incrementality of schema discovery approaches for RDF data sources whose schema is incomplete or missing. Furthermore, we extend schema discovery to take into account not only the explicit information provided by a data source, but also the implicit information that can be inferred from it.

Our first contribution is a scalable schema discovery approach that extracts the classes describing the content of a massive RDF data source. We propose to extract a condensed representation of the source, which is used as input to the schema discovery process in order to improve its performance. This representation is a set of patterns, each one representing a combination of properties describing some entities in the dataset. We also propose a scalable schema discovery approach relying on a distributed clustering algorithm that forms groups of structurally similar entities, which represent the classes of the schema.

Our second contribution aims to keep the generated schema consistent with the data source it describes, as the source may evolve over time. We propose an incremental schema discovery approach that modifies the set of extracted classes by propagating the changes occurring at the source.

Finally, our third contribution extends schema discovery to consider the whole semantics expressed by a data source, which is represented not only by the explicitly declared triples, but also by those that can be inferred through reasoning. We propose an extension that takes into account all the properties of an entity during schema discovery, whether they are represented by explicit or implicit triples, which improves the quality of the generated schema.
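
To make the notion of a pattern more concrete, here is a minimal single-machine sketch in Python. It is not the algorithm from the thesis: the function names `extract_patterns` and `jaccard`, the toy `ex:` triples, and the choice of Jaccard similarity as the measure of structural similarity are all assumptions made for illustration. It groups entities by the exact set of properties they use, so that a later clustering step could compare a few patterns instead of every entity.

```python
from collections import defaultdict


def extract_patterns(triples):
    """Map each distinct property set (a 'pattern') to the entities that use it.

    `triples` is an iterable of (subject, property, object) tuples.
    Hypothetical helper, not taken from the thesis.
    """
    props_by_entity = defaultdict(set)
    for subject, prop, _obj in triples:
        props_by_entity[subject].add(prop)

    patterns = defaultdict(set)
    for entity, props in props_by_entity.items():
        patterns[frozenset(props)].add(entity)
    return dict(patterns)


def jaccard(a, b):
    """Jaccard similarity between two property sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy data: three person-like entities and one city-like entity.
triples = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:email", "a@example.org"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:bob", "ex:email", "b@example.org"),
    ("ex:carol", "ex:name", "Carol"),
    ("ex:paris", "ex:label", "Paris"),
    ("ex:paris", "ex:population", "2100000"),
]

patterns = extract_patterns(triples)
for props, entities in patterns.items():
    print(sorted(props), "->", sorted(entities))
```

In this toy data the four entities collapse into three patterns. The pattern {ex:name, ex:email} and the pattern {ex:name} overlap on one of two properties, so a similarity-based clustering could plausibly place them in the same class, while the {ex:label, ex:population} pattern shares nothing with either.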

https://tel.archives-ouvertes.fr/tel-03526247
Contributor : Abes Star
Submitted on : Friday, January 14, 2022 - 1:12:07 PM
Last modification on : Sunday, January 16, 2022 - 3:27:32 AM

File

92410_BOUHAMOUM_2021_archivage...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-03526247, version 1

Citation

Redouane Bouhamoum. Découverte automatique de schéma pour les données irrégulières et massives. Databases [cs.DB]. Université Paris-Saclay, 2021. French. ⟨NNT : 2021UPASG081⟩. ⟨tel-03526247⟩
