Liage de données ouvertes et hétérogènes : application au domaine musical

Abstract : This thesis is part of the ANR DOREMUS project. We are interested in the catalogs of three cultural institutions: BNF (Bibliothèque Nationale de France), Philharmonie de Paris and Radio France, containing detailed descriptions about music works. These institutions have adopted the Semantic Web technologies with the aim of making these data accessible to all and linked.The links creation becomes particularly difficult considering the high heterogeneity between the descriptions of the same entity. In this thesis, our main objective is to propose a generic data linking approach, dealing with certain challenges, for a concrete application on DOREMUS data. We focus on three major challenges: (1) reducing the tool configuration effort, (2) coping with different kinds of data heterogeneities across datasets and (3) dealing with datasets containing blocks of highly similar instances. Some of the existing linking approaches often require the user intervention during the linking process to configure some parameters. This may be a costly task for theuser who may not be an expert in the domain. Therefore, one of the researchquestions that arises is how to reduce human intervention as much as possible inthe process of data linking. Moreover, the data can show various heterogeneitiesthat a linking tool has to deal with. The descriptions can be expressed in differentnatural languages, with different vocabularies or with different values. The comparison can be complicated due to the variations according to three dimensions: value-based, ontological and logical. Another challenge is the distinction between highly similar but not equivalent resource descriptions. In their presence, most of the existing tools are reduced in efficiency generating false positive matches. In this perspective, some approaches have been proposed to identify a set of discriminative properties called keys. Very often, such approaches discover a very large number of keys. The question that arises is whether all keys can discover the same pairs of equivalent instances, or ifsome are more meaningful than others. No approach provides a strategy to classify the keys generated according to their effectiveness to discover the correct links.We developed Legato — a generic tool for automatic heterogeneous data linking.It is based on instance profiling to represent each resource as a textual documentof literals dealing with a variety of data heterogeneities. It implementsa filtering step of so-called problematic properties allowing to clean the data ofthe noise likely to make the comparison task difficult. To address the problem ofsimilar but distinct resources, Legato implements a key ranking algorithm calledRANKey.
Document type :
Complete list of metadatas

Cited literature [96 references]  Display  Hide  Download
Contributor : Abes Star <>
Submitted on : Thursday, June 7, 2018 - 5:40:10 PM
Last modification on : Wednesday, January 30, 2019 - 4:18:23 PM
Long-term archiving on : Saturday, September 8, 2018 - 3:11:38 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01810363, version 1


Manel Achichi. Liage de données ouvertes et hétérogènes : application au domaine musical. Other [cs.OH]. Université Montpellier, 2018. English. ⟨NNT : 2018MONTS002⟩. ⟨tel-01810363⟩



Record views


Files downloads