On entity resolution in probabilistic data

Naser Ayat 1, 2, 3
2 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Entity resolution (ER) is the problem of identifying duplicate tuples, i.e. tuples that represent the same real-world entity. The ER problem arises in many real-life applications. These range from news aggregation websites, which must identify news items covering the same story to avoid presenting one story to the user several times, to the integration of two companies' customer databases after a merger, where identifying the tuples that refer to the same customer is crucial. Because of its diverse applications, the ER problem has been formulated in different ways in the literature. The two main formulations are: 1) identity resolution, and 2) deduplication. In identity resolution, the aim is to find the duplicate(s) of a given tuple in a given database, while in deduplication, the aim is to find groups of duplicate tuples in a given database and merge them in order to increase the quality of the database itself. The ER problem is not limited to deterministic (ordinary) databases, however; it also arises in applications that deal with probabilistic databases, i.e. databases in which each tuple or attribute value is associated with a probability value that indicates, for instance, its confidence level. In this thesis, we study the ER problem in probabilistic databases. More specifically, we study the following problems: 1) identity resolution in probabilistic data, 2) identity resolution in distributed probabilistic data, 3) deduplication in probabilistic data, and 4) schema matching in a fully automated setting.
Document type : Theses

Cited literature : 272 references

https://tel.archives-ouvertes.fr/tel-01073363
Contributor : Patrick Valduriez
Submitted on : Thursday, October 9, 2014 - 3:37:30 PM
Last modification on : Thursday, May 24, 2018 - 3:59:21 PM
Long-term archiving on : Saturday, January 10, 2015 - 10:51:18 AM

Identifiers

  • HAL Id : tel-01073363, version 1

Citation

Naser Ayat. On entity resolution in probabilistic data. Databases [cs.DB]. Universiteit van Amsterdam, 2014. English. ⟨tel-01073363⟩
