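Apprentissage incrémental pour la construction de bases lexicales évolutives : application en désambiguïsation d'entités nommées (Incremental learning for building evolving lexical databases: application to named entity disambiguation)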

Thomas Girault 1, 2
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Some natural language processing applications must handle textual data streams whose vocabulary evolves, both through the creation of new words and through changes in the meaning of existing words. In light of these observations, we have developed an incremental algorithm that automatically builds an evolving lexical database for identifying the lexical units observed in a textual data stream. We use a concept lattice to build the lexical database from a semantically unlabelled corpus. The lattice allows us to infer formal concepts (similar to units of meaning) organized into several levels of granularity, ranging from very specific to very general. This structured representation is complemented by a cartographic model that takes into account the continuous aspects of meaning and the semantic proximity between concepts. This property is exploited to propagate the classification of a small number of named entities (NEs: lexical units that usually refer to people, places, organizations, etc.) to other NEs observed in unlabelled data streams during the incremental construction of the lattice. Once the lexical database is built, the concepts are enriched with NE labels observed in a training corpus. The concepts and their attached labels are then used, respectively, for unsupervised annotation and supervised classification of NEs in a test corpus.
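To make the notion of formal concepts organized from specific to general more concrete, here is a minimal Python sketch of batch formal concept analysis on a toy context. It is not the incremental algorithm of the thesis: the data (`CONTEXT`), the function names, and the brute-force enumeration are all assumptions chosen for illustration only.

```python
from itertools import combinations

# Hypothetical toy context: each lexical unit (object) is mapped to the
# context words (attributes) it was observed with. Not data from the thesis.
CONTEXT = {
    "Paris": {"city", "visit"},
    "Rennes": {"city", "visit"},
    "Hilton": {"visit", "hotel"},
}


def common_attributes(objects, context):
    """Return the attributes shared by every object (all attributes if none given)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def matching_objects(attributes, context):
    """Return the objects that carry every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of objects.

    Brute force (exponential in the number of objects): fine for a toy
    example, unsuitable for a real data stream.
    """
    concepts = set()
    objects = sorted(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context)
            extent = matching_objects(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    # Print concepts from the most specific (small extent) to the most general.
    for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))
```

In this toy context, the concept grouping "Paris" and "Rennes" under the shared attributes "city" and "visit" is more specific than the concept grouping all three units under "visit" alone, which mirrors the granularity levels mentioned in the abstract.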

https://tel.archives-ouvertes.fr/tel-00867236
Contributor : Pascale Sébillot
Submitted on : Friday, September 27, 2013 - 6:04:58 PM
Last modification on : Thursday, November 15, 2018 - 11:57:44 AM
Long-term archiving on: Saturday, December 28, 2013 - 4:33:17 AM

Identifiers

  • HAL Id : tel-00867236, version 1

Citation

Thomas Girault. Apprentissage incrémental pour la construction de bases lexicales évolutives : application en désambiguïsation d'entités nommées. Document and Text Processing. Université Rennes 1, 2010. French. ⟨tel-00867236⟩
