
Approches supervisées et faiblement supervisées pour l’extraction d’événements et le peuplement de bases de connaissances (Supervised and weakly supervised approaches for event extraction and knowledge base population)

Abstract : Most of the information available on the web is provided in textual, i.e. unstructured, form. In a context such as technology watch, it is useful to present the information extracted from a text in a structured form, reporting only the pieces of information relevant to the field of interest. Such processing cannot be performed manually at large scale, given the amount of data available. Automating this task falls within the domain of information extraction (IE).

The purpose of IE is to identify, within documents, pieces of information related to facts (or events) in order to store this information in predefined data structures. These structures, called templates, aggregate the properties of a fact, often represented by named entities, concerning an event or an area of interest.

In this context, the research performed in this thesis addresses two problems:

  • identifying the information related to a specific event when that information is scattered across a text and several events of the same type are mentioned in the text;
  • reducing the dependence on annotated corpora for the implementation of an information extraction system.

Concerning the first problem, we propose an original two-step approach. The first step performs an event-based text segmentation, which identifies within a document the text segments on which the IE process should focus when looking for the entities associated with a given event. The second step focuses on template filling: using a graph-based method, it selects, within the segments identified as relevant by the event-based segmentation, the entities that should be used as fillers. This method relies on a local extraction of relations between entities, which are then merged into a relation graph. A disambiguation step is then performed on the graph to identify the best candidates to fill the information template.

The second problem is addressed in the context of knowledge base (KB) population, using a large collection of texts (several million) from which the information is extracted. This extraction also covers a large number of relation types (more than 40), which makes manual annotation of the collection too expensive. In this context, we propose a distant supervision approach that makes it possible to apply learning techniques to this extraction without a fully annotated corpus. The approach uses a set of relations from an existing KB to annotate a collection automatically, without manual supervision, and a relation extraction model is then learned from this annotated collection. It has been evaluated at large scale on the data from the TAC-KBP 2010 evaluation campaign.
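The distant supervision idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the seed facts in `KB_FACTS`, the relation names, the example sentences, and the word-count classifier, which merely stands in for whatever learning technique the thesis actually uses. The sketch labels every sentence that contains an entity pair known to the KB with that pair's relation, then learns from the resulting, possibly noisy, examples.

```python
from collections import Counter

# Hypothetical seed facts standing in for an existing knowledge base:
# (subject entity, object entity) -> relation type.
KB_FACTS = {
    ("Marie Curie", "Warsaw"): "born_in",
    ("Marie Curie", "Pierre Curie"): "spouse",
}

# Hypothetical unannotated sentences, with entity mentions already detected.
SENTENCES = [
    ("Marie Curie was born in Warsaw in 1867.", ["Marie Curie", "Warsaw"]),
    ("Marie Curie married Pierre Curie in 1895.", ["Marie Curie", "Pierre Curie"]),
    ("Marie Curie returned to Warsaw in 1913.", ["Marie Curie", "Warsaw"]),
]


def distant_label(sentences, kb_facts):
    """Label each sentence containing a KB entity pair with that pair's relation."""
    examples = []
    for text, entities in sentences:
        for subj in entities:
            for obj in entities:
                relation = kb_facts.get((subj, obj))
                if relation is not None:
                    examples.append((text, subj, obj, relation))
    return examples


def context_features(text, subj, obj):
    """Lowercased words between the two mentions (a deliberately crude feature set)."""
    start = text.index(subj) + len(subj)
    end = text.index(obj)
    return text[start:end].lower().split()


def train(examples):
    """Count how often each context word co-occurs with each relation."""
    counts = {}
    for text, subj, obj, relation in examples:
        counts.setdefault(relation, Counter()).update(context_features(text, subj, obj))
    return counts


def predict(model, text, subj, obj):
    """Pick the relation whose context-word counts best match the sentence."""
    words = context_features(text, subj, obj)
    scores = {rel: sum(c[w] for w in words) for rel, c in model.items()}
    return max(scores, key=scores.get)


examples = distant_label(SENTENCES, KB_FACTS)
model = train(examples)
```

Note that the third sentence is labeled `born_in` although it does not express a birth place. This kind of noise is inherent to annotating text from KB facts alone, which is one reason such an approach needs evaluation at scale.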
Document type :
Theses

Cited literature : 170 references

https://tel.archives-ouvertes.fr/tel-00686811
Contributor : ABES STAR
Submitted on : Wednesday, April 11, 2012 - 11:52:26 AM
Last modification on : Thursday, February 17, 2022 - 10:08:04 AM
Long-term archiving on : Monday, November 26, 2012 - 1:17:21 PM

File

VD2_JEAN-LOUIS_LUDOVIC_1512201...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00686811, version 1

Collections

CEA | STAR | DRT | LIST

Citation

Ludovic Jean-Louis. Approches supervisées et faiblement supervisées pour l’extraction d’événements et le peuplement de bases de connaissances. Other [cs.OH]. Université Paris Sud - Paris XI, 2011. French. ⟨NNT : 2011PA112288⟩. ⟨tel-00686811⟩

Metrics

Record views : 335
File downloads : 946