Formalisation de connaissances à partir de corpus : modélisation linguistique du contexte pour l'extraction automatique de relations sémantiques

Ismaïl El Maarouf 1
VALORIA - Laboratoire de Recherche en Informatique et ses Applications de Vannes et Lorient
Abstract : Corpora, text collections selected for specific purposes, play an increasing role in Linguistics and Natural Language Processing (NLP). They are treated as sources of knowledge about natural language use as well as about the entities designated by linguistic expressions, and they are used in particular to evaluate the performance of NLP applications. The criteria governing their compilation have an obvious, though still hard to characterize, impact on (i) the major linguistic structures they contain, (ii) the knowledge they convey, and (iii) the success of computational systems on a given task. This thesis studies methodologies for the automatic extraction of semantic relations from written text corpora. Such a topic calls for a detailed examination of the context in which a given expression occurs, and for the discovery of the features that determine its meaning, so that semantic units can be linked. Contextual models are generally built from co-occurrence analysis of linguistic information drawn from resources and NLP tools. The benefits and limits of this information are evaluated in a relation extraction task on corpora belonging to different genres (press articles, fairy tales, biographies). The results show that this information is insufficient both for reaching a satisfactory semantic representation and for designing robust systems. Two problems are addressed in particular. First, it seems indispensable to add information related to text genre. To characterize the impact of genre on semantic relations, an automatic classification method relying on the semantic restrictions that hold between verbs and nouns is proposed. The method is tested on a fairy tale corpus and on a press corpus. Second, contextual models need to deal with variation in discourse surface structure. In a text, related linguistic expressions are not always close to one another, and complex algorithms are sometimes needed to detect long-distance dependencies. To address this problem coherently, a discourse segmentation method based on surface structure triggers in written corpora is proposed. It paves the way for grammars operating on macro-syntactic categories in order to structure the discursive representation of a sentence. This method is applied prior to syntactic analysis, and the improvement it brings is evaluated. The solutions proposed to these problems help to approach Information Extraction from a particular angle : the implemented system is evaluated on a task of Named Entity correction in the context of a Question-Answering system. This specific need entails aligning the definition of a category with the type of answer expected by the question.
Contributor : Ismaïl El Maarouf
Submitted on : Monday, January 9, 2012 - 11:30:40 AM
Last modification on : Monday, October 19, 2020 - 11:01:45 AM
Long-term archiving on : Tuesday, April 10, 2012 - 2:22:09 AM


  • HAL Id : tel-00657708, version 1



Ismaïl El Maarouf. Formalisation de connaissances à partir de corpus : modélisation linguistique du contexte pour l'extraction automatique de relations sémantiques. Linguistique. Université de Bretagne Sud, 2011. Français. ⟨tel-00657708⟩


