Intégration de ressources lexicales riches dans un analyseur syntaxique probabiliste (Integrating rich lexical resources into a probabilistic syntactic parser)

Abstract : This thesis focuses on integrating French lexical and syntactic resources into two fundamental Natural Language Processing (NLP) tasks: probabilistic part-of-speech tagging and probabilistic parsing. For French, a large amount of lexical and syntactic data is available, created either by automatic processes or by linguists. A number of experiments have shown the value of using such resources in tagging and parsing, since they can significantly improve system performance. In this thesis, we use these resources to address two problems, described briefly below: data sparseness and automatic text segmentation.

Thanks to increasingly sophisticated parsing algorithms, parsing accuracy is improving for many languages, including French. However, several problems are inherent in the mathematical formalisms that statistically model the task (grammars, discriminative models, ...). Data sparseness is one of them, and is mainly caused by the small size of the annotated corpora available for the language. It is the difficulty of estimating the probability of syntactic phenomena that appear in the texts to be analyzed but are rare in, or absent from, the corpus used to train parsers. Moreover, sparseness has been shown to be partly a lexical problem: the richer the morphology of a language, the sparser the lexicons built from a treebank will be for that language. Our first problem is therefore to mitigate the negative impact of lexical data sparseness on parsing performance. To this end, we investigate word clustering, a method that groups the words of the training corpus and of the texts to be analyzed into clusters. These clusters reduce the number of unknown words, and therefore the number of rare or unknown lexicon-related syntactic phenomena, in the texts to be analyzed.
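The effect of clustering on unknown words can be illustrated with a minimal sketch. It is not the method developed in the thesis: the toy lexicon, the cluster labels (`V_trans`, `V_intrans`) and the helper names are all hypothetical, chosen only to show how replacing words by lexicon-derived clusters can lower the proportion of test tokens never seen in training.

```python
# Hypothetical toy lexicon: each verb form is mapped to a cluster label
# derived from syntactic information (here, an invented transitivity class).
LEXICON_CLUSTERS = {
    "mange": "V_trans",
    "dévore": "V_trans",
    "dort": "V_intrans",
    "ronfle": "V_intrans",
}


def to_cluster(word):
    """Replace a word by its cluster label; words outside the lexicon are kept."""
    return LEXICON_CLUSTERS.get(word, word)


def unknown_rate(train_tokens, test_tokens):
    """Proportion of test tokens that never occur in the training tokens."""
    known = set(train_tokens)
    unknown = [tok for tok in test_tokens if tok not in known]
    return len(unknown) / len(test_tokens)


train = ["le", "chat", "mange", "et", "dort"]
test = ["le", "chat", "dévore", "et", "ronfle"]

# Unknown-word rate on raw word forms, then on clustered forms.
raw_rate = unknown_rate(train, test)
clustered_rate = unknown_rate(
    [to_cluster(w) for w in train],
    [to_cluster(w) for w in test],
)

print(raw_rate, clustered_rate)
```

In this toy setting, "dévore" and "ronfle" are unknown as raw forms but fall into clusters already observed in training, which is the intuition behind using clusters as parser input.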
Our goal is to propose word clustering methods based on syntactic information from French lexicons, and to observe their impact on parser accuracy.

Furthermore, most evaluations of probabilistic tagging and parsing have been performed with a perfect segmentation of the text, identical to that of the evaluated corpus. In real applications, however, such a segmentation is rarely available, and automatic segmentation tools fall short of producing a high-quality one because of the many multi-word units (compound words, named entities, ...). In this thesis, we focus on contiguous multi-word units, called compound words, which form lexical units that can be assigned a part-of-speech tag. Searching for compound words can thus be viewed as a text segmentation task. Our second problem is therefore the automatic segmentation of French texts and its impact on the performance of automatic processes. To address it, we adopt an approach that couples, in a single probabilistic model, compound word recognition with another task, in our case either parsing or tagging. Compound words are thus recognized within the probabilistic process rather than in a preliminary phase. Our goal is to propose innovative strategies for integrating compound word resources into both processes, which combine probabilistic tagging or parsing with text segmentation.
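To make the view of compound recognition as segmentation concrete, the sketch below enumerates the candidate segmentations of a sentence given a compound lexicon. It is only an illustration under stated assumptions: the lexicon `COMPOUNDS`, the limit `MAX_LEN` and the function `segmentations` are hypothetical, and no probabilistic model is included. In a coupled approach such as the one described above, the choice among these candidates would be left to the tagger or parser rather than fixed in a preliminary step.

```python
# Hypothetical compound lexicon: contiguous token sequences that may form
# a single lexical unit.
COMPOUNDS = {("pomme", "de", "terre"), ("bien", "que")}
MAX_LEN = 3  # longest compound considered, in tokens


def segmentations(tokens):
    """Return every segmentation obtained by optionally merging known compounds."""
    if not tokens:
        return [[]]
    results = []
    # Option 1: keep the first token as a simple word.
    for rest in segmentations(tokens[1:]):
        results.append([tokens[0]] + rest)
    # Option 2: merge a prefix that matches a compound from the lexicon.
    for n in range(2, MAX_LEN + 1):
        prefix = tuple(tokens[:n])
        if len(prefix) == n and prefix in COMPOUNDS:
            for rest in segmentations(tokens[n:]):
                results.append(["_".join(prefix)] + rest)
    return results


sentence = ["la", "pomme", "de", "terre", "cuit"]
candidates = segmentations(sentence)

for candidate in candidates:
    print(candidate)
```

For this sentence the sketch yields two candidates, one keeping "pomme", "de" and "terre" as separate tokens and one merging them into the single unit "pomme_de_terre".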

Cited literature : 174 references
Contributor : Abes Star
Submitted on : Wednesday, February 27, 2013 - 5:22:31 PM
Last modification on : Wednesday, April 11, 2018 - 12:12:03 PM
Long-term archiving on : Sunday, April 2, 2017 - 6:30:17 AM


Version validated by the jury (STAR)


  • HAL Id : tel-00795309, version 1


Anthony Sigogne. Intégration de ressources lexicales riches dans un analyseur syntaxique probabiliste. Other [cs.OH]. Université Paris-Est, 2012. French. ⟨NNT : 2012PEST1106⟩. ⟨tel-00795309⟩


