Désambiguïsation lexicale de l'arabe pour et par la traduction automatique

Abstract : This thesis studies Word Sense Disambiguation (WSD), a central task in natural language processing that can improve applications such as machine translation and information extraction. Research in word sense disambiguation predominantly concerns English, because most other languages lack a standard lexical reference for annotating corpora, and also lack the sense-annotated corpora needed to evaluate and, more importantly, to build word sense disambiguation systems. In English, the lexical database WordNet is a long-standing de facto standard used in most sense-annotated corpora and in most WSD evaluation campaigns. Our contribution in this thesis covers two areas. First, we present a method for automatically creating sense-annotated corpora for any language, by taking advantage of the large amount of English corpora annotated with WordNet senses and by using a machine translation system. This method is applied to Arabic and evaluated on what is, to our knowledge, the only Arabic corpus manually sense-annotated with WordNet: the Arabic OntoNotes 5.0, which we have semi-automatically enriched. The evaluation relies on an implementation of two supervised word sense disambiguation systems trained on the corpora produced with our method. We thus propose a solid baseline for the evaluation of future Arabic word sense disambiguation systems, in addition to sense-annotated Arabic corpora that we provide as a freely available resource. Second, we propose an in vivo evaluation of our Arabic word sense disambiguation system by measuring its contribution to the performance of the machine translation task.
Document type : Thesis

Cited literature : 125 references

https://tel.archives-ouvertes.fr/tel-02139438
Contributor : Abes Star
Submitted on : Friday, May 24, 2019 - 4:43:06 PM
Last modification on : Friday, October 25, 2019 - 1:27:07 AM

File

HADJ_SALAH_2018_Archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02139438, version 1

Collections

STAR | LIG | CNRS | UGA

Citation

Marwa Hadj Salah. Désambiguïsation lexicale de l'arabe pour et par la traduction automatique. Document and Text Processing. Université Grenoble Alpes; Université de Sfax (Tunisie). Faculté des Sciences économiques et de gestion, 2018. French. ⟨NNT : 2018GREAM089⟩. ⟨tel-02139438⟩

Metrics

Record views : 210
File downloads : 174