Construction automatique d'outils et de ressources linguistiques à partir de corpus parallèles

Othman Zennaki 1, 2
2 LVIC - Laboratoire Vision et Ingénierie des Contenus
DIASI - Département Intelligence Ambiante et Systèmes Interactifs : DRT/LIST/DIASI
Abstract: This thesis focuses on the automatic construction of linguistic tools and resources for analyzing texts in low-resource languages. We propose an approach based on Recurrent Neural Networks (RNNs) that requires only a parallel or multi-parallel corpus between a well-resourced language and one or more low-resource languages. This corpus is used to construct a multilingual representation of the words of the source and target languages. We use this multilingual representation to train our neural models, and we investigate both unidirectional and bidirectional RNN models. We also propose a method for including external information in the RNN (for instance, low-level information from Part-Of-Speech tags) in order to train higher-level taggers (for instance, SuperSense taggers and syntactic dependency parsers). We demonstrate the validity and genericity of our approach on several languages through experiments on three NLP tasks: Part-Of-Speech tagging, SuperSense tagging and dependency parsing. The results obtained are very satisfactory. Our approach has the following characteristics and advantages: (a) it does not use word alignment information; (b) it does not assume any knowledge about the target languages, the one requirement being that the source and target languages are not too syntactically divergent, which makes it applicable to a wide range of low-resource languages; (c) it provides genuinely multilingual taggers (one tagger for N languages).
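To illustrate how a parallel corpus can yield a shared word representation without word alignment, the minimal Python sketch below represents every source and target word by the sentence pairs in which it occurs. This encoding, the toy English-French corpus, and the names `build_occurrence_vectors` and `cosine` are assumptions made for illustration only; the abstract does not specify the exact representation used in the thesis.

```python
from math import sqrt


def build_occurrence_vectors(parallel_corpus):
    """Map each (language, word) pair to a binary vector over sentence pairs.

    Position i of a vector is 1 if the word occurs in sentence pair i.
    Hypothetical encoding, chosen for illustration; no word alignment is used.
    """
    n = len(parallel_corpus)
    vectors = {}
    for i, (src_sentence, tgt_sentence) in enumerate(parallel_corpus):
        for lang, sentence in (("src", src_sentence), ("tgt", tgt_sentence)):
            for word in sentence.lower().split():
                vec = vectors.setdefault((lang, word), [0] * n)
                vec[i] = 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy parallel corpus (illustrative data, not taken from the thesis).
corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]
vectors = build_occurrence_vectors(corpus)

# Words of both languages now live in the same vector space.
sim_cat_chat = cosine(vectors[("src", "cat")], vectors[("tgt", "chat")])
sim_cat_chien = cosine(vectors[("src", "cat")], vectors[("tgt", "chien")])
print(sim_cat_chat, sim_cat_chien)
```

In this toy space, "cat" and "chat" occur in the same sentence pairs and therefore receive identical vectors, while "cat" and "chien" never co-occur and have zero similarity. A tagger trained on such vectors for the source language could in principle be applied to target-language words, which is the general idea the abstract describes.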

Cited literature: 165 references

https://tel.archives-ouvertes.fr/tel-02173773
Contributor : Abes Star
Submitted on : Thursday, July 4, 2019 - 4:33:06 PM
Last modification on : Friday, October 25, 2019 - 1:24:42 AM

File

ZENNAKI_2019_diffusion.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02173773, version 1

Citation

Othman Zennaki. Construction automatique d'outils et de ressources linguistiques à partir de corpus parallèles. Linguistics. Université Grenoble Alpes, 2019. French. ⟨NNT : 2019GREAM006⟩. ⟨tel-02173773⟩
