Analyse automatique par transitions pour l'identification des expressions polylexicales [English: Transition-based analysis for the identification of multi-word expressions]

Abstract : This thesis addresses the identification of multi-word expressions using a transition-based system. A multi-word expression (MWE) is a linguistic construct composed of several elements whose combination is irregular at one or more linguistic levels. Identifying MWEs in context amounts to annotating their occurrences in texts, i.e. detecting the sets of tokens that form such occurrences. For example, in the sentence "This has nothing to do with the book", the tokens "has", "to", "do" and "with" would be marked as forming an occurrence of the MWE "have to do with". Transition-based analysis is a well-known NLP technique for building a structured output from a sequence of elements: a sequence of actions (called "transitions"), chosen from a predefined set, is applied to build the output structure incrementally. In this thesis, we propose a transition system dedicated to MWE identification within sentences represented as token sequences, and we study various architectures for the classifier that selects the transitions to apply in order to build the analysis of a sentence. The first variant of our system uses a linear support vector machine (SVM) classifier. The following variants use neural models: a simple multilayer perceptron (MLP), followed by variants integrating one or more recurrent layers. Our preferred scenario is MWE identification without syntactic information, even though the two tasks are known to be related. We further study a multitask approach that jointly performs morphosyntactic tagging, transition-based MWE identification and dependency parsing, so that the tasks benefit from one another. The thesis includes a substantial experimental part. First, we studied which resampling techniques allow good learning stability despite random initializations. Second, we proposed a method for tuning the hyperparameters of our models by trend analysis within a random search for a hyperparameter combination.
We build our systems under the constraint of using the same hyperparameter combination across different languages. We use data from the two PARSEME international shared tasks on verbal MWEs. Our variants produce very good results, including state-of-the-art scores for many languages in the PARSEME 1.0 and 1.1 datasets. One of the variants ranked first for most languages in the PARSEME 1.0 shared task. However, our models perform poorly on MWEs that were not seen at training time.
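
As an illustration of the transition-based idea described in the abstract, here is a minimal sketch in Python. It is not the thesis's actual transition system: the four transition names (SHIFT, REDUCE, MERGE, MARK), the stack-and-buffer layout and the hand-written transition sequence are all assumptions made for this example. In the thesis a trained classifier (SVM, MLP or recurrent model) would choose each transition; here a fixed sequence stands in for it.

```python
from collections import deque

def run_transitions(tokens, transitions):
    """Apply a sequence of transitions and return the MWEs found.

    Hypothetical transition set, for illustration only:
      SHIFT  - move the next token from the buffer onto the stack
      REDUCE - discard the top stack element (it is not part of an MWE)
      MERGE  - combine the two top stack elements into one unit
      MARK   - pop the top stack element and record it as an MWE
    """
    buffer = deque(range(len(tokens)))  # indices of tokens not yet read
    stack = []                          # each element is a list of token indices
    found = []                          # recorded MWEs, as lists of token indices
    for transition in transitions:
        if transition == "SHIFT":
            stack.append([buffer.popleft()])
        elif transition == "REDUCE":
            stack.pop()
        elif transition == "MERGE":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif transition == "MARK":
            found.append(stack.pop())
        else:
            raise ValueError(f"unknown transition: {transition}")
    return found

# Example sentence from the abstract.
tokens = "This has nothing to do with the book".split()

# Hand-written sequence standing in for the classifier's decisions.
oracle = [
    "SHIFT", "REDUCE",          # This    -> discarded
    "SHIFT",                    # has     -> kept on the stack
    "SHIFT", "REDUCE",          # nothing -> discarded (gap inside the MWE)
    "SHIFT", "MERGE",           # to      -> merged with "has"
    "SHIFT", "MERGE",           # do      -> merged
    "SHIFT", "MERGE",           # with    -> merged
    "MARK",                     # record has ... to do with
    "SHIFT", "REDUCE",          # the     -> discarded
    "SHIFT", "REDUCE",          # book    -> discarded
]

mwes = run_transitions(tokens, oracle)
for indices in mwes:
    print([tokens[i] for i in indices])
```

In this sketch, discarding "nothing" before merging is what allows the discontinuous occurrence has ... to do with to be assembled from non-adjacent tokens.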
Cited literature [151 references]

https://hal.univ-lorraine.fr/tel-02527921
Contributor : Thèses Ul
Submitted on : Wednesday, April 1, 2020 - 3:21:09 PM
Last modification on : Thursday, April 2, 2020 - 1:49:11 AM

File

DDOC_T_2019_0206_AL_SAIED.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-02527921, version 1

Citation

Hazem Al Saied. Analyse automatique par transitions pour l'identification des expressions polylexicales. Traitement du texte et du document. Université de Lorraine, 2019. Français. ⟨NNT : 2019LORR0206⟩. ⟨tel-02527921⟩

Metrics

Record views

79

Files downloads

135