
Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit

Abstract: In this thesis we explore robust statistical syntax analysis for French. Our main concern is to find methods whereby the linguist can inject linguistic knowledge and/or resources into the robust statistical engine in order to improve results for specific phenomena. We first explore the dependency annotation schema for French, concentrating on certain phenomena. Next, we look into the various algorithms capable of producing this annotation, and in particular the transition-based parsing algorithm used in the rest of this thesis. After exploring supervised machine learning algorithms for NLP classification problems, we present the Talismane toolkit for syntax analysis, built within the framework of this thesis, including four statistical modules - sentence boundary detection, tokenisation, POS tagging and parsing - as well as the various linguistic resources used for the baseline model, including corpora, lexicons and feature sets. Our first experiments test various machine learning configurations in order to identify the best baseline. We then look into improvements made possible by a beam search and beam propagation. Following this, we present a series of experiments aimed at correcting errors related to specific linguistic phenomena, using targeted features. One of our innovations is the introduction of rules that can impose or prohibit certain decisions locally, thus bypassing the statistical model. We explore the usage of rules for errors that the features are unable to correct. Finally, we look into the enhancement of targeted features by large-scale linguistic resources, and in particular a semi-supervised approach using a distributional semantic resource.

Cited literature: 98 references

https://tel.archives-ouvertes.fr/tel-00979681
Contributor : Assaf Urieli
Submitted on : Wednesday, April 16, 2014 - 8:10:02 PM
Last modification on : Wednesday, October 14, 2020 - 3:43:58 AM
Long-term archiving on : Monday, April 10, 2017 - 2:31:17 PM

Identifiers

  • HAL Id : tel-00979681, version 1

Citation

Assaf Urieli. Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. Computation and Language [cs.CL]. Université Toulouse le Mirail - Toulouse II, 2013. English. ⟨tel-00979681⟩

Metrics

  • Record views : 682
  • File downloads : 1780