
Un treebank pour le serbe : constitution et exploitations (A treebank for Serbian: construction and uses)

Abstract : When this PhD project began, no treebank was available for Serbian. Manually annotated treebanks are, however, an essential resource for training and evaluating statistical syntactic parsers. Efficient parsers, in turn, facilitate the annotation of large corpora, which can serve as a basis for research in theoretical linguistics. The lack of such resources for Serbian slows research in both directions and hinders the creation of digital resources for Serbian in general. To address this issue, we created a suite of NLP resources for Serbian. First, we built the ParCoTrain-Synt treebank, a 101,000-token corpus annotated with morphosyntactic tags, lemmas and syntactic dependencies. We also built the ParCoLex lexicon, which contains 7 million entries for 157,000 distinct lemmas. Using these two resources, we trained models for parsing, morphosyntactic tagging and lemmatisation. All of the above resources are available at the following address : https: // We also used these resources in two experiments in Serbian linguistics, demonstrating that the ParCoTrain-Synt treebank is well suited to empirical studies based on quantitative data analysis.
Keywords : Treebank, Serbian, Parsing

Cited literature : 336 references
Contributor : Abes Star
Submitted on : Thursday, May 28, 2020 - 12:10:09 PM
Last modification on : Friday, July 2, 2021 - 5:36:02 PM


  • HAL Id : tel-02639473, version 1



Aleksandra Miletic. Un treebank pour le serbe : constitution et exploitations. Linguistique. Université Toulouse le Mirail - Toulouse II, 2018. Français. ⟨NNT : 2018TOU20030⟩. ⟨tel-02639473⟩


