Les ressources annotées, un enjeu pour l'analyse de contenu : vers une méthodologie de l'annotation manuelle de corpus

Abstract : Manual corpus annotation has become a key issue for Natural Language Processing (NLP), as manually annotated corpora are used both to create and to evaluate NLP tools. However, the process of manual annotation remains underdescribed and the tools used to support it are often misused. This situation prevents the campaign manager from evaluating and guaranteeing the quality of the annotation. In this work, we propose a unified vision of manual corpus annotation for NLP. It results from our experience of annotation campaigns, either as a manager or as a participant, as well as from collaborations with other researchers. We first propose a global methodology for managing manual corpus annotation campaigns that relies on two pillars: an organization for annotation campaigns that puts evaluation at the heart of the process, and an innovative grid for the analysis of the complexity dimensions of an annotation campaign. The second part of our work concerns the tools of the campaign manager. We evaluated the precise influence of automatic pre-annotation on the quality and speed of human correction, through a series of experiments on part-of-speech tagging for English. Furthermore, we propose practical solutions for the evaluation of manual annotations that provide the campaign manager with the means to select the most appropriate measures. Finally, we brought to light the processes and tools involved in an annotation campaign and instantiated the methodology that we described.

Cited literature : 180 references

Contributor : Karën Fort
Submitted on : Wednesday, July 3, 2013 - 3:06:01 PM
Last modification on : Friday, September 6, 2019 - 11:48:11 AM
Long-term archiving on : Friday, October 4, 2013 - 4:10:56 AM


  • HAL Id : tel-00797760, version 2



Karen Fort. Les ressources annotées, un enjeu pour l'analyse de contenu : vers une méthodologie de l'annotation manuelle de corpus. Document and Text Processing. Université Paris-Nord - Paris XIII, 2012. French. ⟨tel-00797760v2⟩


