A Data-driven Approach to Natural Language Processing for Contemporary and Historical French

Abstract : In recent years, neural methods for Natural Language Processing (NLP) have repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist of taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data needed to pre-train them. The lack of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states. In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of NLP tasks for contemporary, medieval and early modern French, as well as for six other contemporary languages. Furthermore, we determine not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but also that these three features are better predictors of the pre-trained models' performance in downstream tasks than the pre-training data size itself. In fact, we determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size.
Contributor : ABES STAR
Submitted on : Wednesday, September 7, 2022 - 10:21:18 AM
Last modification on : Friday, September 9, 2022 - 3:49:40 AM


Version validated by the jury (STAR)


Distributed under a Creative Commons Attribution 4.0 International License


  • HAL Id : tel-03770337, version 2


Pedro Ortiz Suarez. A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. Document and Text Processing. Sorbonne Université, 2022. English. ⟨NNT : 2022SORUS155⟩. ⟨tel-03770337v2⟩


