Theses

Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge

Abstract : As a result of the recent blossoming of Machine Learning techniques, the Natural Language Processing field faces an increasingly thorny bottleneck: the most efficient algorithms rely entirely on the availability of large training data. These technological advances consequently remain out of reach for the world's 7,000 languages, most of which are low-resourced. One way to bypass this limitation is cross-lingual transfer, whereby resources available in another (source) language are leveraged to help build accurate systems in the desired (target) language. However, despite promising results in research settings, the standard transfer techniques lack the flexibility regarding cross-lingual resources that real-world scenarios require: they cannot readily exploit very sparse resources or assorted arrays of resources. This limitation strongly diminishes the applicability of the approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the context of transition-based dependency parsing. To this end, a new transfer framework is designed, with a cascading architecture: it enables the desired combination while ensuring that each resource is exploited in a better-targeted way, down to the level of the word. Empirical evaluation does dampen the enthusiasm for the purely cross-lingual approach (in general, it remains preferable to annotate just a few target sentences), but it also highlights the complementarity of that approach with others. Several metrics are developed to precisely characterize cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored.
The whole study relies on a series of technical improvements to the parsing framework: this work includes the release of a new open-source software package, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies.

Cited literature : 333 references

https://tel.archives-ouvertes.fr/tel-01838834
Contributor : Abes Star
Submitted on : Friday, July 13, 2018 - 4:20:05 PM
Last modification on : Thursday, October 14, 2021 - 9:18:41 AM
Long-term archiving on : Monday, October 15, 2018 - 8:21:19 PM

File

75576_AUFRANT_2018_archivage.p...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01838834, version 1

Citation

Lauriane Aufrant. Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge. Document and Text Processing. Université Paris Saclay (COmUE), 2018. English. ⟨NNT : 2018SACLS089⟩. ⟨tel-01838834⟩

Metrics

Record views : 901
Files downloads : 1259