
Large-vocabulary automatic speech recognition: from hybrid to End-to-End approaches

Abstract: At Linagora, the OpenPaasNG project was launched in 2015 for a period of four years with the purpose of building a new generation of collaborative platform providing a videoconferencing tool backed by artificial-intelligence technologies such as speech transcription, keyword extraction, and automatic meeting summarization. In this context, my industrial thesis tackled Automatic Speech Recognition (ASR) for virtual meetings, both in real time and offline. This thesis deals with acoustic modeling methods for ASR: the rise of sequential neural architectures has enabled major advances in the modeling and training of acoustic models. In this work I explored the two main families of ASR systems: hybrid and End-to-End systems.

The first part of this thesis concerns traditional approaches and is dedicated to the implementation of a large-vocabulary French ASR system for spontaneous speech, which has been deployed in several industrial use cases. A substantial effort of data collection, processing, and standardization was first carried out to reach the goal of 1000 hours of annotated speech. An evaluation of the acoustic, lexical, and linguistic components is proposed to refine the choice and orientation of hybrid DNN-HMM modeling for French. For this part, I proposed an industrial platform for adapting the hybrid components, called "LinSTT Model Factory", which allows the models to be adapted to their conditions of use, namely a particular acoustic context or a vocabulary specific to a target domain.

In the second part, I addressed the problem of transcribing speech automatically, directly from acoustic observations. To do so, I conducted an in-depth study of End-to-End ASR approaches: how can sequential alignments between audio and text be learned? What type of architecture should be used? And, above all, what type of output units should be chosen (character, word piece, word)?
I tried to answer these questions with a set of experiments on the TIMIT and LibriSpeech datasets. A large part of this work was conducted during a scientific stay at the Mila laboratory in Canada, where I actively contributed to the development of the open-source toolkit "SpeechBrain". In the third part, I report multi-task learning experiments using end-to-end systems in order to exploit several output representations, in our case characters and consonant/vowel categories. I proposed a new technique for combining these representations to improve recognition performance.
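The multi-task idea above can be illustrated with a minimal sketch. The code below is hypothetical (it is not the thesis implementation): it derives a coarse consonant/vowel (CV) auxiliary target sequence from a character target sequence, which is the kind of secondary output representation a second network head could be trained to predict alongside the character head; the function name `char_to_cv` and the loss weight `alpha` are assumptions for illustration.

```python
# Hypothetical sketch: build a consonant/vowel (CV) auxiliary target
# sequence from a character target sequence for multi-task ASR training.

VOWELS = set("aeiouy")

def char_to_cv(chars):
    """Map each character label to a coarse category:
    'V' for vowels, 'C' for consonants; spaces (word boundaries) are kept."""
    out = []
    for ch in chars:
        if ch == " ":
            out.append(" ")
        elif ch.lower() in VOWELS:
            out.append("V")
        else:
            out.append("C")
    return "".join(out)

print(char_to_cv("la parole"))  # -> "CV CVCVCV"
```

In a multi-task setup, a shared encoder would feed two output heads, one predicting characters and one predicting CV categories, trained with a weighted sum of the two losses, e.g. `loss = alpha * loss_char + (1 - alpha) * loss_cv` (the weight `alpha` being a tunable hyperparameter).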
Submitted on : Tuesday, March 22, 2022 - 4:17:21 PM


Version validated by the jury (STAR)


  • HAL Id: tel-03616588, version 1


Abdelwahab Heba. Reconnaissance automatique de la parole à large vocabulaire : des approches hybrides aux approches End-to-End. Intelligence artificielle [cs.AI]. Université Paul Sabatier - Toulouse III, 2021. Français. ⟨NNT : 2021TOU30116⟩. ⟨tel-03616588⟩


