Skip to Main content Skip to Navigation

Learning representations of speech from the raw waveform

Abstract : While deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modeling, the input to such systems are still fixed, handcrafted, spectral features such as mel-filterbanks. This contrasts with computer vision, in which a deep neural network is now trained on raw pixels. Mel-filterbanks contain valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, as any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech with the rest of the model, rather than using fixed features, could push the state-of-the art even further. We first explore a weakly-supervised setting and show that a single neural network can learn to separate phonetic information and speaker identity from mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform provides significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks, that can be directly used in replacement of these features. In the second part of this thesis we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and then learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform melfilterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Fixed speech features being also used for non-linguistic classification tasks for which they are even less optimal, we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that it significantly improves over mel-filterbanks or low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems.
Complete list of metadata

Cited literature [302 references]  Display  Hide  Download
Contributor : ABES STAR :  Contact
Submitted on : Wednesday, September 4, 2019 - 2:51:06 PM
Last modification on : Friday, June 24, 2022 - 3:15:38 AM
Long-term archiving on: : Thursday, February 6, 2020 - 8:54:08 AM


Version validated by the jury (STAR)


  • HAL Id : tel-02278616, version 1



Neil Zeghidour. Learning representations of speech from the raw waveform. Machine Learning [cs.LG]. Université Paris sciences et lettres, 2019. English. ⟨NNT : 2019PSLEE004⟩. ⟨tel-02278616⟩



Record views


Files downloads