Learning representations of speech from the raw waveform

Abstract: While deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modeling, the input to such systems is still fixed, handcrafted spectral features such as mel-filterbanks. This contrasts with computer vision, where deep neural networks are now trained on raw pixels. Mel-filterbanks encode valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, like any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech jointly with the rest of the model, rather than using fixed features, could push the state of the art even further. We first explore a weakly supervised setting and show that a single neural network can learn to separate phonetic information from speaker identity, starting from either mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform yields significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks that can be used as a drop-in replacement for these features. In the second part of this thesis, we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and is then learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform mel-filterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Since fixed speech features are also used for non-linguistic classification tasks, for which they are even less optimal, we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that they significantly improve over mel-filterbanks and low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems.
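The abstract describes the architecture only at a high level. As a minimal illustrative sketch (not the thesis's code), a learnable time-domain filterbank of the kind described can be written in PyTorch as a trainable bandpass convolution over the raw waveform, a squared-modulus nonlinearity, and a decimating lowpass convolution, all trained with the rest of the model. The class name, filter sizes, and log1p compression below are hypothetical choices for illustration:

```python
import torch
import torch.nn as nn

class TimeDomainFilterbank(nn.Module):
    """Learnable filterbank over raw audio: bandpass conv -> squared
    modulus -> decimating lowpass conv -> log compression."""

    def __init__(self, n_filters=40, kernel_size=401, stride=160):
        super().__init__()
        self.n_filters = n_filters
        # Bandpass stage: a real and an imaginary component for each of
        # the n_filters bands, hence 2 * n_filters output channels. These
        # weights are the learnable analogue of mel filters and could be
        # initialized from a Gabor approximation of mel-filterbanks
        # instead of randomly.
        self.bandpass = nn.Conv1d(1, 2 * n_filters, kernel_size,
                                  stride=1, padding=kernel_size // 2,
                                  bias=False)
        # Lowpass stage: one depthwise (grouped) filter per band, whose
        # stride decimates the signal down to a frame rate, playing the
        # role of the windowed averaging in mel-filterbanks.
        self.lowpass = nn.Conv1d(n_filters, n_filters, kernel_size,
                                 stride=stride, padding=kernel_size // 2,
                                 groups=n_filters, bias=False)

    def forward(self, waveform):
        # waveform: (batch, 1, time) raw audio samples.
        x = self.bandpass(waveform)
        real, imag = x[:, :self.n_filters], x[:, self.n_filters:]
        x = real ** 2 + imag ** 2            # squared modulus per band
        x = self.lowpass(x)                  # smooth and decimate
        return torch.log1p(torch.abs(x))     # compress dynamic range
```

Under these assumptions, TimeDomainFilterbank()(torch.randn(1, 1, 16000)) returns a (1, 40, 100) tensor for one second of 16 kHz audio, i.e. 100 frames of 40 coefficients, matching the 10 ms frame rate of standard mel-filterbank front-ends.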

https://tel.archives-ouvertes.fr/tel-02278616
Contributor: Abes Star
Submitted on: Wednesday, September 4, 2019 - 2:51:06 PM
Last modified on: Thursday, September 5, 2019 - 9:35:08 AM

File

Zeghidour-2016-These.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id: tel-02278616, version 1

Citation

Neil Zeghidour. Learning representations of speech from the raw waveform. Machine Learning [cs.LG]. PSL Research University, 2019. English. ⟨NNT : 2019PSLEE004⟩. ⟨tel-02278616⟩
