Contributions to audio source separation and content description

Emmanuel Vincent¹
¹ METISS – Speech and sound data modeling and processing, IRISA – Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract: Audio data occupy a central position in our lives, whether for spoken communication, personal videos, radio and television, music, cinema, video games, or live entertainment. This raises a range of application needs, from signal enhancement to information retrieval, including content repurposing and interactive manipulation. Real-world audio data exhibit a complex structure due to the superposition of several sound sources and the coexistence of several layers of information. For instance, speech recordings often include concurrent speakers or background noise, and they carry information about the speaker identity, the language and topic of the discussion, the uttered text, the intonation, and the acoustic environment. Music recordings likewise typically consist of several musical instruments or voices, and they carry information about the composer, the temporal organization of the music, the underlying score, the performer's interpretation, and the acoustic environment.

When I started my PhD in 2001, the separation of the source signals in a given recording was considered one of the greatest challenges towards successfully applying audio processing techniques originally designed for single-source data to real-world data. Fixed and adaptive beamforming techniques for target signal enhancement were already established, but they required a large number of microphones, which is rarely available in practice. Blind source separation techniques designed for smaller numbers of microphones had only just started to be applied to audio. Eleven years later, much progress has been made and source separation has become a mature topic. Thanks in particular to some of the contributions listed in this document, the METISS team has gained a leading reputation in the field, as exemplified by a growing number of technology transfer collaborations aiming to enhance and remix speech and music signals in various use cases.
The use of source separation as a pre-processing step for the description of individual speech or music sources within a mixture raises the additional challenge of efficiently dealing with nonlinear distortions of the estimated source signals. Robust methods interfacing source separation, feature extraction, and classification have emerged in the last ten years based on the idea of uncertainty propagation. This topic was part of my research program when I joined Inria in 2006, and it is currently undergoing major growth due to the ubiquity of speech applications on hand-held devices. Current methods have not yet reached the robustness of the human auditory system, though, and speech or speaker recognition in real-world non-stationary noise environments remains a very challenging problem.

Compared with the above two challenges, joint processing of the multiple layers of information underlying audio signals has attracted less interest to date. It remains, however, a fundamental problem for music processing in particular, where tasks such as polyphonic pitch transcription and chord identification are typically performed independently of each other, without accounting for the strong links between pitch and chord information.

My work has focused on these three challenges and builds in particular on the theoretical foundations of Bayesian modeling and estimation on the one hand, and sparse modeling and convex optimization on the other. This document provides an overview of my contributions since the end of my PhD along four axes: Chapter 1 is devoted to the formalization and diagnostic assessment of the problems under study, Chapter 2 to linear modeling of audio signals and associated algorithms, Chapter 3 to variance modeling of audio signals and associated algorithms, and Chapter 4 to the description of multisource and multilayer contents. Chapter 5 summarizes the research perspectives arising from this work.
Document type: Habilitation à diriger des recherches
Contributor: Emmanuel Vincent
Submitted on: Wednesday, September 18, 2013


HAL Id: tel-00758517, version 2


Emmanuel Vincent. Contributions to audio source separation and content description. Signal and Image processing. Université Rennes 1, 2012. ⟨tel-00758517v2⟩


