Les collections volumineuses de documents audiovisuels : segmentation et regroupement en locuteurs

Abstract : Speaker diarization, as defined by NIST, treats the recordings of a corpus independently: each recording is processed separately, and the overall error rate is a weighted average. In this context, detected speakers are identified by anonymous labels specific to each recording, so a speaker appearing in several recordings receives a different label in each of them. Yet this situation is very common in broadcast news data, where hosts, journalists and other guests may appear recurrently. Consequently, speaker diarization has recently been considered in a broader context, where recurring speakers must be uniquely identified across all the recordings that compose a corpus. This generalization of the speaker partitioning problem goes hand in hand with the emergence of the concept of collection, which refers, in the context of speaker diarization, to a set of recordings sharing one or more common characteristics.

The work proposed in this thesis concerns speaker clustering of large audiovisual collections (several tens of hours of recordings). The main objective is to propose (or adapt) clustering approaches that efficiently process large volumes of data while detecting recurrent speakers. The effectiveness of the proposed approaches is discussed from two points of view: first, the quality of the produced clustering (in terms of error rate), and second, the time required to perform the process. For this purpose, we propose two architectures designed to perform cross-show speaker diarization on collections of recordings. We propose a simplifying approach that decomposes a large clustering problem into several independent sub-problems. These sub-problems are solved with either of two clustering approaches, which take advantage of recent advances in speaker modeling.

Contributor : Abes Star
Submitted on : Wednesday, January 20, 2016 - 5:23:05 PM
Last modification on : Tuesday, December 19, 2017 - 3:11:52 AM
Long-term archiving on : Thursday, April 21, 2016 - 11:17:06 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01259649, version 1



Grégor Dupuy. Les collections volumineuses de documents audiovisuels : segmentation et regroupement en locuteurs. Informatique et langage [cs.CL]. Université du Maine, 2015. Français. ⟨NNT : 2015LEMA1006⟩. ⟨tel-01259649⟩


