Unsupervised Person Identification in TV Broadcasts

Abstract: In this thesis, we propose several methods for unsupervised person identification in TV broadcasts using the names written on screen. Biometric models cannot be used to recognize people in large video collections without a priori knowledge of who appears in these videos. For this reason, several state-of-the-art methods use other sources of information to obtain the names of the people present. These methods mainly rely on pronounced names. However, this source cannot be fully trusted, both because of transcription and name-detection errors and because it is difficult to determine to whom a pronounced name refers. Names written on screen in TV broadcasts have not been used in the past because they were difficult to extract from low-quality videos. Recent years, however, have seen improvements in video quality and in the integration of overlaid text. In this thesis, we therefore re-evaluate the use of this source of names. We first developed LOOV (LIG Overlaid OCR in Video), a tool that extracts text overlaid on video. With this tool, we obtained a very low character error rate, which allows us to place high confidence in this source of names. We then compared written names and pronounced names in their ability to provide the names of the people present in TV broadcasts. We found that, when both are extracted automatically, twice as many people can be named using written names as using pronounced names. Another important point is that associating a name with a person is inherently easier for written names than for pronounced names. With this excellent source of names, we were able to develop several unsupervised methods for naming people in TV broadcasts. We started with late naming methods, in which names are propagated onto speaker clusters. Each of these methods questions the choices made during the diarization process in a different way.
We then proposed two methods (integrated naming and early naming) that incorporate more information from written names into the diarization process. To identify the people appearing on screen, we adapted the early naming method to face clusters. Finally, we showed that this method also works for multimodal speaker-face clusters. With this last method, which names speech turns and faces in a single process, we obtain scores comparable to those of the best systems that took part in the first REPERE evaluation.
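To make the idea of late naming concrete, the sketch below propagates on-screen names onto speaker clusters using a simple rule: each cluster receives the written name whose display time overlaps its speech turns the longest, and clusters that never co-occur with a written name stay unnamed. This is a minimal sketch under stated assumptions, not the thesis's actual method: the data model (time spans in seconds, cluster identifiers, name strings), the function names `overlap` and `late_naming`, and the example names are all hypothetical, and the rule ignores the constraints on diarization choices that the thesis's methods examine (for instance, it allows the same name to be given to several clusters).

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical data model: a segment is a (start, end) time span in seconds.
Segment = Tuple[float, float]


def overlap(a: Segment, b: Segment) -> float:
    """Duration (in seconds) shared by two time spans; 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def late_naming(
    clusters: Dict[str, List[Segment]],
    written_names: List[Tuple[str, Segment]],
) -> Dict[str, Optional[str]]:
    """Give each speaker cluster the written name it co-occurs with the longest.

    Simplified, hypothetical rule: the score of a name for a cluster is the total
    time its on-screen occurrences overlap the cluster's speech turns. Clusters
    whose best score is zero (or with no written names at all) stay None.
    """
    naming: Dict[str, Optional[str]] = {}
    for cluster_id, turns in clusters.items():
        scores: Dict[str, float] = defaultdict(float)
        for name, shown in written_names:
            for turn in turns:
                scores[name] += overlap(turn, shown)
        best = max(scores, key=scores.get, default=None)
        naming[cluster_id] = best if best is not None and scores[best] > 0 else None
    return naming


# Usage with invented speaker clusters and invented on-screen names.
clusters = {
    "S1": [(0.0, 10.0), (30.0, 40.0)],
    "S2": [(10.0, 30.0)],
    "S3": [(50.0, 55.0)],
}
written = [
    ("Anne Martin", (2.0, 8.0)),
    ("Paul Durand", (12.0, 20.0)),
    ("Anne Martin", (32.0, 35.0)),
]
print(late_naming(clusters, written))
# {'S1': 'Anne Martin', 'S2': 'Paul Durand', 'S3': None}
```

Here `S1` overlaps "Anne Martin" for 6 s plus 3 s, `S2` overlaps "Paul Durand" for 8 s, and `S3` never co-occurs with a written name, so it is left unnamed. Scoring by co-occurrence duration is only one plausible choice; a real system would also have to handle conflicts between clusters and errors in the diarization itself.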

Cited literature: 99 references
Contributor: Michel Vacher
Submitted on: Thursday, May 22, 2014, 3:59:26 PM
Last modification on: Thursday, October 11, 2018, 8:48:02 AM
Long-term archiving on: Friday, August 22, 2014, 10:45:30 AM


  • HAL Id: tel-00958774, version 1


Johann Poignant. Identification non-supervisée de personnes dans les flux télévisés. Other [cs.OH]. Université de Grenoble, 2013. French. ⟨NNT : 2013GRENM053⟩. ⟨tel-00958774⟩


