Audio-Visual Indexing of People in a Television Context

Abstract: With increasing internet use, the amount of multimedia content is multiplying, making it necessary to develop technologies that enable users to browse multimedia data. One key element for browsing is the presence of people. However, structuring TV content in terms of people is a hard problem, owing to many difficulties in the audio and visual modalities as well as in their association: short speaker turns, variations in facial expression and pose, and no synchronization between the sequences in which a person appears and those in which he or she speaks. The goal of this dissertation is to structure TV content by person, so that users can navigate to the sequences in which a particular individual appears. To this end, most methods index people separately using the audio and the visual information, and then associate the results of each to obtain a talking-face index. Unfortunately, this type of approach combines the clustering errors present in each modality.

Our work seeks to capitalize on the interactions between the audio and visual modalities rather than treating them separately. We propose a mutual correction scheme for audio and visual clustering errors: first, clustering errors are detected using indicators that flag a suspected talking-face presence (step 1); then, the incorrect label is corrected according to an automatic modification scheme (step 2).

In more detail, we first propose a baseline talking-face indexing system in which the audio and visual indexes of people are generated independently, by speaker clustering and by clothing clustering. We then propose a fusion method based on maximizing the global coverage of the detected clusters. Results on a TV-show database show high precision (90%), but with a significant missed-detection rate (only 55% of talking-face sequences are detected).

To automatically detect the presence of a talking face (step 1), we exploit the fact that lip activity is strongly related to speech activity. We develop a new method for lip-activity detection in a TV context, based on the disorder of pixel motion directions. An evaluation on manually annotated TV shows yields a significant improvement over the state of the art in TV contexts.

The modification method (step 2) rests on the assumption that one modality (either audio or visual) is more reliable than the other. We propose two modification schemes: the first systematically corrects the modality assumed a priori to be less reliable, while the second compares the scores of unsupervised audio-visual models to determine which modality failed. The unsupervised models are trained on the homogeneous sets of talking faces obtained automatically by the baseline system. Experiments on a TV-show database show that the proposed correction schemes yield a significant improvement in performance, mainly due to a large reduction in missed talking faces.

We also investigated late-fusion techniques for identity verification in biometric systems, and proposed a fusion method based on the signal quality of each modality.
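The key idea behind step 1 is that disordered pixel motion in the mouth region signals lip activity, and hence a likely talking face. The abstract does not give the exact features or thresholds used in the thesis, so the following is a minimal sketch of that idea, assuming Farneback dense optical flow over grayscale mouth-region crops, a 16-bin histogram of flow directions, and Shannon entropy as the disorder measure; the motion threshold (0.5 px) and minimum pixel count are hypothetical illustration values, not the dissertation's settings.

import cv2
import numpy as np

def lip_activity_score(mouth_frames):
    """Disorder of optical-flow directions in the mouth region.

    mouth_frames: sequence of same-sized grayscale (uint8) mouth crops
    from consecutive video frames. Returns a scalar score; higher
    values suggest lip activity (and hence a talking face).
    """
    entropies = []
    for prev, nxt in zip(mouth_frames, mouth_frames[1:]):
        # Dense optical flow between consecutive mouth crops.
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=9,
            iterations=3, poly_n=5, poly_sigma=1.1, flags=0)
        dx, dy = flow[..., 0], flow[..., 1]
        mag = np.hypot(dx, dy)
        ang = np.arctan2(dy, dx)      # motion direction of each pixel
        moving = mag > 0.5            # ignore near-static pixels (hypothetical threshold)
        if moving.sum() < 10:         # too little motion to measure disorder
            entropies.append(0.0)
            continue
        # Histogram of motion directions; its entropy measures disorder.
        hist, _ = np.histogram(ang[moving], bins=16, range=(-np.pi, np.pi))
        p = hist / hist.sum()
        p = p[p > 0]
        entropies.append(float(-(p * np.log2(p)).sum()))
    return float(np.mean(entropies)) if entropies else 0.0

In this reading, a face track whose averaged score stays high during a speech segment would be labeled the talking face, while a low score on a segment where someone is speaking is the kind of indicator that flags a suspected audio-visual clustering error for correction in step 2.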

https://pastel.archives-ouvertes.fr/pastel-00661662
Contributor: Meriem Bendris
Submitted on: Friday, January 20, 2012

Identifiers

  • HAL Id: pastel-00661662, version 1

Citation

Meriem Bendris. Indexation audio-visuelle des personnes dans un contexte de télévision. Traitement du signal et de l'image [eess.SP]. Télécom ParisTech, 2011. Français. ⟨pastel-00661662⟩
