M. Bendris, D. Charlet, and G. Chollet, Introduction of quality measures in audio-visual identity verification, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
DOI : 10.1109/ICASSP.2009.4959983

M. Bendris, Introduction of indexing people problematic in TV-Content, Seminar on Information, Signal, Images et Vision : Indexation scalable et Cross Media, 2009.

M. Bendris, D. Charlet, and G. Chollet, Talking Faces indexing in TV-Content, International Workshop on Content-Based Multimedia Indexing (CBMI), 2010.

M. Bendris, D. Charlet, and G. Chollet, Lip activity detection for talking faces classification in TV-Content, International Conference on Machine Vision (ICMV), Hong Kong, 2010.

J. Carrive, J. Razik, M. Bendris, S. Vanni, L. Rigouste et al., Technologies d'indexation pour la valorisation du patrimoine audiovisuel, 2011.

M. Bendris, D. Charlet, and G. Chollet, People indexing in TV-Content using lip-activity and unsupervised audio-visual identity verification, International Workshop on Content-Based Multimedia Indexing (CBMI), p.167, 2011.


P. Aarabi, The fusion of distributed microphone arrays for sound localization, EURASIP Journal on Applied Signal Processing, vol.4, pp.338-347, 2003.

. Acosta, An automatic face detection and recognition system for video indexing applications, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.3644-3647, 2002.

O. Arandjelovic and A. Zisserman, Automatic Face Recognition for Film Character Retrieval in Feature-Length Films, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.860-867, 2005.
DOI : 10.1109/CVPR.2005.81

K. Bailly and M. Milgram, Head pan angle estimation by a nonlinear regression on selected features, 2009 16th IEEE International Conference on Image Processing (ICIP), 2009.
DOI : 10.1109/ICIP.2009.5414310

. Bailly-Baillière, The BANCA Database and Evaluation Protocol, 4th International Conference on Audio- and Video-Based Biometric Person Authentication, pp.625-638, 2003.
DOI : 10.1007/3-540-44887-X_74

. Blouet, BECARS: a free software for speaker verification, ODYSSEY - The Speaker and Language Recognition Workshop, pp.145-148, 2004.

. Boccignone, Foveated shot detection for video segmentation, IEEE Transactions on Circuits and Systems for Video Technology, pp.365-377, 2005.
DOI : 10.1109/TCSVT.2004.842603

J. S. Boreczky and L. A. Rowe, Comparison of video shot boundary detection techniques, Journal of Electronic Imaging, vol.5, issue.2, p.122, 1996.
DOI : 10.1117/12.238675

. Bourel, Robust Facial Feature Tracking, Proceedings of the British Machine Vision Conference 2000, pp.232-241, 2000.
DOI : 10.5244/C.14.24
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.144.2333

B. Zhou and J. Hansen, Efficient audio stream segmentation via the combined T^2 statistic and Bayesian information criterion, IEEE Transactions on Speech and Audio Processing, vol.13, issue.4, pp.467-474, 2005.
DOI : 10.1109/TSA.2005.845790

H. Bredin, Vérification de l'identité d'un visage parlant : apport de la mesure de synchronie audiovisuelle face aux tentatives délibérées d'imposture, 2007.

. Bregler, Improving acoustic speaker verification with visual Body-Language features, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.1909-1912, 2009.
DOI : 10.1109/ICASSP.2009.4959982
URL : http://cims.nyu.edu/%7Ebregler/ICASSP09/bregler_icassp09.pdf

. Cernekova, Video shot segmentation using singular value decomposition, International Conference on Multimedia and Expo, pp.301-304, 2003.
DOI : 10.1109/icme.2003.1221613
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.220.2309

S. S. Chen and P. S. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp.127-132

. Chiang, A novel method for detecting lips, eyes and faces in real time, Real-Time Imaging, vol.9, issue.4, pp.277-287, 2003.
DOI : 10.1016/j.rti.2003.08.003

. Comaniciu, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.25, issue.5, pp.564-577, 2003.
DOI : 10.1109/TPAMI.2003.1195991

. Cootes, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.23, issue.6, pp.681-685, 2001.
DOI : 10.1109/34.927467

. Cootes, Active shape models-their training and application. Computer Vision and Image Understanding, pp.38-59, 1995.
DOI : 10.1006/cviu.1995.1004
URL : https://www.escholar.manchester.ac.uk/api/datastream?publicationPid=uk-ac-man-scw:1d1862&datastreamId=POST-PEER-REVIEW-PUBLISHERS.PDF

. Cour, Learning from ambiguously labeled images, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.919-926, 2009.
DOI : 10.1109/CVPR.2009.5206667
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.1111

Y. Dai and Y. Nakano, Face-texture model based on SGLD and its application in face detection in a color scene, Pattern Recognition, vol.29, issue.6, pp.1007-1017, 1996.
DOI : 10.1016/0031-3203(95)00139-5

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.886-893, 2005.
DOI : 10.1109/CVPR.2005.177
URL : https://hal.archives-ouvertes.fr/inria-00548512

. Dempster, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B (Methodological), vol.39, pp.1-38, 1977.

S. Duffner and C. Garcia, A connexionist approach for robust and precise facial feature detection in complex scenes, Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA 2005), 2005.
DOI : 10.1109/ISPA.2005.195430

S. Dupont and J. Luettin, Audio-visual speech modeling for continuous speech recognition, IEEE Transactions on Multimedia, vol.2, issue.3, 2000.
DOI : 10.1109/6046.865479

"Hello! My name is... Buffy" - Automatic naming of characters in TV video, The British Machine Vision Conference (BMVC), pp.1-10

. Fasel, A generative framework for real time object detection and classification, Computer Vision and Image Understanding, vol.98, issue.1, pp.182-210, 2005.
DOI : 10.1016/j.cviu.2004.07.014

. Féraud, A fast and accurate face detector based on neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.23, issue.1, pp.42-53, 2001.
DOI : 10.1109/34.899945

. Ferrari, Progressive search space reduction for human pose estimation, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2008.
DOI : 10.1109/CVPR.2008.4587468

. Fierrez-Aguilar, Discriminative multimodal biometric authentication based on quality measures, Pattern Recognition, vol.38, issue.5, pp.777-779, 2005.
DOI : 10.1016/j.patcog.2004.11.012

. Galliano, The ESTER2 evaluation campaign for the rich transcription of French radio broadcast, 10th Annual Conference of the International Speech Communication Association (Interspeech), 2009.

C. Garcia and M. Delakis, A neural architecture for fast and robust face detection, Object recognition supported by user interaction for service robots, pp.44-47, 2002.

. Garofolo, The NIST Meeting Room Pilot Corpus, 4th Conference on Language Resources and Evaluation (LREC), 2004.

. Genoud, Combining methods to improve speaker verification decision, Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), pp.1756-1759, 1996.
DOI : 10.1109/ICSLP.1996.607968
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4619

. Geoffrois, Corpus description of the ESTER evaluation campaign for the rich transcription of French broadcast news, 5th International Conference on Language Resources and Evaluation (LREC), 2006.

J. Goldberger and S. Roweis, Hierarchical clustering of a mixture model, Advances in Neural Information Processing Systems, pp.505-512, 2004.

D. L. Hall and J. Llinas, An introduction to multisensor data fusion, Proceedings of the IEEE, vol.85, issue.1, pp.6-23, 1997.
DOI : 10.1109/5.554205

. Heckmann, A hybrid ANN/HMM audio-visual speech recognition system, International Conference on Auditory-Visual Speech Processing (AVSP), 2001.

. Heracleous, Exploiting multimodal data fusion in robust speech recognition, 2010 IEEE International Conference on Multimedia and Expo, 2010.
DOI : 10.1109/ICME.2010.5583086
URL : https://hal.archives-ouvertes.fr/hal-00508288

G. Jaffré, Indexation de la vidéo par le costume, 2005.

G. Jaffre and P. Joly, Costume: A new feature for automatic video content indexing, Coupling approaches, coupling media and coupling languages for information retrieval, pp.314-325, 2004.

S. C. Johnson, Hierarchical clustering schemes, Psychometrika, vol.58, issue.4, pp.241-254, 1967.
DOI : 10.1007/BF02289588

E. E. Khoury, Unsupervised Video Indexing based on Audiovisual Characterization of Persons, 2010.
URL : https://hal.archives-ouvertes.fr/tel-00515424

. Khoury, Face-and-clothing based people clustering in video content, International Conference on Multimedia Information Retrieval, pp.295-304, 2010.

M. T. Knox and G. Friedland, Multimodal speaker diarization using oriented optical flow histograms, International Conference of the International Speech Communication Association (Interspeech), pp.290-293, 2010.

. Kryszczuk, Error handling in multimodal biometric systems using reliability measures, 13th European Signal Processing (EUSIPCO), pp.4-8, 2005.

. Li, Bimodal speaker identification using dynamic Bayesian network, Advances in Biometric Person Authentication, pp.1-24, 2005.

. Li, Multi-modal biometric verification based on far-score normalization, International Journal of Computer Science and Network Security (IJCSNS), vol.8, pp.250-254, 2008.

. Luhong, Face detection based on template matching and neural network verification, International Conference on Image, 2000.

I. Matthews and S. Baker, Active Appearance Models Revisited, International Journal of Computer Vision, vol.60, issue.2, pp.135-164, 2004.
DOI : 10.1023/B:VISI.0000029666.37597.d3
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8544

. McKenna, Modelling facial colour and identity with Gaussian mixtures, Pattern Recognition, vol.31, issue.12, pp.1883-1892, 1998.
DOI : 10.1016/S0031-3203(98)00066-1

. Meignier, E-HMM approach for learning and adapting sound models for speaker indexing, A Speaker Odyssey - The Speaker Recognition Workshop, pp.175-180, 2001.
URL : https://hal.archives-ouvertes.fr/hal-01434656

. Messer, XM2VTSDB : The Extended M2VTS Database, Second International Conference on Audio and Video-based Biometric Person Authentication, pp.72-77, 1999.

S. Milborrow and F. Nicolls, Locating Facial Features with an Extended Active Shape Model, 10th European Conference on Computer Vision : Part IV, pp.504-513, 2008.
DOI : 10.1007/978-3-540-88693-8_37

. Monaci, Learning Multimodal Dictionaries, IEEE Transactions on Image Processing, vol.16, issue.9, pp.538-545, 2006.
DOI : 10.1109/TIP.2007.901813
URL : https://hal.archives-ouvertes.fr/inria-00544772

. Nefian, A Bayesian approach to audio-visual speaker identification, 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), pp.761-769, 2003.

A. S. Ogale and Y. Aloimonos, Shape and the Stereo Correspondence Problem, International Journal of Computer Vision, vol.1, issue.3, pp.147-162, 2005.
DOI : 10.1007/s11263-005-3672-3

. Petrovska-Delacrétaz, Guide to biometric reference systems and performance evaluation, 2009.
DOI : 10.1007/978-1-84800-292-0

. Phillips, Overview of the multiple biometrics grand challenge, Third International Conference on Advances in Biometrics (ICB), pp.705-714, 2009.

N. Poh and S. Bengio, Improving fusion with margin-derived confidence in biometric authentication tasks, Audio- and Video-Based Biometric Person Authentication (AVBPA), pp.474-483, 2005.

. Poh, Quality controlled multimodal fusion of biometric experts, 12th Iberoamerican Congress on Pattern Recognition (CIARP). Viña del Mar, pp.881-890, 2007.

V. Radova and J. Psutka, An approach to speaker identification using multiple classifiers, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.1135-1138, 1997.
DOI : 10.1109/ICASSP.1997.596142

D. A. Reynolds and P. A. Torres-Carrasquillo, Approaches and Applications of Audio Diarization, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp.953-956, 2005.
DOI : 10.1109/ICASSP.2005.1416463

. Richiardi, Confidence and reliability measures in speaker verification, International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.574-595, 2006.

. Rúa, Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models, Pattern Analysis and Applications, vol.12, pp.271-284, 2008.

. Saenko, Visual speech recognition with loosely synchronized feature streams, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, pp.1424-1431, 2005.
DOI : 10.1109/ICCV.2005.251
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.1928

C. Sanderson and K. K. Paliwal, Identity verification using speech and face information, Digital Signal Processing, vol.14, issue.5, pp.449-480, 2004.
DOI : 10.1016/j.dsp.2004.05.001
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.1287

B. Shahraray, Scene change detection and content-based sampling of video sequences. Digital video compression : algorithms and technologies, pp.2-13, 1995.

. Siegler, Automatic segmentation, classification and clustering of broadcast news audio. DARPA Speech Recognition Workshop, pp.97-99, 1997.

P. L. Silsbee and A. C. Bovik, Computer lipreading for improved accuracy in automatic speech recognition, IEEE Transactions on Speech and Audio Processing, vol.4, issue.5, pp.337-351, 1996.
DOI : 10.1109/89.536928

. Smeaton, Video shot boundary detection: Seven years of TRECVid activity, Computer Vision and Image Understanding, vol.114, issue.4, pp.411-418, 2010.
DOI : 10.1016/j.cviu.2009.03.011

S. Spors and R. Rabenstein, A real-time face tracker for color video, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), pp.1493-1496, 2001.
DOI : 10.1109/ICASSP.2001.941214

S. E. Johnson and P. C. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, 5th International Conference on Spoken Language Processing (ICSLP), pp.1775-1779, 1998.

M. Turk and A. Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, vol.10, issue.9, pp.71-86, 1991.
DOI : 10.1007/BF00239352

. Vajaria, Exploring co-occurrence between speech and body movement for audio-guided video localization, IEEE Transactions on Circuits and Systems for Video Technology, pp.1608-1617, 2008.

. Vallet, Robust visual features for the multimodal identification of unregistered speakers in TV talk-shows, 2010 IEEE International Conference on Image Processing, 2010.
DOI : 10.1109/ICIP.2010.5653393

. Verlinde, Multi-modal identity verification using expert fusion, Information Fusion, vol.1, issue.1, pp.17-33, 2000.
DOI : 10.1016/S1566-2535(00)00002-6
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.7260

P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, pp.511-518, 2001.
DOI : 10.1109/CVPR.2001.990517

J. H. Ward, Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, vol.58, issue.301, pp.236-244, 1963.
DOI : 10.1007/BF02289263

. Zhang, Automatic partitioning of full-motion video, Multimedia Systems, vol.1, issue.1, pp.10-28, 1993.
DOI : 10.1007/BF01210504

. Zhu, Combining speaker identification and BIC for speaker diarization, International Speech Communication Association (Interspeech), 2005.
URL : https://hal.archives-ouvertes.fr/hal-01434281

Z. Zhu, Mosaic-based 3D scene representation and rendering, 11th International Conference on Image Processing, pp.739-754, 2005.
DOI : 10.1016/j.image.2006.08.002
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.3470

The development and improvement of the Internet have made a large amount of television content available to users. To make it easier to navigate among these videos, it is worthwhile to develop technologies that index people automatically. Current solutions propose building the audio index [...]

Unfortunately, [...] the visual index and their association (interactive dialogue, variations in face pose, asynchrony between speech and appearance, etc.). Approaches based on fusing the audio and visual indexes combine the indexing errors of each modality. The work presented in this report exploits the complementarity between audio and visual information to compensate for the weaknesses of each modality. Thus, one modality can support the indexing of a person when the [...]

In order to automatically detect the presence [...], we developed a new lip-motion detection method based on measuring the degree of disorder in the direction of pixels around the lip region. The evaluation, carried out on a corpus of TV studio shows, shows a significant improvement in talking-face detection compared with the state of the art in this context. In particular, our method proves more robust to global face motion. Finally, we proposed two correction schemes. The first is based on systematically modifying the modality considered a priori the least reliable. The second compares unsupervised identity verification scores to determine which modality failed and correct it.
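
The abstract does not specify how the "degree of disorder" of pixel directions is computed. As a rough illustration only, the sketch below assumes it can be approximated by the normalized Shannon entropy of a histogram of motion directions inside the lip region. The function name `direction_entropy`, the bin count, the magnitude threshold, and the use of per-pixel motion vectors as input are all assumptions made for this example, not details taken from the thesis.

```python
import math


def direction_entropy(flow_vectors, n_bins=8, min_magnitude=1e-3):
    """Normalized entropy (0..1) of motion directions in a region.

    flow_vectors: iterable of (dx, dy) motion vectors, e.g. one per pixel
    of the lip region (hypothetical input; any motion estimator could be used).
    """
    counts = [0] * n_bins
    for dx, dy in flow_vectors:
        if math.hypot(dx, dy) < min_magnitude:
            continue  # skip near-static pixels, their direction is noise
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # Clamp guards against angle rounding up to exactly 2*pi.
        idx = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
        counts[idx] += 1

    total = sum(counts)
    if total == 0:
        return 0.0  # no motion at all

    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log(p)
    return entropy / math.log(n_bins)


# Coherent motion (e.g. the whole face translating) gives low disorder.
global_motion = [(1.0, 0.0)] * 10

# Motion spread evenly over all directions gives maximal disorder.
scattered = [
    (math.cos((k + 0.5) * math.pi / 4), math.sin((k + 0.5) * math.pi / 4))
    for k in range(8)
]

low = direction_entropy(global_motion)
high = direction_entropy(scattered)
```

Under this assumption, a purely global face movement yields coherent directions and a score near 0, while directions scattered around the mouth yield a score near 1, which is consistent with the robustness to global face motion claimed in the abstract.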