R. André-Obrecht, A new statistical approach for the automatic segmentation of continuous speech signals, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.36, issue.1, pp.29-40, 1988.
DOI : 10.1109/29.1486

P. K. Atrey, M. A. Hossain, A. Saddik, and M. S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimedia Systems, vol.24, issue.11, pp.345-379, 2010.
DOI : 10.1007/s00530-010-0182-0

P. K. Atrey, N. C. Maddage, and M. S. Kankanhalli, Audio Based Event Detection for Multimedia Surveillance, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, pp.14-19, 2006.
DOI : 10.1109/ICASSP.2006.1661400

S. Baghdadi, Extraction Multimodale de Métadonnées de Séquences Videos dans un Cadre Bayésien, 2010.

Y. Baveye, F. Urban, C. Chamaret, V. Demoulin, and P. Hellier, Saliency-Guided Consistent Color Harmonization, Proceedings of the 4th Computational Color Imaging Workshop, pp.105-118, 2013.
DOI : 10.1007/978-3-642-36700-7_9

J. P. Bello, C. Duxbury, M. Davies, and M. Sandler, On the Use of Phase and Energy for Musical Onset Detection in the Complex Domain, IEEE Signal Processing Letters, vol.11, issue.6, pp.553-556, 2004.
DOI : 10.1109/LSP.2004.827951

K. P. Bennett and C. Campbell, Support vector machines, ACM SIGKDD Explorations Newsletter, vol.2, issue.2, pp.1-13, 2000.
DOI : 10.1145/380995.380999

J. Bilmes, Dynamic Bayesian Multinets, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp.38-45, 2000.

C. M. Bishop and M. E. Tipping, A hierarchical latent variable model for data visualization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, issue.3, pp.281-293, 1998.
DOI : 10.1109/34.667885

J. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher et al., ALIZE/SpkDet : A State-of-the-Art Open Source Software for Speaker Recognition, Proceedings of Odyssey : the Speaker and Language Recognition Workshop, 2008.

L. Breiman, Random Forests, Machine Learning, vol.45, issue.1, pp.5-32, 2001.
DOI : 10.1023/A:1010933404324

D. Brezeale, Using Closed Captions and Visual Features to Classify Movies by Genre, Proceedings of the 7th International Workshop on Multimedia Data Mining, 2006.

D. Brezeale and D. J. Cook, Automatic Video Classification: A Survey of the Literature, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol.38, issue.3, pp.416-430, 2008.
DOI : 10.1109/TSMCC.2008.919173

M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini, and A. Abad, Detecting Audio Events for Semantic Video Search, InterSpeech, 2009.

C. J. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, vol.2, issue.2, pp.121-167, 1998.
DOI : 10.1023/A:1009715923555

J. J. Burred, Genetic motif discovery applied to audio analysis, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
DOI : 10.1109/ICASSP.2012.6287891

R. Cai, L. Lu, H. Zhang, and L. Cai, Highlight Sound Effects Detection in Audio Stream, Proceedings of the IEEE International Conference on Multimedia and Expo, pp.37-40, 2003.

C. Canton-Ferrer, T. Butko, C. Segura, X. Giro, C. Nadeu et al., Audiovisual event detection towards scene understanding, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp.81-88, 2009.
DOI : 10.1109/CVPRW.2009.5204264

C. Chang and C. Lin, LIBSVM, ACM Transactions on Intelligent Systems and Technology, vol.2, issue.3, pp.27-28, 2011.
DOI : 10.1145/1961189.1961199

E. Charniak, Bayesian Networks Without Tears : Making Bayesian Networks More Accessible to the Probabilistically Unsophisticated, Artificial Intelligence Magazine, vol.12, issue.4, pp.50-63, 1991.

C. Chen, A. Abdallah, and W. Wolf, Audiovisual Gunshot Event Recognition, 2006 IEEE International Conference on Systems, Man and Cybernetics, pp.4807-4812, 2006.
DOI : 10.1109/ICSMC.2006.385066

L. Chen, H. Hsu, L. Wang, and C. Su, Violence Detection in Movies, 2011 Eighth International Conference Computer Graphics, Imaging and Visualization, pp.119-124, 2011.
DOI : 10.1109/CGIV.2011.14

L. Chen, C. Su, C. Weng, and H. M. Liao, Action Scene Detection with Support Vector Machines, Journal of Multimedia, vol.4, issue.4, pp.248-253, 2009.
DOI : 10.4304/jmm.4.4.248-253

P. Chen, R. Fan, and C. Lin, A Study on SMO-Type Decomposition Methods for Support Vector Machines, IEEE Transactions on Neural Networks, vol.17, issue.4, pp.893-908, 2006.
DOI : 10.1109/TNN.2006.875973

Y. Chen, L. Zhang, B. Lin, Y. Xu, X. Ren et al., Fighting Detection Based on Optical Flow Context Histogram Learning Bayesian Belief Network Classifiers : Algorithms and System, Innovations in Bio-inspired Computing and Applications (IBICA) Second International Conference on Proceedings of 14th Biennial conference of the Canadian Society on Computational Studies of Intelligence : Advances in Artificial Intelligence, pp.95-98, 2001.

J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu, Learning Bayesian networks from data: An information-theory based approach, Artificial Intelligence, vol.137, issue.1-2, pp.43-90, 2002.
DOI : 10.1016/S0004-3702(02)00191-1

URL : http://doi.org/10.1016/s0004-3702(02)00191-1

D. M. Chickering, Learning Equivalence Classes of Bayesian-Network Structures, Journal of Machine Learning Research, vol.2, pp.445-498, 2002.

M. L. Chin and J. J. Burred, Audio Event Detection Based on Layered Symbolic Sequence Representations, Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing, 2012.

V. Claveau, Acquisition Automatique de Lexiques Sémantiques pour la Recherche d'Information, 2003.

C. Clavel, T. Ehrette, and G. Richard, Events Detection for an Audio-Based Surveillance System, 2005 IEEE International Conference on Multimedia and Expo, pp.1306-1309, 2005.
DOI : 10.1109/ICME.2005.1521669

C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, Fear-type emotion recognition for future audio-based surveillance systems, Speech Communication, pp.487-503, 2008.
DOI : 10.1016/j.specom.2008.03.012

URL : https://hal.archives-ouvertes.fr/hal-00499211

G. F. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning, pp.309-347, 1992.
DOI : 10.1007/BF00994110

M. Cristani, M. Bicego, and V. Murino, Audio-Visual Event Recognition in Surveillance Video Sequences, IEEE Transactions on Multimedia, vol.9, issue.2, pp.257-267, 2007.
DOI : 10.1109/TMM.2006.886263

N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, On Kernel Target Alignment, Advances in Neural Information Processing Systems 14, pp.367-373, 2002.
DOI : 10.1007/3-540-33486-6_8

A. Datta, M. Shah, and N. Da Vitoria Lobo, Person-on-person violence detection in video data, Object recognition supported by user interaction for service robots, 2002.
DOI : 10.1109/ICPR.2002.1044748

R. Datta, D. Joshi, J. Li, and J. Z. Wang, Studying Aesthetics in Photographic Images Using a Computational Approach, Proceedings of the International Conference on Computer Vision, pp.288-301, 2006.
DOI : 10.1007/11744078_23

R. Davis, B. Buchanan, and E. Shortliffe, Production rules as a representation for a knowledge-based consultation program, Artificial Intelligence, vol.8, issue.1, pp.15-45, 1977.
DOI : 10.1016/0004-3702(77)90003-0

F. D. De Souza, G. C. Chávez, E. A. Valle, and A. de A. Araujo, Violence Detection in Video Using Spatio-Temporal Features, 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images, pp.224-230, 2010.
DOI : 10.1109/SIBGRAPI.2010.38

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-End Factor Analysis for Speaker Verification, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.4, pp.788-798, 2011.
DOI : 10.1109/TASL.2010.2064307

M. Delakis, Multimodal Tennis Video Structure Analysis with Segment Models, 2006.
DOI : 10.1049/ic.2005.0709

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.321.8775

C. Demarty, C. Penet, G. Gravier, and M. Soleymani, A Benchmarking Campaign for the Multimodal Detection of Violent Scenes in Movies, Proceedings of the ECCV Workshop on Information Fusion in Computer Vision for Concept Recognition, 2012.
DOI : 10.1007/978-3-642-33885-4_42

URL : https://hal.archives-ouvertes.fr/hal-00767036

C. Demarty, C. Penet, G. Gravier, and M. Soleymani, The MediaEval 2012 Affect Task : Violent Scenes Detection, Proceedings of the MediaEval 2012 Workshop, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00757577

P. Q. Dinh, C. Dorai, and S. Venkatesh, Video Genre Categorization Using Audio Wavelet Coefficients, Proceedings of the 5th Asian Conference on Computer Vision, 2002.

G. Elidan and N. Friedman, Learning the Dimensionality of Hidden Variables, Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001.

G. Elidan and N. Friedman, Learning Hidden Variable Networks : The Information Bottleneck Approach, Journal of Machine Learning Research, vol.6, pp.81-127, 2005.

G. Elidan, N. Lotner, N. Friedman, and D. Koller, Discovering Hidden Variables : A Structure-Based Approach, Neural Information Processing Systems, pp.479-485, 2001.

S. Essid, Classification Automatique des Signaux Audio-Fréquences : Reconnaissance des Instruments de Musique, 2005.

B. Fernando, E. Fromont, D. Muselet, and M. Sebban, Supervised learning of Gaussian mixture models for visual vocabulary generation, Pattern Recognition, vol.45, issue.2, pp.897-907, 2012.
DOI : 10.1016/j.patcog.2011.07.021

M. Fradet, Contribution à la Segmentation de Séquences d'Images au Sens du Mouvement dans un Contexte Semi-Automatique, 2010.

N. Friedman, K. Murphy, and S. Russell, Learning the Structure of Dynamic Probabilistic Networks, Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp.139-147, 1998.

T. Giannakopoulos, D. I. Kosmopoulos, A. Aristidou, and S. Theodoridis, Violence Content Classification Using Audio Features, Proceedings of the 4th Hellenic Conference on Artificial Intelligence, pp.502-507, 2006.
DOI : 10.1023/A:1013241718521

URL : http://doi.org/10.1007/11752912_55

T. Giannakopoulos, D. I. Kosmopoulos, A. Aristidou, and S. Theodoridis, A Multi-Class Audio Classification Method With Respect To Violent Content In Movies Using Bayesian Networks Audio-Visual Fusion for Detecting Violent Scenes in Videos, Proceedings of the 9th IEEE Workshop on Multimedia Signal Processing Artificial Intelligence : Theories, Models and Applications, Lecture Notes in Computer Science, pp.90-93, 2007.

Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao, Detecting Violent Scenes in Movies by Auditory and Visual Cues, Proceedings of the 9th Pacific Rim Conference on Multimedia : Advances in Multimedia Information Processing, pp.317-326, 2008.
DOI : 10.1109/TSA.2005.860344

G. Gravier, C. Demarty, S. Baghdadi, and P. Gros, Classification-oriented structure learning in Bayesian networks for multimodal event detection in videos, Multimedia Tools and Applications, vol.7, issue.4, pp.1-17, 2012.
DOI : 10.1007/s11042-012-1169-y

URL : https://hal.archives-ouvertes.fr/hal-00712589

D. Grossman and P. Domingos, Learning Bayesian network classifiers by maximizing conditional likelihood, Twenty-first international conference on Machine learning, ICML '04, 2004.
DOI : 10.1145/1015330.1015339

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.134.4637

N. Haering, R. J. Qian, and M. I. Sezan, A semantic event-detection approach and its application to detecting hunts in wildlife video, IEEE Transactions on Circuits and Systems for Video Technology, pp.857-868, 2000.
DOI : 10.1109/76.867923

T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, Sound Event Detection in Multisource Environments Using Source Separation, Workshop on Machine Listening in Multisource Environments, 2011.

B. Ionescu, J. Schlüter, I. Mironică, and M. Schedl, A Naïve Mid-level Concept-based Fusion Approach to Violence Detection in Hollywood Movies, Proceedings of the ACM International Conference on Multimedia Retrieval, 2013.

G. Irie, T. Satou, A. Kojima, T. Yamasaki, and K. Aizawa, Affective Audio-Visual Words and Latent Topic Driving Model for Realizing Movie Affective Scene Classification, IEEE Transactions on Multimedia, vol.12, issue.6, pp.523-535, 2010.
DOI : 10.1109/TMM.2010.2051871

R. S. Jasinschi, N. Dimitrova, T. McGee, L. Agnihotri, J. Zimmerman et al., A probabilistic layered framework for integrating multimedia content and context information, IEEE International Conference on Acoustics Speech and Signal Processing, pp.2057-2060, 2002.
DOI : 10.1109/ICASSP.2002.5745038

H. Jégou, M. Douze, and C. Schmid, On the burstiness of visual elements, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206609

H. Jégou, M. Douze, and C. Schmid, Product Quantization for Nearest Neighbor Search, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.1, pp.117-128, 2011.
DOI : 10.1109/TPAMI.2010.57

Y. Jiang, Q. Dai, C. C. Tan, X. Xue, and C. Ngo, The Shanghai-Hongkong Team at MediaEval2012 : Violent Scene Detection Using Trajectory-based Features, Proceedings of the MediaEval 2012 Workshop, 2012.

T. Joachims, Estimating the Generalization Performance of a SVM Efficiently, 1999.

T. Joachims, Training linear SVMs in linear time, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '06, 2006.
DOI : 10.1145/1150402.1150429

C. Joder, S. Essid, and G. Richard, Temporal Integration for Audio Classification With Application to Musical Instrument Classification, IEEE Transactions on Audio, Speech, and Language Processing, vol.17, issue.1, pp.174-186, 2009.
DOI : 10.1109/TASL.2008.2007613

K. S. Jones, A Statistical Interpretation of Term Specificity and Its Application in Retrieval, Journal of Documentation, vol.28, issue.1, pp.11-21, 1972.
DOI : 10.1108/eb026526

P. Kenny, Joint Factor Analysis of Speaker and Session Variability : Theory and Algorithms, CRIM, 2006.

P. Kenny, G. Boulianne, and P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing, vol.13, issue.3, pp.345-354, 2005.
DOI : 10.1109/TSA.2004.840940

P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, Factor Analysis Simplified, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp.637-640, 2005.
DOI : 10.1109/ICASSP.2005.1415194

P. Kenny and P. Dumouchel, Experiments in Speaker Verification using Factor Analysis Likelihood Ratios, Proceedings of Odyssey : the Speaker and Language Recognition Workshop, pp.219-226, 2004.

E. Kijak, Structuration Multimodale des Vidéos de Sports par Modèles Stochastiques, 2003.

T. Kocka and N. L. Zhang, Dimension Correction for Hierarchical Latent Class Models, Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp.267-274, 2002.

E. G. Krug, J. A. Mercy, L. L. Dahlberg, and A. B. Zwi, The World Report on Violence and Health, The Lancet, pp.1083-1088, 2002.

A. Kumar, P. Dighe, R. Singh, S. Chaudhuri, and B. Raj, Audio Event Detection From Acoustic Unit Occurrence Pattern, Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing, 2012.
DOI : 10.1109/icassp.2012.6287923

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.220.1766

L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, IEEE Transactions on Neural Networks, vol.18, issue.3, 2004.
DOI : 10.1109/TNN.2007.897478

H. Langseth and T. D. Nielsen, Latent Classification Models, Machine Learning, vol.6, issue.1, pp.237-265, 2005.
DOI : 10.1007/s10994-005-0472-5

H. Langseth and T. D. Nielsen, Classification using Hierarchical Naïve Bayes models, Machine Learning, vol.30, issue.3, pp.135-159, 2005.
DOI : 10.1007/s10994-006-6136-2

L. Li, A Novel Violent Videos Classification Scheme Based on the Bag of Audio Words Features, Proceedings of the 9th International Conference on Information Technology : New Generations, pp.7-13, 2012.
DOI : 10.1142/S1469026812500101

S. Z. Li and G. Guo, Content-Based Audio Classification and Retrieval Using SVM Learning, Proceedings of IEEE International Conference on Multimedia and Expo, 2000.

T. Li, M. Ogihara, and Q. Li, A comparative study on content-based music genre classification, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03, pp.282-289, 2003.
DOI : 10.1145/860435.860487

J. Lin, Y. Sun, and W. Wang, Violence Detection in Movies with Auditory and Visual Cues, 2010 International Conference on Computational Intelligence and Security, pp.561-565, 2010.
DOI : 10.1109/CIS.2010.128

J. Lin and W. Wang, Weakly-Supervised Violence Detection in Movies with Audio and Video Based Co-training, Proceedings of the 10th Pacific-Rim Conference on Multimedia, pp.930-935, 2009.
DOI : 10.1007/978-3-642-10467-1_84

J. Lin, Divergence measures based on the Shannon entropy, IEEE Transactions on Information Theory, vol.37, issue.1, pp.145-151, 1991.
DOI : 10.1109/18.61115

Y. Liu, W. Zhao, C. Ngo, C. Xu, and H. Lu, Coherent bag-of audio words model for efficient large-scale video copy detection, Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '10, pp.89-96, 2010.
DOI : 10.1145/1816041.1816057

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60, issue.2, pp.91-110, 2004.
DOI : 10.1023/B:VISI.0000029664.99615.94

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.4931

P. Lucas, Restricted Bayesian Network Structure Learning, Advances in Bayesian Networks, Studies in Fuzziness and Soft Computing, pp.217-232, 2002.
DOI : 10.1007/978-3-540-39879-0_12

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1240

D. Matrouf, N. Scheffer, B. Fauve, and J. Bonastre, A Straightforward and Efficient Implementation of the Factor Analysis Model for Speaker Verification, Proceedings of Interspeech, pp.1242-1245, 2007.
URL : https://hal.archives-ouvertes.fr/hal-01318480

D. Matrouf, F. Verdet, M. Rouvier, J. Bonastre, and G. Linarès, Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification, Computer Speech & Language, vol.25, issue.3, pp.481-498, 2011.
DOI : 10.1016/j.csl.2010.11.001

URL : https://hal.archives-ouvertes.fr/hal-01318503

M. F. McKinney and J. Breebaart, Features for Audio and Music Classification, Proceedings of the International Society for Music Information Retrieval, 2003.

S. Moncrieff, C. Dorai, and S. Venkatesh, Affect computing in film through sound energy dynamics, Proceedings of the ninth ACM international conference on Multimedia, MULTIMEDIA '01, pp.525-527, 2001.
DOI : 10.1145/500141.500231

K. P. Murphy, The Bayes Net Toolbox for Matlab, Journal of Computing Science and Statistics, vol.33, 2001.

J. Nam, M. Alghoniemy, and A. H. Tewfik, Audio-visual content-based violent scene characterization, Proceedings of the 1998 International Conference on Image Processing (ICIP98), pp.353-357, 1998.
DOI : 10.1109/ICIP.1998.723496

E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, Violence Detection in Video Using Computer Vision Techniques, Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns, pp.332-339, 2011.

G. T. Papadopoulos, V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, Combining multimodal and temporal contextual information for semantic video analysis, 2009 16th IEEE International Conference on Image Processing (ICIP), pp.4325-4328, 2009.
DOI : 10.1109/ICIP.2009.5413673

G. Pass, R. Zabih, and J. Miller, Comparing images using color coherence vectors, Proceedings of the fourth ACM international conference on Multimedia, MULTIMEDIA '96, pp.65-73, 1996.
DOI : 10.1145/244130.244148

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.9596

J. Pearl, Causality : Models, Reasoning, and Inference, 2000.
DOI : 10.1017/CBO9780511803161

T. Pellegrini, J. Portelo, I. Trancoso, A. Abad, and M. Bugalho, Hierarchical Clustering Experiments for Application to Audio Event Detection, Proceedings of the 13th International Conference on Speech and Computer, 2009.

C. Penet, C. Demarty, G. Gravier, and P. Gros, De la Détection d'Évènements Sonores Violents par SVM dans les Films, 2011.

C. Penet, C. Demarty, G. Gravier, and P. Gros, Multimodal information fusion and temporal integration for violence detection in movies, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
DOI : 10.1109/ICASSP.2012.6288397

URL : https://hal.archives-ouvertes.fr/hal-00671016

C. Penet, C. Demarty, G. Gravier, and P. Gros, Audio event detection in movies using multiple audio words and contextual Bayesian networks, 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI), 2013.
DOI : 10.1109/CBMI.2013.6576546

URL : https://hal.archives-ouvertes.fr/hal-00822022

T. Perperis, T. Giannakopoulos, A. Makris, D. I. Kosmopoulos, S. Tsekeridou et al., Multimodal and ontology-based fusion approaches of audio and visual processing for violence detection in movies, Expert Systems with Applications, vol.38, issue.11, pp.14102-14116, 2011.
DOI : 10.1016/j.eswa.2011.04.219

A. Pikrakis, T. Giannakopoulos, and S. Theodoridis, Gunshot detection in audio streams from movies by means of dynamic programming and Bayesian networks, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.21-24, 2008.
DOI : 10.1109/ICASSP.2008.4517536

J. Portelo, M. Bugalho, I. Trancoso, J. Neto, A. Abad et al., Non-speech audio event detection, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.1973-1976, 2009.
DOI : 10.1109/ICASSP.2009.4959998

M. Ramona and G. Richard, Segmentation Parole/Musique Par Machines à Vecteurs de Support; Comparison of Different Strategies for a SVM-Based Audio Segmentation, Proceedings of the European Conference on Signal Processing, 2008.

E. Ravelli, G. Richard, and L. Daudet, Audio Signal Representations for Indexing in the Transform Domain, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.3, pp.434-446, 2010.
DOI : 10.1109/TASL.2009.2025099

G. Richard, M. Ramona, and S. Essid, Combined Supervised and Unsupervised Approaches for Automatic Segmentation of Radiophonic Audio Streams, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07, pp.46-464, 2007.
DOI : 10.1109/ICASSP.2007.366272

M. Rouvier, D. Matrouf, and G. Linarès, Factor Analysis for Audio-Based Video Genre Classification, Proceedings of Interspeech, 2009.

J. Saunders, Real-time discrimination of broadcast speech/music, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, pp.993-996, 1996.
DOI : 10.1109/ICASSP.1996.543290

B. Schiele and J. L. Crowley, Object recognition using multidimensional receptive field histograms, Proceedings of the European Conference in Computer Vision, pp.610-619, 1996.
DOI : 10.1007/BFb0015571

URL : https://hal.archives-ouvertes.fr/tel-00004962

J. Schlüter, B. Ionescu, I. Mironică, and M. Schedl, ARF @ MediaEval 2012 : An Uninformed Approach to Violence Detection in Hollywood Movies, Proceedings of the MediaEval 2012 Workshop, 2012.

C. G. Snoek and M. Worring, Multimodal Video Indexing: A Review of the State-of-the-art, Multimedia Tools and Applications, vol.25, issue.1, pp.5-35, 2003.
DOI : 10.1023/B:MTAP.0000046380.27575.a5

M. Soleymani, M. Pantic, and T. Pun, Multi-Modal Emotion Recognition in Response to Videos, IEEE Transactions on Affective Computing, vol.1, pp.211-223, 2012.

P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2001.
DOI : 10.1007/978-1-4612-2748-9

J. Sun, X. Wu, S. Yan, L.-F. Cheong et al., Hierarchical Spatio-Temporal Context Modeling for Action Recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.2004-2011, 2009.

I. Trancoso, T. Pellegrini, J. Portelo, H. Meinedo, M. Bugalho et al., Audio contributions to semantic video search, 2009 IEEE International Conference on Multimedia and Expo, pp.630-633, 2009.
DOI : 10.1109/ICME.2009.5202575

I. Trancoso, J. Portelo, M. Bugalho, J. P. Neto, and A. J. Serralheiro, Training Audio Events Detectors with a Sound Effects Corpus, Proceedings of Interspeech, pp.2546-2549, 2008.

C. Vair, D. Colibro, F. Castaldo, E. Dalmasso, and P. Laface, Channel Factors Compensation in Model and Feature Domain for Speaker Recognition, 2006 IEEE Odyssey, The Speaker and Language Recognition Workshop, pp.1-6, 2006.
DOI : 10.1109/ODYSSEY.2006.248117

F. Vallet, S. Essid, J. Carrive, and G. Richard, Robust visual features for the multimodal identification of unregistered speakers in TV talk-shows, 2010 IEEE International Conference on Image Processing, 2010.
DOI : 10.1109/ICIP.2010.5653393

N. Vasconcelos and A. Lippman, Towards semantically meaningful feature spaces for the characterization of video content, Proceedings of International Conference on Image Processing, pp.25-28, 1997.
DOI : 10.1109/ICIP.1997.647375

J. Vendrig and M. Worring, Interactive Adaptive Movie Annotation, Proceedings of the IEEE International Conference on Multimedia and Expo, pp.93-96, 2002.

R. Vogt and S. Sridharan, Explicit modelling of session variability for speaker verification, Computer Speech & Language, vol.22, issue.1, pp.17-38, 2008.
DOI : 10.1016/j.csl.2007.05.003

S. Wang, S. Jiang, Q. Huang, and W. Gao, Shot classification for action movies based on motion characteristics, 2008 15th IEEE International Conference on Image Processing, pp.2508-2511, 2008.
DOI : 10.1109/ICIP.2008.4712303

Y. Wang, Z. Liu, and J. Huang, Multimedia content analysis-using both audio and visual clues, IEEE Signal Processing Magazine, vol.17, issue.6, pp.12-36, 2000.
DOI : 10.1109/79.888862

U. Westermann and R. Jain, Toward a Common Event Model for Multimedia Applications, IEEE Multimedia, vol.14, issue.1, pp.19-29, 2007.
DOI : 10.1109/MMUL.2007.23

Y. Wu, E. Y. Chang, K. C. Chang, and J. R. Smith, Optimal multimodal fusion for multimedia data analysis, Proceedings of the 12th annual ACM international conference on Multimedia, MULTIMEDIA '04, pp.572-579, 2004.
DOI : 10.1145/1027527.1027665

L. Xie, H. Sundaram, and M. Campbell, Event Mining in Multimedia Streams, Proceedings of the IEEE, pp.623-647, 2008.

J. Yang, Y. Jiang, A. G. Hauptmann, and C. Ngo, Evaluating bag-of-visual-words representations in scene classification, Proceedings of the international workshop on multimedia information retrieval, MIR '07, pp.197-206, 2007.
DOI : 10.1145/1290082.1290111

A. Yoshitaka and M. Miyake, Scene detection by audio-visual features, IEEE International Conference on Multimedia and Expo (ICME 2001), pp.48-51, 2001.
DOI : 10.1109/ICME.2001.1237652

N. L. Zhang, Hierarchical Latent Class Models for Cluster Analysis, Proceedings of the 18th National Conference on Artificial intelligence, pp.230-237, 2002.

N. L. Zhang, T. D. Nielsen, and F. V. Jensen, Latent variable discovery in classification models, Artificial Intelligence in Medicine, vol.30, issue.3, pp.283-299, 2004.
DOI : 10.1016/j.artmed.2003.11.004

X. Zou, O. Wu, Q. Wang, W. Hu, and J. Yang, Multi-modal Based Violent Movies Detection in Video Sharing Sites, Intelligent Science and Intelligent Data Engineering, pp.347-355, 2013.
DOI : 10.1007/978-3-642-36669-7_43

J. Fleureau, C. Penet, P. Guillotel, and C. Demarty, Electrodermal activity applied to violent scenes impact measurement and user profiling, 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2012.
DOI : 10.1109/ICSMC.2012.6378302

C. Penet, C. Demarty, G. Gravier, and P. Gros, Audio Event Detection using Multiple Audio Words and Contextual Network, CBMI - 11th International Workshop on Content-Based Multimedia Indexing, 2013. [Best Paper Award]
DOI : 10.1109/cbmi.2013.6576546

URL : https://hal.inria.fr/hal-00822022/file/CBMI2013_CedricPENET_CameraReady.pdf

C. Demarty, C. Penet, M. Schedl, B. Ionescu, V. L. Quang et al., The MediaEval 2013 Affect Task : Violent Scenes Detection, MediaEval 2013 Workshop, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00932551

C. Demarty, C. Penet, G. Gravier, and M. Soleymani, The MediaEval 2011 Affect Task : Violent Scenes Detection in Hollywood Movies, MediaEval 2011 Workshop, 2011.

C. Demarty, C. Penet, G. Gravier, and M. Soleymani, A Benchmarking Campaign for the Multimodal Detection of Violent Scenes in Movies, ECCV 2012 Workshop on Information Fusion in Computer Vision for Concept Recognition, 2012.
DOI : 10.1007/978-3-642-33885-4_42

URL : https://hal.archives-ouvertes.fr/hal-00767036

C. Demarty, C. Penet, G. Gravier, and M. Soleymani, The MediaEval 2012 Affect Task : Violent Scenes Detection, MediaEval 2012 Workshop, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00757577

C. Penet, C. Demarty, G. Gravier, and P. Gros, Technicolor and INRIA/IRISA at MediaEval 2011 : learning temporal modality integration with Bayesian Networks, MediaEval 2011 Workshop, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00643645

C. Penet, C. Demarty, G. Gravier, and P. Gros, Technicolor/INRIA Team at the MediaEval 2013 Violent Scenes Detection Task, MediaEval 2013 Workshop, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00906300

C. Penet, C. Demarty, M. Soleymani, G. Gravier, and P. Gros, Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene Detection Task, MediaEval 2012 Workshop, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00757584

C. Penet, C. Demarty, G. Gravier, and P. Gros, De la détection d'évènements sonores violents par SVM dans les films, ORASIS - Congrès des jeunes chercheurs en vision par ordinateur.

F. Botta, P. Gallardo, C. Penet, and C. Demarty, Method for setting a watching level for an audiovisual content, 2013.

M. Fradet, A. Newson, and C. Penet, Method for processing an audiovisual content and corresponding device, 2013.

Classic process. Dashed arrows correspond to optional steps (e.g., the integration step after the characterization step) or to the different possible outputs of a step (e.g., the content can either be segmented before feature extraction or go directly to the feature extraction phase). Solid arrows correspond to a mandatory relation between two steps (e.g., if the content is segmented, feature extraction automatically follows).

Jensen-Shannon divergence between the samples of the movies and those of the training set. Crosses correspond to movies in the training set, triangles to movies in the test set. For each movie, there is one point per class present.

Results obtained on the test movies for the system based on the TF-IDF representation. Column P corresponds to precision; the other columns correspond to recall and to the MediaEval cost.

Teams that crossed the finish line in 2012. The team marked with a star is a team made up of the task organizers (our team).