H. Akaike, Information theory and an extension of the maximum likelihood principle, Second International Symposium on Information Theory, pp.267-281, 1973.

P. Allen, Integrating Vision and Touch for Object Recognition Tasks, Multisensor Integration and Fusion for Intelligent Machines and Systems, pp.407-440, 1995.
DOI : 10.1177/027836498800700603

URL : http://academiccommons.columbia.edu/download/fedora_content/download/ac:141204/CONTENT/CUCS-240-86.pdf

T. Anastasio, P. E. Patton, and K. E. Belkacem-Boussaid, Using Bayes' Rule to Model Multisensory Enhancement in the Superior Colliculus, Neural Computation, vol.12, issue.5, pp.1165-1187, 2000.
DOI : 10.1016/S0079-6123(08)63337-3

E. Arnaud, H. Christensen, Y. C. Lu, J. Barker, V. Khalidov et al., The CAVA corpus, Proceedings of the 10th international conference on Multimodal interfaces, ICMI '08, p.17, 2008.
DOI : 10.1145/1452392.1452414

URL : https://hal.archives-ouvertes.fr/inria-00373173

E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler et al., The BANCA Database and Evaluation Protocol, AVBPA, p.17, 2003.
DOI : 10.1007/3-540-44887-X_74

M. Beal, N. Jojic, and H. Attias, A graphical model for audiovisual object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.25, issue.7, pp.828-836, 2003.
DOI : 10.1109/TPAMI.2003.1206512

A. Benveniste, M. Métivier, and P. Priouret, Adaptive algorithms and stochastic approximations, Applications of Mathematics, vol.22, p.118, 1990.
DOI : 10.1007/978-3-642-75894-2

K. Bernardin and R. Stiefelhagen, Audio-visual multi-person tracking and identification for smart environments, Proceedings of the 15th international conference on Multimedia, MULTIMEDIA '07, 2007.
DOI : 10.1145/1291233.1291388

P. Bhat, B. Curless, M. F. Cohen, and C. L. Zitnick, Fourier Analysis of the 2D Screened Poisson Equation for Gradient Domain Problems, Proc. of ECCV, pp.114-128, 2008.
DOI : 10.1007/978-3-540-88688-4_9

C. Biernacki, G. Celeux, and G. Govaert, Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.22, issue.7, pp.719-725, 2000.
DOI : 10.1109/34.865189

C. Biernacki, G. Celeux, and G. Govaert, Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models, Computational Statistics & Data Analysis, vol.41, issue.3-4, pp.561-575, 2003.
DOI : 10.1016/S0167-9473(02)00163-9

R. Boyles, On the convergence of EM algorithms, Journal of the Royal Statistical Society: Series B, vol.45, issue.1, pp.47-50, 1983.

R. Brunelli, A. Brutti, P. Chippendale, O. Lanz, M. Omologo et al., A Generative Approach to Audio-Visual Person Tracking, Multimodal Technologies for Perception of Humans, 2007.
DOI : 10.1007/978-3-540-69568-4_3

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.332.1535

D. Burr and D. Alais, Combining visual and auditory information, Progress in Brain Research, vol.155, pp.243-258, 2006.
DOI : 10.1016/S0079-6123(06)55014-9

J. Castellanos and J. Tardos, Simultaneous map building and localization for mobile robots: a multisensor fusion approach, Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No.98CH36146), p.115, 1999.
DOI : 10.1109/ROBOT.1998.677271

G. Celeux and G. Soromenho, An entropy criterion for assessing the number of clusters in a mixture model, Journal of Classification, vol.13, issue.2, pp.195-212, 1996.
DOI : 10.1007/BF01246098

URL : https://hal.archives-ouvertes.fr/inria-00074799

G. Celeux, F. Forbes, and N. Peyrard, EM procedures using mean field-like approximations for Markov model-based image segmentation, Pattern Recognition, vol.36, issue.1, pp.131-144, 2003.
DOI : 10.1016/S0031-3203(02)00027-4

URL : https://hal.archives-ouvertes.fr/inria-00072526

N. Checka, K. Wilson, M. Siracusa, and T. Darrell, Multiple person and speaker activity tracking with a particle filter, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.881-884, 2004.
DOI : 10.1109/ICASSP.2004.1327252

L. Chen, R. T. Rose, Y. Qiao, I. Kimbara, F. Parrill et al., VACE multimodal meeting corpus, MLMI, p.17, 2005.

S. Chrétien and A. Hero, Kullback proximal algorithms for maximum-likelihood estimation, IEEE Transactions on Information Theory, vol.46, issue.5, pp.1800-1810, 2000.
DOI : 10.1109/18.857792

H. Christensen, N. Ma, S. N. Wrigley, and J. Barker, Integrating Pitch and Localisation Cues at a Speech Fragment Level, Proc. of Interspeech, pp.2769-2772, 2007.

G. Ciuperca, A. Ridolfi, and J. Idier, Penalized Maximum Likelihood Estimator for Normal Mixtures, Scandinavian Journal of Statistics, vol.30, issue.1, pp.45-59, 2003.
DOI : 10.1109/34.730550

E. Coiras, F. Baralli, and B. Evans, Rigid data association for shallow water surveys, IET Radar, Sonar & Navigation, vol.1, issue.5, pp.354-361, 2007.
DOI : 10.1049/iet-rsn:20070028

H. Colonius and P. Arndt, A two-stage model for visual-auditory interaction in saccadic latencies, Perception & Psychophysics, vol.63, issue.1, pp.126-147, 2001.
DOI : 10.3758/BF03200508

M. Cooke, Modelling auditory processing and organisation, p.16, 1993.

J. Courchay, A. Dalalyan, R. Keriven, and P. Sturm, A Global Camera Network Calibration Method with Linear Programming, Proc. of Int. Symp. on 3D Data Processing, Visualization and Transmission, p.23, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00523984

J. Cui, H. Zha, H. Zhao, and R. Shibasaki, Multi-modal tracking of people using laser scanners and video camera, Image and Vision Computing, vol.26, issue.2, pp.240-252, 2008.
DOI : 10.1016/j.imavis.2007.05.005

A. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society: Series B, vol.39, issue.1, pp.1-38, 1977.

J. DiBiase, H. Silverman, and M. Brandstein, Robust Localization in Reverberant Rooms, Microphone Arrays: Signal Processing Techniques and Applications, p.55, 2001.
DOI : 10.1007/978-3-662-04619-7_8

A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo methods in practice, Statistics for Engineering and Information Science, p.116, 2001.
DOI : 10.1007/978-1-4757-3437-9

S. Ermakov, Die Monte-Carlo-Methode und verwandte Fragen (in German), p.129, 1975.

M. Ernst and M. S. Banks, Humans integrate visual and haptic information in a statistically optimal fashion, Nature, vol.415, issue.6870, pp.429-433, 2002.
DOI : 10.1038/415429a

C. Faller and J. Merimaa, Sound Localization in Complex Listening Situations: Selection of Binaural Cues Based on Interaural Coherence, 2004.

O. Faugeras, Three dimensional computer vision: A geometric viewpoint, p.55, 1993.

J. Fisher III, T. Darrell, W. T. Freeman, and P. Viola, Learning Joint Statistical Models for Audio-Visual Fusion and Segregation, Proceedings of Annual Conference on Advances in Neural Information Processing Systems, p.44, 2001.

J. Fisher III and T. Darrell, Speaker Association With Signal-Level Audiovisual Fusion, IEEE Transactions on Multimedia, vol.6, issue.3, pp.406-413, 2004.
DOI : 10.1109/TMM.2004.827503

D. Forsyth and J. Ponce, Computer vision: a modern approach, p.54, 2003.
URL : https://hal.archives-ouvertes.fr/hal-01063327

A. Garg, V. Pavlović, and J. Rehg, Boosted learning in dynamic Bayesian networks for multimodal speaker detection, Proceedings of the IEEE, vol.91, issue.9, pp.1355-1369, 2003.
DOI : 10.1109/JPROC.2003.817119

E. Gassiat, Likelihood ratio inequalities with applications to various mixtures, Annales de l'Institut Henri Poincare (B) Probability and Statistics, vol.38, issue.6, pp.897-906, 2002.
DOI : 10.1016/S0246-0203(02)01125-1

D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.2, pp.601-616, 2007.
DOI : 10.1109/TASL.2006.881678

B. Glasberg and B. C. Moore, Derivation of auditory filter shapes from notched-noise data, Hearing Research, vol.47, issue.1-2, pp.103-138, 1990.
DOI : 10.1016/0378-5955(90)90170-T

S. Gould, P. Baumstarck, M. Quigley, A. Y. Ng, and D. Koller, Integrating Visual and Range Data for Robotic Object Detection, ECCV Workshop on Multicamera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), p.24, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00326789

D. Hall and S. A. McMullen, Mathematical techniques in multisensor data fusion, p.43, 2004.

M. Hansard and R. Horaud, Patterns of Binocular Disparity for a Fixating Observer, Proc. of Second International Symposium of Advances in Brain, Vision, and Artificial Intelligence, pp.308-317, 2007.
DOI : 10.1007/978-3-540-75555-5_29

URL : https://hal.archives-ouvertes.fr/inria-00590234

M. Hansard and R. Horaud, Cyclopean geometry of binocular vision, Journal of the Optical Society of America A, vol.25, issue.9, pp.2357-2369, 2008.
DOI : 10.1364/JOSAA.25.002357

URL : https://hal.archives-ouvertes.fr/inria-00435548

R. Hartley and A. Zisserman, Multiple view geometry in computer vision, p.74, 2003.
DOI : 10.1017/CBO9780511811685

S. Haykin and Z. Chen, The Cocktail Party Problem, Neural Computation, vol.17, issue.9, pp.1875-1902, 2005.
DOI : 10.1016/0378-5955(91)90148-3

T. Hazen, E. Saenko, C. La, and J. Glass, A segment-based audio-visual speech recognizer, Proceedings of the 6th international conference on Multimodal interfaces, ICMI '04, p.17, 2004.
DOI : 10.1145/1027933.1027972

M. Heckmann, F. Berthommier, and K. Kroschel, Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition, EURASIP Journal on Advances in Signal Processing, vol.2002, issue.11, pp.1260-1273, 2002.
DOI : 10.1155/S1110865702206150

M. Hofbauer, S. M. Wuerger, G. F. Meyer, F. Roehrbein, K. Schill et al., Catching audiovisual mice: Predicting the arrival time of auditory-visual motion signals, Cognitive, Affective, & Behavioral Neuroscience, vol.4, issue.2, pp.241-250, 2004.
DOI : 10.3758/CABN.4.2.241

T. Hospedales, J. J. Cartwright, and S. Vijayakumar, Structure Inference for Bayesian Multisensory Perception and Tracking, Proc. of IJCAI, pp.2122-2128, 2007.

T. Hospedales and S. Vijayakumar, Structure Inference for Bayesian Multisensory Scene Understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, issue.12, pp.2140-2157, 2008.
DOI : 10.1109/TPAMI.2008.25

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.494.6138

I. Howard and W. B. Templeton, Human spatial orientation, 1966.

M. Jacobsen, Point process theory and applications: marked point and piecewise deterministic processes, Probability and Its Applications, Birkhäuser, vol.114, issue.117, pp.99-119, 2006.

M. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An Introduction to Variational Methods for Graphical Models, Learning in Graphical Models, pp.105-162, 1998.
DOI : 10.1007/978-94-011-5014-9_5

R. Joshi and A. C. Sanderson, Multisensor fusion: A minimal representation framework, World Scientific, vol.44, p.115, 1999.
DOI : 10.1142/4106

D. Kadunce, J. W. Vaughan, M. T. Wallace, and B. E. Stein, The influence of visual and auditory receptive field organization on multisensory integration in the superior colliculus, Experimental Brain Research, vol.139, issue.3, pp.303-310, 2001.
DOI : 10.1007/s002210100772

R. Kalman and R. S. Bucy, New Results in Linear Filtering and Prediction Theory, Journal of Basic Engineering, vol.83, issue.1, pp.95-108, 1961.
DOI : 10.1115/1.3658902

V. Katkovnik and V. Spokoiny, Spatially Adaptive Estimation via Fitted Local Likelihood Techniques, IEEE Transactions on Signal Processing, vol.56, issue.3, pp.873-886, 2008.
DOI : 10.1109/TSP.2007.907873

V. Khalidov, F. Forbes, M. Hansard, E. Arnaud, and R. Horaud, Detection and Localization of 3D Audio-Visual Objects Using Unsupervised Clustering, Proc. of ICMI, 2008.

V. Khalidov, F. Forbes, M. Hansard, E. Arnaud, and R. Horaud, Audio-Visual Clustering for Multiple Speaker Localization, MLMI, pp.86-97, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00373154

V. Khalidov, F. Forbes, and R. Horaud, Conjugate Mixture Models for Clustering Multimodal Data, Neural Computation, vol.49, issue.3, pp.48-83, 2010.
DOI : 10.1007/978-94-011-3436-1

Z. Khan, T. Balch, and F. Dellaert, MCMC-based particle filtering for tracking a variable number of interacting targets, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.27, issue.11, pp.1805-1819, 2005.
DOI : 10.1109/TPAMI.2005.223

A. King, The superior colliculus, Current Biology, vol.14, issue.9, pp.335-338, 2004.
DOI : 10.1016/j.cub.2004.04.018

A. King, Multisensory Integration: Strategies for Synchronization, Current Biology, vol.15, issue.9, pp.339-341, 2005.
DOI : 10.1016/j.cub.2005.04.022

URL : http://doi.org/10.1016/j.cub.2005.04.022

D. Knill, Bayesian Models of Sensory Cue Integration, in K. Doya, S. Ishii, A. Pouget, and R. P. Rao (eds.), Bayesian Brain: Probabilistic Approaches to Neural Coding, pp.189-206, 2007.

A. Kushal, M. Rahurkar, L. Fei-Fei, J. Ponce, and T. Huang, Audio-Visual Speaker Localization Using Graphical Models, 18th International Conference on Pattern Recognition (ICPR'06), pp.291-294, 2006.
DOI : 10.1109/ICPR.2006.284

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.123.9061

G. Lathoud, J. Odobez, and D. Gatica-Perez, AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking, MLMI, pp.182-195, 2004.
DOI : 10.1007/978-3-540-30568-2_16

T. Lefebvre, H. Bruyninckx, and J. De Schutter, Kalman Filters for Nonlinear Systems, Int. J. of Control, vol.77, pp.639-653, 2001.
DOI : 10.1007/11533054_4

J. Lewald and R. Guski, Cross-modal perceptual integration of spatially and temporally disparate auditory and visual stimuli, Cognitive Brain Research, vol.16, issue.3, pp.468-478, 2003.
DOI : 10.1016/S0926-6410(03)00074-0

Y. Lu and M. Cooke, Motion strategies for binaural localisation of speech sources in azimuth and distance by artificial listeners, Speech Communication, vol.53, issue.5, p.125, 2010.
DOI : 10.1016/j.specom.2010.06.001

R. Luo, C. Yih, and K. L. Su, Multisensor fusion and integration: approaches, applications, and future research directions, IEEE Sensors Journal, vol.2, issue.2, pp.107-119, 2002.
DOI : 10.1109/JSEN.2002.1000251

J. Ma, L. Xu, and M. Jordan, Asymptotic Convergence Rate of the EM Algorithm for Gaussian Mixtures, Neural Computation, vol.12, issue.12, pp.2881-2907, 2000.
DOI : 10.1162/neco.1996.8.1.129

R. Mahler, A general theory of multitarget extended Kalman filters, Signal Processing, Sensor Fusion, and Target Recognition XIV, p.116, 2005.
DOI : 10.1117/12.603576

S. Majumder, S. Scheding, and H. Durrant-Whyte, Multisensor data fusion for underwater navigation, Robotics and Autonomous Systems, vol.35, issue.2, pp.97-108, 2001.
DOI : 10.1016/S0921-8890(00)00126-3

M. Mandel, D. P. Ellis, and T. Jebara, An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments, Advances in Neural Information Processing Systems 19, pp.953-960, 2007.

I. McCowan, D. Gatica-Perez, G. Lathoud, F. Monay, D. Moore et al., Modelling Human Interaction in Meetings, ICASSP, p.17, 2003.

I. McCowan, M. Lincoln, and I. Himawan, Microphone Array Shape Calibration in Diffuse Noise Fields, IEEE Transactions on Audio, Speech, and Language Processing, vol.16, issue.3, pp.666-670, 2008.
DOI : 10.1109/TASL.2007.911428

H. McGurk and J. MacDonald, Hearing lips and seeing voices, Nature, vol.264, issue.5588, pp.746-748, 1976.
DOI : 10.1038/264746a0

G. McLachlan and D. Peel, Finite mixture models, p.45, 2000.
DOI : 10.1002/0471721182

G. McLachlan and T. Krishnan, The EM algorithm and extensions, p.66, 2007.

M. Meila and D. Heckerman, An Experimental Comparison of Model-based Clustering Methods, Machine Learning, pp.9-29, 2001.

G. Meyer and S. Wuerger, Cross-modal integration of auditory and visual motion signals, Neuroreport, vol.12, issue.11, pp.2557-2560, 2001.
DOI : 10.1097/00001756-200108080-00053

G. Meyer, S. M. Wuerger, F. Röhrbein, and C. Zetzsche, Low-level integration of auditory and visual motion signals requires spatial co-localisation, Experimental Brain Research, vol.166, issue.3-4, pp.538-547, 2005.
DOI : 10.1007/s00221-005-2394-7

H. Mitchell, Multi-sensor data fusion, p.43, 2007.

H. W. Naus and C. V. van Wijk, Simultaneous Localization of Multiple Emitters, IEE Proceedings Radar Sonar and Navigation, pp.65-70, 2004.

A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, Dynamic Bayesian Networks for Audio-Visual Speech Recognition, EURASIP Journal on Advances in Signal Processing, vol.2002, issue.11, pp.1274-1288, 2002.
DOI : 10.1155/S1110865702206083

M. Nevelson and R. Z. Khasminskii, Stochastic approximation and recursive estimation, Translated from Russian, p.118, 1976.

K. Nickel, T. Gehrig, R. Stiefelhagen, and J. McDonough, A joint particle filter for audio-visual speaker tracking, Proceedings of the 7th international conference on Multimodal interfaces, ICMI '05, pp.61-68, 2005.
DOI : 10.1145/1088463.1088477

B. Pannetier, J. Dezert, and E. Pollard, Improvement of Multiple Ground Targets Tracking with GMTI Sensor and Fusion of Identification Attributes, 2008 IEEE Aerospace Conference, pp.1-13, 2008.
DOI : 10.1109/AERO.2008.4526437

R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang et al., Complex Sounds and Auditory Images, Auditory Physiology and Perception, pp.429-446, 1992.
DOI : 10.1016/B978-0-08-041847-6.50054-X

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.333.2780

E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), p.17, 2002.

D. Peel and G. J. McLachlan, Robust mixture modelling using the t distribution, Statistics and Computing, vol.10, issue.4, pp.339-348, 2000.
DOI : 10.1023/A:1008981510081

P. Pérez, J. Vermaak, and A. Blake, Data Fusion for Visual Tracking With Particles, Proceedings of the IEEE, pp.495-513, 2004.
DOI : 10.1109/JPROC.2003.823147

B. Polyak, Introduction to optimization, pp.58-69, 1987.

A. Pouget, S. Deneve, and J. Duhamel, A computational perspective on the neural basis of multisensory spatial representations, Nature Reviews Neuroscience, vol.3, issue.9, pp.741-747, 2002.
DOI : 10.1038/nrn914

A. Pouget, J. C. Ducom, J. Torri, and D. Bavelier, Multisensory spatial representations in eye-centered coordinates for reaching, Cognition, vol.83, issue.1, pp.1-11, 2002.
DOI : 10.1016/S0010-0277(01)00163-9

V. Raykar and R. Duraiswami, Automatic position calibration of multiple microphones, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.69-72, 2004.
DOI : 10.1109/ICASSP.2004.1326765

J. Rissanen, Modeling by shortest data description, Automatica, vol.14, issue.5, pp.465-471, 1978.
DOI : 10.1016/0005-1098(78)90005-5

B. Rozovskii, Stochastic evolution systems, Mathematics and its Applications, p.117, 1990.

X. Shao and J. Barker, Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment, Speech Communication, vol.50, issue.4, pp.337-353, 2008.
DOI : 10.1016/j.specom.2007.11.002

URL : https://hal.archives-ouvertes.fr/hal-00499201

J. Shi and C. Tomasi, Good Features to Track, CVPR, pp.593-600, 1994.

M. Slaney, An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank, p.16, 1993.

D. Smith and S. Singh, Approaches to Multisensor Data Fusion in Target Tracking: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol.18, issue.12, pp.1696-1710, 2006.
DOI : 10.1109/TKDE.2006.183

J. Spall, Introduction to stochastic search and optimization: Estimation, simulation and control, pp.34-58, 2003.

C. Spence and J. Driver, Crossmodal space and crossmodal attention, 2004.
DOI : 10.1093/acprof:oso/9780198524861.001.0001

L. Spinello, R. Triebel, and R. Siegwart, Multimodal Detection and Tracking of Pedestrians in Urban Environments with Explicit Ground Plane Extraction, IROS, pp.1823-1829, 2008.

T. Stanford and B. E. Stein, Superadditivity in multisensory integration: putting the computation in context, NeuroReport, vol.18, issue.8, pp.787-792, 2007.
DOI : 10.1097/WNR.0b013e3280c1e315

B. Stein, W. S. Huneycutt, and M. A. Meredith, Neurons and behavior: the same rules of multisensory integration apply, Brain Research, vol.448, issue.2, pp.355-358, 1988.
DOI : 10.1016/0006-8993(88)91276-0

B. Stein and M. A. Meredith, The merging of the senses, 1993.

B. Stein and T. R. Stanford, Multisensory integration: current issues from the perspective of the single neuron, Nature Reviews Neuroscience, vol.9, issue.4, pp.255-266, 2008.
DOI : 10.1016/j.neuron.2007.12.013

T. Svoboda, D. Martinec, and T. Pajdla, A Convenient Multicamera Self-Calibration for Virtual Environments, Presence: Teleoperators and Virtual Environments, vol.14, issue.4, pp.407-422, 2005.
DOI : 10.1109/34.888718

N. Ueda and R. Nakano, Deterministic annealing EM algorithm, Neural Networks, vol.11, issue.2, pp.271-282, 1998.
DOI : 10.1016/S0893-6080(97)00133-0

N. Van Kampen, Stochastic processes in physics and chemistry, North Holland, 3rd edition, p.90, 2007.

J. Vermaak, M. Gangnet, A. Blake, and P. Pérez, Sequential Monte Carlo fusion of sound and vision for speaker tracking, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, pp.741-746, 2001.
DOI : 10.1109/ICCV.2001.937600

E. Wan and R. van der Merwe, The unscented Kalman filter for non-linear estimation, Proc. Symp. on Adaptive Syst. for Sign. Proc., Comm. and Control, p.116, 2000.

D. Wang and G. J. Brown, Computational auditory scene analysis: Principles, algorithms, and applications, p.55, 2006.
DOI : 10.1109/9780470043387

M. Wasan, Stochastic approximation, p.118, 1969.

K. Wilson, N. Checka, D. Demirdjian, and T. Darrell, Audio-video array source separation for perceptual user interfaces, Proceedings of the 2001 workshop on Perceptive user interfaces, PUI '01, pp.1-7, 2001.
DOI : 10.1145/971478.971500

L. Xu, Comparative Analysis on Convergence Rates of The EM Algorithm and Its Two Modifications for Gaussian Mixtures, Neural Processing Letters, vol.6, issue.3, pp.69-76, 1997.
DOI : 10.1023/A:1009627306313

J. Yao and J. Odobez, Multi-Camera Multi-Person 3D Space Tracking with MCMC in Surveillance Scenarios, M2SFA2, p.119, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00326747

Z. Zeng, J. Tu, M. Liu, T. Huang, B. Pianfetti et al., Audio-Visual Affect Recognition, IEEE Transactions on Multimedia, vol.9, issue.2, pp.424-428, 2007.
DOI : 10.1109/TMM.2006.886310

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.112.2771

A. Zhigljavsky, Theory of global random search, pp.74-127, 1991.
DOI : 10.1007/978-94-011-3436-1

A. Zhigljavsky and A. Zilinskas, Stochastic Global Optimization, 2008.
DOI : 10.1007/978-3-642-04898-2_570

D. Zotkin, R. Duraiswami, and L. S. Davis, Joint Audio-Visual Tracking Using Particle Filters, EURASIP Journal on Advances in Signal Processing, vol.2002, issue.11, pp.1154-1164, 2002.
DOI : 10.1155/S1110865702206058