A.5 PUBLICATIONS AND SUBMISSIONS

Here is the list of papers that have been published or submitted during my PhD.

JOURNAL SUBMISSIONS:

Y. Ban, X. Alameda-Pineda, C. Evers, and R. Horaud, Tracking Multiple Audio Sources with the von Mises Distribution and Variational EM, 2019.

Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01950866

X. Li*, Y. Ban*, L. Girin et al., Online Localization and Tracking of Multiple Speakers in Reverberant Environments, IEEE Journal of Selected Topics in Signal Processing, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01851985

CONFERENCE PAPERS:

Y. Ban, X. Li, X. Alameda-Pineda, L. Girin, and R. Horaud, Accounting for Room Acoustics in Audio-Visual Multi-Speaker Tracking, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01718114

Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking, ICCV Workshop on Computer Vision for Audio-Visual Media, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01577965

Y. Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud, Tracking a Varying Number of People with a Visually-Controlled Robotic Head, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01542987

Y. Ban, S. Ba, X. Alameda-Pineda, and R. Horaud, Tracking Multiple Persons Based on a Variational Bayesian Model, European Conference on Computer Vision Workshops, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01359559

Y. Xu, Y. Ban, X. Alameda-Pineda, and R. Horaud, DeepMOT: A Differentiable Framework for Training Multiple Object Trackers, 2019.

Intervals in the trajectories are speaking pauses. (b)-(e) One-dimensional heat maps as a function of time for the four tested localization methods. (f) Results for the proposed VEM-based tracker. Black and red colors indicate successful tracking, i.e., continuity of the tracks despite speech pauses.

Results of speaker localization and tracking for one sequence of the Kinovis-MST dataset. (a) Ground-truth trajectory and voice activity (red for speaker 1, black for speaker 2, blue for speaker 3). (b)-(e) One-dimensional heat maps as a function of time for the four tested localization methods. (f) Results for the proposed VEM-based tracker.

Results obtained with recordings #1 (left) and #2 (right) from Task 6 of the LOCATA dataset. Top to bottom: vM-PHD, GM-FO [74], vM-VEM (proposed), and ground-truth trajectories. Different colors represent different audio sources. Note that vM-PHD is unable to associate sources with trajectories, p.107

Evaluation of the proposed multiple-person tracking method with different features on the seven sequences of the MOT16 test dataset, p.45

MOT scores for the meeting-room sequences (partial camera field of view), p.76

DER (diarization error rate) scores obtained with the AVDIAR dataset, p.81

Localization and tracking results for the LOCATA data, p.105

Localization and tracking results for the Kinovis-MST dataset, p.107

Method evaluation with the LOCATA dataset

D. J. Agravante, J. Pagès, and F. Chaumette, Visual servoing for the REEM humanoid robot's upper body, IEEE International Conference on Robotics and Automation, 2013.

X. Alameda-Pineda and R. Horaud, A geometric approach to sound source localization from time-delay estimates, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.6, pp.1082-1095, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00910081

X. Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland et al., Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.2, pp.356-370, 2012.

M. Antonelli, A. P. del Pobil, and M. Rucci, Bayesian multimodal integration in a robot replicating human head and eye movements, IEEE International Conference on Robotics and Automation, 2014.

Y. Avargel and I. Cohen, System identification in the short-time Fourier transform domain with crossband filtering, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.4, pp.1305-1319, 2007.

S. Ba, X. Alameda-Pineda, A. Xompero, and R. Horaud, An online variational Bayesian model for multi-person tracking from cluttered scenes, Computer Vision and Image Understanding, vol.153, pp.64-76, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349763

S. Bae and K. Yoon, Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.3, pp.595-610, 2018.

Y. Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud, Tracking a varying number of people with a visually-controlled robotic head, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.4144-4151, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01542987

Y. Ban, X. Alameda-Pineda, C. Evers, and R. Horaud, Tracking multiple audio sources with the von Mises distribution and variational EM, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01969050

Y. Ban, X. Alameda-Pineda, L. Girin, and R. Horaud, Variational Bayesian inference for audio-visual tracking of multiple speakers, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01950866

Y. Ban, S. Ba, X. Alameda-Pineda, and R. Horaud, Tracking multiple persons based on a variational Bayesian model, European Conference on Computer Vision Workshops, pp.52-67, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01359559

Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, Exploiting the complementarity of audio and visual data in multi-speaker tracking, IEEE ICCV Workshop on Computer Vision for Audio-Visual Media, pp.446-454, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01577965

Y. Ban, X. Li, X. Alameda-Pineda, L. Girin, and R. Horaud, Accounting for room acoustics in audio-visual multi-speaker tracking, IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01718114

Y. Bar-Shalom, F. Daum, and J. Huang, The probabilistic data association filter: estimation in the presence of measurement origin uncertainty, IEEE Control Systems Magazine, vol.29, issue.6, pp.82-100, 2009.

Y. Bar-Shalom, Multitarget-Multisensor Tracking: Advanced Applications, 1990.

Y. Bar-Shalom, P. K. Willett, and X. Tian, Tracking and data fusion, YBS Publishing, Storrs, 2011.

D. Bechler and M. Grimm, Speaker tracking with a microphone array using Kalman filtering, Advances in Radio Science, pp.113-117, 2003.

C. Bishop, Pattern Recognition and Machine Learning, 2006.

S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems (Artech House Radar Library), Artech House, 1999.

S. S. Blackman, Multiple hypothesis tracking for multiple target tracking, IEEE Aerospace and Electronic Systems Magazine, vol.19, pp.5-18, 2004.

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, IEEE Conference on Computer Vision and Pattern Recognition, pp.7291-7299, 2017.

V. Cevher, R. Velmurugan, and J. McClellan, Acoustic multitarget tracking using direction-of-arrival batches, IEEE Transactions on Signal Processing, vol.55, issue.6, pp.2810-2825, 2007.

N. Checka, K. Wilson, M. Siracusa, and T. Darrell, Multiple person and speaker activity tracking with a particle filter, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.881-884, 2004.

J. Chen, J. Benesty, and Y. Huang, Time delay estimation in room acoustic environments: an overview, EURASIP Journal on Applied Signal Processing, pp.170-170, 2006.

W. Choi, Near-online multi-target tracking with aggregated local flow descriptor, IEEE International Conference on Computer Vision, 2015.

W. Choi, C. Pantofaru, and S. Savarese, A general framework for tracking multiple people from a moving camera, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, 2013.

D. E. Clark and J. Bell, Convergence results for the particle PHD filter, IEEE Transactions on Signal Processing, vol.54, issue.7, pp.2652-2661, 2006.

I. J. Cox and S. L. Hingorani, An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.18, issue.2, pp.138-150, 1996.

A. Crétual and F. Chaumette, Application of motion-based visual servoing to target tracking, The International Journal of Robotics Research, vol.20, issue.11, 2001.

A. Crétual, F. Chaumette, and P. Bouthemy, Complex object tracking by visual servoing based on 2D image motion, International Conference on Pattern Recognition, vol.2, 1998.

A. Deleforge, F. Forbes, and R. Horaud, High-dimensional regression with Gaussian mixtures and partially-latent response variables, Statistics and Computing, vol.25, issue.5, pp.893-911, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01107604

A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, Co-localization of audio sources in images using binaural features and locally-linear regression, IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.23, issue.4, pp.718-731, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01112834

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (Methodological), vol.39, issue.1, pp.1-22, 1977.

J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, Robust localization in reverberant rooms, Microphone Arrays, pp.157-180, 2001.

C. Dicle, O. I. Camps, and M. Sznaier, The way they move: Tracking multiple targets with similar appearance, Proceedings of the IEEE International Conference on Computer Vision, pp.2304-2311, 2013.

S. Doclo and M. Moonen, Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments, EURASIP Journal on Applied Signal Processing, pp.1110-1124, 2003.

Y. Dorfan and S. Gannot, Tree-based recursive expectation-maximization algorithm for localization of acoustic sources, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.10, pp.1692-1703, 2015.

T. G. Dvorkind and S. Gannot, Time difference of arrival estimation of speech source in a noisy and reverberant environment, Signal Processing, vol.85, issue.1, pp.177-204, 2005.

B. Espiau, F. Chaumette, and P. Rives, A new approach to visual servoing in robotics, IEEE Transactions on Robotics and Automation, vol.8, issue.3, 1992.

C. Evers, E. A. P. Habets, S. Gannot, and P. A. Naylor, DoA reliability for distributed acoustic tracking, IEEE Signal Processing Letters, 2018.

C. Evers, A. H. Moore, P. A. Naylor, J. Sheaffer, and B. Rafaely, Bearing-only acoustic tracking of moving speakers for robot audition, IEEE International Conference on Digital Signal Processing (DSP), pp.1206-1210, 2015.

C. Evers and P. A. Naylor, Acoustic SLAM, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.26, pp.1484-1498, 2018.

L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle, Improving multi-frame data association with sparse representations for robust near-online multi-object tracking, European Conference on Computer Vision, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01763166

M. F. Fallon and S. J. Godsill, Acoustic source localization and tracking of a time-varying number of speakers, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.4, pp.1409-1415, 2012.

C. Gaskett, L. Fletcher, and A. Zelinsky, Reinforcement learning for visual servoing of a mobile robot, Australian Conference on Robotics and Automation, 2000.

D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, Audiovisual probabilistic tracking of multiple speakers in meetings, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.2, pp.601-616, 2007.

I. D. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud, EM algorithms for weighted-data clustering with application to audio-visual scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.12, pp.2402-2415, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01261374

I. D. Gebru, S. Ba, X. Li, and R. Horaud, Audio-visual speaker diarization based on spatiotemporal Bayesian fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.5, pp.1086-1099, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01413403

A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun, 3D traffic scene understanding from movable platforms, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, pp.1012-1025, 2014.

B. Gold, N. Morgan, and D. Ellis, Speech and audio signal processing: processing and perception of speech and music, 2011.

P. J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika, vol.82, issue.4, pp.711-732, 1995.

P. J. Green, Trans-dimensional Markov chain Monte Carlo, Oxford Statistical Science Series, pp.179-198, 2003.

S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick et al., Joint probabilistic data association revisited, Proceedings of the IEEE International Conference on Computer Vision, pp.3047-3055, 2015.

A. Heili, A. Lopez-Mendez, and J. Odobez, Exploiting long-term connectivity and visual motion in CRF-based multi-person tracking, IEEE Transactions on Image Processing, vol.23, issue.7, pp.3040-3056, 2014.

T. M. Hospedales and S. Vijayakumar, Structure inference for Bayesian multisensory scene understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, issue.12, pp.2140-2157, 2008.

Y. Huang and J. Benesty, Adaptive multichannel time delay estimation based on blind system identification for acoustic source localization, Adaptive Signal Processing, pp.227-247, 2003.

S. Hutchinson, G. D. Hager, and P. I. Corke, A tutorial on visual servo control, IEEE Transactions on Robotics and Automation, vol.12, issue.5, 1996.

C. T. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.2027-2032, 2009.

R. E. Kalman, A new approach to linear filtering and prediction problems, Journal of Basic Engineering, vol.82, issue.1, pp.35-45, 1960.

Z. Khan, T. Balch, and F. Dellaert, An MCMC-based particle filter for tracking multiple interacting targets, European Conference on Computer Vision, pp.279-290, 2004.

V. Kılıç, M. Barnard, W. Wang, A. Hilton, and J. Kittler, Mean-shift and sparse sampling-based SMC-PHD filtering for audio informed visual speaker tracking, IEEE Transactions on Multimedia, vol.18, issue.12, p.2417, 2016.

V. Kılıç, M. Barnard, W. Wang, and J. Kittler, Audio assisted robust visual tracking with adaptive particle filtering, IEEE Transactions on Multimedia, vol.17, issue.2, pp.186-200, 2015.

J. Kivinen and M. K. Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, vol.132, issue.1, pp.1-63, 1997.

C. Knapp and G. Carter, The generalized correlation method for estimation of time delay, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.24, issue.4, pp.320-327, 1976.

K. Kowalczyk, E. A. P. Habets, W. Kellermann, and P. A. Naylor, Blind system identification using sparse learning for TDOA estimation of room reflections, IEEE Signal Processing Letters, vol.20, issue.7, pp.653-656, 2013.

T. Kurien, Issues in the design of practical multitarget tracking algorithms, Multitarget-Multisensor Tracking: Advanced Applications, pp.43-87, 1990.

G. Lathoud, J. Odobez, and D. Gatica-Perez, AV16.3: an audio-visual corpus for speaker localization and tracking, Machine Learning for Multimodal Interaction, pp.182-195, 2004.

X. Li, Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, Online localization and tracking of multiple moving speakers in reverberant environment, 2018.

X. Li, L. Girin, F. Badeig, and R. Horaud, Reverberant sound localization with a robot head based on direct-path relative transfer function, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.2819-2826, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349771

X. Li, L. Girin, R. Horaud, and S. Gannot, Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.320-324, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01119186

X. Li, L. Girin, R. Horaud, and S. Gannot, Estimation of the direct-path relative transfer function for supervised sound-source localization, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, issue.11, pp.2171-2186, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349691

X. Li, L. Girin, R. Horaud, and S. Gannot, Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.10, pp.1997-2012, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01413417

X. Li, R. Horaud, L. Girin, and S. Gannot, Voice activity detection based on statistical likelihood ratio with adaptive thresholding, IEEE International Workshop on Acoustic Signal Enhancement, pp.1-5, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349776

X. Li, B. Mourgue, L. Girin, S. Gannot, and R. Horaud, Online localization of multiple moving speakers in reverberant environments, The Tenth IEEE Workshop on Sensor Array and Multichannel Signal Processing, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01795462

Z. Liang, X. Ma, and X. Dai, Robust tracking of moving sound source using multiple model Kalman filter, Applied acoustics, vol.69, issue.12, pp.1350-1355, 2008.

Y. Liu, A. Hilton, J. Chambers, Y. Zhao, and W. Wang, Non-zero diffusion particle flow SMC-PHD filter for audio-visual multi-speaker tracking, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2018.

Y. Liu, Q. Hu, Y. Zou, and W. Wang, Labelled non-zero particle flow for SMC-PHD filtering, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Y. Liu, W. Wang, J. Chambers, V. Kilic, and A. Hilton, Particle flow SMC-PHD filter for audio-visual multi-speaker tracking, International Conference on Latent Variable Analysis and Signal Separation, pp.344-353, 2017.

Y. Liu, W. Wang, and V. Kilic, Intensity particle flow SMC-PHD filter for audio speaker tracking, LOCATA Workshop, 2018.

Y. Liu, W. Wang, and Y. Zhao, Particle flow for sequential Monte Carlo implementation of probability hypothesis density, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4371-4375, 2017.

H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss et al., The LOCATA challenge data corpus for acoustic source localization and tracking, IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), 2018.

A. Lombard, Y. Zheng, H. Buchner, and W. Kellermann, TDOA estimation for multiple sound sources in noisy and reverberant environments using broadband independent component analysis, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.6, pp.1490-1503, 2011.

L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang et al., DETRAC: A new benchmark and protocol for multi-object tracking, 2015.

W. Luo, J. Xing, X. Zhang, W. Zhao, and T. Kim, Multiple object tracking: a review, 2015.

W. K. Ma, B. N. Vo, and S. S. Singh, Tracking an unknown time-varying number of speakers using TDOA measurements: a random finite set approach, IEEE Transactions on Signal Processing, vol.54, issue.9, pp.3291-3304, 2006.

Y. Ma and A. Nishihara, Efficient voice activity detection algorithm using long-term spectral flatness measure, EURASIP Journal on Audio, Speech, and Music Processing, vol.2013, issue.1, pp.1-18, 2013.

E. Maggio, M. Taj, and A. Cavallaro, Efficient multitarget visual tracking using random finite sets, IEEE Transactions on Circuits and Systems for Video Technology, vol.18, issue.8, pp.1016-1027, 2008.

R. P. Mahler, A theoretical foundation for the Stein-Winter "probability hypothesis density (PHD)" multitarget tracking approach, 2000.

R. P. Mahler, Multitarget Bayes filtering via first-order multitarget moments, IEEE Trans. Aerosp. Electron. Syst, vol.39, issue.4, pp.1152-1178, 2003.

R. P. Mahler, Statistics 101 for multisensor, multitarget data fusion, IEEE Aerospace and Electronic Systems Magazine, vol.19, pp.53-64, 2004.

R. P. Mahler, Statistics 102 for multisensor multitarget data fusion, IEEE Journal of Selected Topics in Signal Processing, vol.19, issue.1, pp.53-64, 2013.

R. Mahler, PHD filters of higher order in target number, IEEE Transactions on Aerospace and Electronic Systems, vol.43, issue.4, 2007.

R. P. S. Mahler, Multisource multitarget filtering: a unified approach, International Society for Optics and Photonics, pp.296-307, 1998.

E. Malis, F. Chaumette, and S. Boudet, 2 1/2 D visual servoing, IEEE Transactions on Robotics and Automation, vol.15, issue.2, 1999.
URL : https://hal.archives-ouvertes.fr/inria-00073302

M. I. Mandel, R. J. Weiss, and D. P. W. Ellis, Model-based expectation-maximization source separation and localization, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.2, pp.382-394, 2010.

É. Marchand and F. Chaumette, Feature tracking for visual servoing purposes, Robotics and Autonomous Systems, vol.52, issue.1, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00351898

K. V. Mardia and P. E. Jupp, Directional Statistics, vol.494, 2009.

I. Marković, J. Ćesić, and I. Petrović, Von Mises mixture PHD filter, IEEE Signal Processing Letters, vol.22, issue.12, pp.2229-2233, 2015.

I. Marković and I. Petrović, Bearing-only tracking with a mixture of von Mises distributions, IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.707-712, 2012.

L. Mejias, S. Saripalli, P. Campoy, and G. Sukhatme, Visual servoing of an autonomous helicopter in urban areas using feature tracking, Journal of Field Robotics, vol.23, issue.3-4, 2006.

A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, MOT16: A benchmark for multi-object tracking, 2016.

A. Milan, S. Roth, and K. Schindler, Continuous energy minimization for multitarget tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, issue.1, pp.58-72, 2014.

V. P. Minotto, C. R. Jung, and B. Lee, Multimodal multichannel on-line speaker diarization using sensor fusion through SVM, IEEE Transactions on Multimedia, vol.17, issue.10, pp.1694-1705, 2015.

A. Abou-Moughlbay, E. Cervera, and P. Martinet, Error regulation strategies for model based visual servoing tasks: Application to autonomous object grasping with NAO robot, International Conference on Control Automation Robotics & Vision, 2012.

J. Munkres, Algorithms for the assignment and transportation problems, Journal of the Society for Industrial and Applied Mathematics, vol.5, issue.1, pp.32-38, 1957.

S. M. Naqvi, M. Yu, and J. A. Chambers, A multimodal approach to blind source separation of moving sources, IEEE Journal of Selected Topics in Signal Processing, vol.4, issue.5, pp.895-910, 2010.

A. Noulas, G. Englebienne, and B. J. Krose, Multimodal speaker diarization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.1, pp.79-93, 2012.

D. Omrčen and A. Ude, Redundant control of a humanoid robot head with foveated vision for object tracking, IEEE International Conference on Robotics and Automation, 2010.

D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, Real-time multiple sound source localization and counting using a circular microphone array, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.10, pp.2193-2206, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01367320

H. Pirsiavash, D. Ramanan, and C. Fowlkes, Globally-optimal greedy algorithms for tracking a variable number of objects, Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp.1201-1208, 2011.

X. Qian, A. Brutti, M. Omologo, and A. Cavallaro, 3D audio-visual speaker tracking with an adaptive particle filter, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2896-2900, 2017.

X. Qian, A. Cavallaro, A. Brutti, and M. Omologo, LOCATA challenge: speaker localization with a planar array, LOCATA Workshop, 2018.

X. Qian, A. Xompero, A. Brutti, O. Lanz, M. Omologo, and A. Cavallaro, 3D mouth tracking from a compact microphone array co-located with a camera, International Conference on Acoustics, Speech and Signal Processing, 2018.

J. Rajruangrabin and D. O. Popa, Robot head motion control with an emphasis on realism of neck-eye coordination during object tracking, Journal of Intelligent & Robotic Systems, vol.63, issue.2, 2011.

D. Reid, An algorithm for tracking multiple targets, IEEE Transactions on Automatic Control, vol.24, issue.6, pp.843-854, 1979.

B. Ristic, B. Vo, and D. Clark, Performance evaluation of multi-target tracking using the OSPA metric, IEEE International Conference on Information Fusion, pp.1-7, 2010.

N. Roman and D. Wang, Binaural tracking of multiple moving sources, IEEE Transactions on Audio, Speech, and Language Processing, vol.16, issue.4, pp.728-739, 2008.

J. Satake and J. Miura, Robust stereo-based person detection and tracking for a person following robot, IEEE ICRA Workshop on People Detection and Tracking, 2009.

N. Schult, T. Reineking, T. Kluss, and C. Zetzsche, Information-driven active audio-visual source localization, PLoS ONE, vol.10, issue.9, 2015.

D. Schulz, W. Burgard, D. Fox, and A. Cremers, Tracking multiple moving targets with a mobile robot using particle filters and statistical data association, IEEE International Conference on Robotics and Automation, 2001.

O. Schwartz and S. Gannot, Speaker tracking using recursive EM algorithms, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, pp.392-402, 2014.

H. Sidenbladh, Multi-target particle filtering for the probability hypothesis density, IEEE International Conference on Information Fusion, pp.800-806, 2003.

V. Smidl and A. Quinn, The Variational Bayes Method in Signal Processing, 2006.

K. Smith, D. Gatica-perez, and J. Odobez, Using particles to track varying numbers of interacting people, IEEE Computer Vision and Pattern Recognition, pp.962-969, 2005.

K. Smith, D. Gatica-perez, J. Odobez, and S. Ba, Evaluating multi-object tracking, IEEE CVPR Workshop on Empirical Evaluation Methods in Computer Vision, pp.36-36, 2005.

R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, The CLEAR 2006 evaluation, International Evaluation Workshop on Classification of Events, Activities and Relationships, 2006.

R. Stolkin, I. Florescu, M. Baron, C. Harrier, and B. Kocherov, Efficient visual servoing with the ABCshift tracking algorithm, IEEE International Conference on Robotics and Automation, 2008.

R. Talmon, I. Cohen, and S. Gannot, Relative transfer function identification using convolutive transfer function approximation, IEEE Transactions on Audio, Speech, and Language Processing, vol.17, issue.4, pp.546-555, 2009.

J. Traa and P. Smaragdis, A wrapped Kalman filter for azimuthal speaker tracking, IEEE Signal Processing Letters, vol.20, issue.12, pp.1257-1260, 2013.

J. Valin, F. Michaud, and J. Rouat, Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering, Robotics and Autonomous Systems, vol.55, issue.3, pp.216-228, 2007.

L. Vannucci, N. Cauli, E. Falotico, A. Bernardino, and C. Laschi, Adaptive visual pursuit involving eye-head coordination and prediction of the target motion, IEEE-RAS International Conference on Humanoid Robots, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01118539

J. Vermaak, N. D. Lawrence, and P. Perez, Variational inference for visual tracking, IEEE Conference on Computer Vision and Pattern Recognition, pp.773-780, 2003.

J. Vermaak and A. Blake, Nonlinear filtering for speaker tracking in noisy and reverberant environments, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol.5, pp.3021-3024, 2001.

D. Vijayasenan and F. Valente, DiarTk: an open source toolkit for research in multistream speaker diarization and its application to meeting recordings, INTERSPEECH, pp.2170-2173, 2012.

B. Vo and W. Ma, The Gaussian mixture probability hypothesis density filter, IEEE Transactions on Signal Processing, vol.54, issue.11, pp.4091-4104, 2006.

B.-N. Vo, S. Singh, and A. Doucet, Sequential Monte Carlo methods for multitarget filtering with random finite sets, IEEE Transactions on Aerospace and Electronic Systems, vol.41, issue.4, pp.1224-1245, 2005.

B.-N. Vo, M. Mallick, Y. Bar-Shalom, S. Coraluppi et al., Multitarget tracking, Wiley Encyclopedia of Electrical and Electronics Engineering, pp.1-15, 2015.

B.-N. Vo, S. Singh, and A. Doucet, Random finite sets and sequential Monte Carlo methods in multi-target tracking, IEEE International Radar Conference, pp.486-491, 2003.

B.-N. Vo, S. Singh, and W.-K. Ma, Tracking multiple speakers using random sets, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol.2, p.357, 2004.

B.-N. Vo, B.-T. Vo, and D. Phung, Labeled random finite sets and the Bayes multi-target tracking filter, IEEE Transactions on Signal Processing, vol.62, issue.24, pp.6554-6567, 2014.

B.-T. Vo and B.-N. Vo, Labeled random finite sets and multi-object conjugate priors, IEEE Transactions on Signal Processing, vol.61, issue.13, pp.3460-3475, 2013.

B.-T. Vo, B.-N. Vo, and A. Cantoni, Analytic implementations of the cardinalized probability hypothesis density filter, IEEE Transactions on Signal Processing, vol.55, issue.7, pp.3553-3567, 2007.

D. B. Ward, E. A. Lehmann, and R. C. Williamson, Particle filtering algorithms for tracking an acoustic source in a reverberant environment, IEEE Transactions on Speech and Audio Processing, vol.11, issue.6, pp.826-836, 2003.

G. Xu, H. Liu, L. Tong, and T. Kailath, A least-squares approach to blind channel identification, IEEE Transactions on Signal Processing, vol.43, issue.12, pp.2982-2993, 1995.

Y. Xu, Y. Ban, X. Alameda-Pineda, and R. Horaud, DeepMOT: A differentiable framework for training multiple object trackers, 2019.

B. Yang and R. Nevatia, An online learned CRF model for multi-target tracking, IEEE Conference on Computer Vision and Pattern Recognition, pp.2034-2041, 2012.

M. Yang, Y. Liu, L. Wen, and Z. You, A probabilistic framework for multitarget tracking with mutual occlusions, IEEE Conference on Computer Vision and Pattern Recognition, pp.1298-1305, 2014.

O. Yilmaz and S. Rickard, Blind separation of speech mixtures via time-frequency masking, IEEE Transactions on Signal Processing, vol.52, issue.7, pp.1830-1847, 2004.

A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Computation, vol.15, issue.4, pp.915-936, 2003.

L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang et al., Person re-identification in the wild, IEEE Conference on Computer Vision and Pattern Recognition, pp.1367-1376, 2017.

X. Zhong and J. R. Hopgood, Particle filtering for TDOA based acoustic source tracking: Nonconcurrent multiple talkers, Signal Processing, vol.96, pp.382-394, 2014.