Biwi. The baseline for the Biwi data set is inspired by [44, 93], where the authors

Figure A.3: Example of images from the Fashion Landmark Dataset: landmarks detected as outliers by DeepGUM are shown in red, while inliers are shown in green. In all these images, the detected outliers correspond to occluded landmarks.

APPENDIX A. ARTICLES INCLUDED IN THIS MANUSCRIPT:

[92] S. Lathuilière, G. Evangelidis, and R. Horaud, Recognition of group activities in videos based on single- and two-person descriptors, IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.

[94] S. Lathuilière, R. Juge, P. Mesejo, R. Muñoz-Salinas, and R. Horaud, Deep mixture of linear inverse regressions applied to head-pose estimation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01504847

[96] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, DeepGUM: Deep Robust Regression with Gaussian-Uniform Mixtures, submitted to IEEE European Conference on Computer Vision (ECCV), 2018.

[91] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, A Comprehensive Analysis of Deep Regression, 2018.

E. Stepanov, S. Lathuilière, S. A. Chowdhury, A. Ghosh, R.-L. Vieriu, N. Sebe et al., Depression severity estimation from multiple modalities, 2018.

B. Ahn, J. Park, and I. S. Kweon, Real-time head orientation from a monocular camera using deep neural network, 2014.

M. R. Amer and S. Todorovic, A chains model for localizing participants of group activities in videos, ICCV, 2011.

M. R. Amer, P. Lei, and S. Todorovic, HIRF: Hierarchical random field for collective activity recognition in videos, ECCV, 2014.

M. Andriluka, S. Roth, and B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.1014-1021, 2009.
DOI : 10.1109/CVPR.2009.5206754

URL : http://www.gris.informatik.tu-darmstadt.de/~sroth/pubs/cvpr09andriluka.pdf

F. Badeig, Q. Pelorson, S. Arias, V. Drouard, I. Gebru, X. Li, G. Evangelidis, and R. Horaud, A distributed architecture for interacting with NAO, ACM ICMI, 2015.

Y. Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud, Tracking a varying number of people with a visually-controlled robotic head, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
DOI : 10.1109/IROS.2017.8206274

URL : https://hal.archives-ouvertes.fr/hal-01542987

J. D. Banfield and A. E. Raftery, Model-based Gaussian and non-Gaussian clustering, Biometrics, 1993.

A. J. Bekker and J. Goldberger, Training deep neural-networks based on unreliable labels, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
DOI : 10.1109/ICASSP.2016.7472164

V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, Robust Optimization for Deep Regression, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.324

URL : http://arxiv.org/pdf/1505.06606

G. Beliakov, A. V. Kelarev, and J. Yearwood, Robust artificial neural networks and outlier detection, 2011.

Y. Bengio, Practical Recommendations for Gradient-Based Training of Deep Architectures, Neural networks: Tricks of the trade, pp.437-478, 2012.
DOI : 10.1162/089976602317318938

URL : http://arxiv.org/pdf/1206.5533.pdf

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, 1994.
DOI : 10.1109/72.279181

URL : http://www.research.microsoft.com/~patrice/PDF/long_term.pdf

M. Bennewitz, F. Faber, D. Joho, M. Schreiber, and S. Behnke, Towards a humanoid museum guide robot that interacts with multiple persons, 5th IEEE-RAS International Conference on Humanoid Robots, pp.418-423, 2005.
DOI : 10.1109/ICHR.2005.1573603

URL : http://www.informatik.uni-freiburg.de/~maren/papers/bennewitz_humanoids05.pdf

M. J. Black and A. Rangarajan, On the unification of line processes, outlier rejection, and robust statistics with applications in early vision, IJCV, 1996.

A. Bulat and G. Tzimiropoulos, How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks), 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.116

URL : http://arxiv.org/pdf/1703.07332

X. P. Burgos-Artizzu, P. Perona, and P. Dollár, Robust face landmark estimation under occlusion, ICCV, pp.1513-1520, 2013.

G. Bustamante, P. Danés, T. Forgue, and A. Podlubne, Towards information-based feedback control for binaural active localization, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
DOI : 10.1109/ICASSP.2016.7472894

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.143

URL : http://arxiv.org/pdf/1611.08050

V. Chandrasekhar, J. Lin, O. Morère, H. Goh, and A. Veillard, A practical guide to CNNs and Fisher Vectors for image instance retrieval, Signal Processing, vol.128, pp.426-439, 2016.
DOI : 10.1016/j.sigpro.2016.05.021

URL : http://arxiv.org/pdf/1508.02496

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, JAIR, 2002.

B.-C. Chen, C.-S. Chen, and W. H. Hsu, Cross-age reference coding for age-invariant face recognition and retrieval, ECCV, 2014.

W. Choi and S. Savarese, A Unified Framework for Multi-target Tracking and Collective Activity Recognition, ECCV, 2012.
DOI : 10.1007/978-3-642-33765-9_16

URL : http://www.eecs.umich.edu/vision/papers/choi_eccv_12.pdf

W. Choi, Y. Chao, C. Pantofaru, and S. Savarese, Discovering Groups of People in Images, ECCV, 2014.
DOI : 10.1007/978-3-319-10593-2_28

URL : http://cvgl.stanford.edu/projects/groupdiscovery/eccv2014choi.pdf

W. Choi and S. Savarese, Understanding collective activities of people from videos, IEEE TPAMI, 2013.
DOI : 10.1109/tpami.2013.220

W. Choi, K. Shahid, and S. Savarese, What are they doing?: Collective activity classification using spatio-temporal relationship among people, ICCV Workshops, 2009.

W. Conover, Practical Nonparametric Statistics, 1998.

P. Coretto and C. Hennig, Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering, Journal of the American Statistical Association, vol.8, issue.516, 2016.
DOI : 10.1007/3-540-28084-7_79

URL : http://www.tandfonline.com/doi/pdf/10.1080/01621459.2015.1100996?needAccess=true

A. Crétual and F. Chaumette, Application of motion-based visual servoing to target tracking, IJRR, 2001.

F. Cruz, G. I. Parisi, J. Twiefel, and S. Wermter, Multimodal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario, IEEE/RSJ IROS, 2016.
DOI : 10.1109/iros.2016.7759137

F. Cupillard, F. Brémond, and M. Thonnat, Group behavior recognition with multiple cameras, Sixth IEEE Workshop on Applications of Computer Vision (WACV), 2002.
DOI : 10.1109/ACV.2002.1182178

URL : http://www-sop.inria.fr/orion/personnel/Francois.Bremond/Postscript/acv02.ps

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
DOI : 10.1109/CVPR.2005.177

URL : https://hal.archives-ouvertes.fr/inria-00548512

M. Dantone, J. Gall, G. Fanelli, and L. Van Gool, Real-time facial feature detection using conditional regression forests, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.2578-2585, 2012.
DOI : 10.1109/CVPR.2012.6247976

A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.4, 2015.
DOI : 10.1109/TASLP.2015.2405475

URL : https://hal.archives-ouvertes.fr/hal-01112834

A. Deleforge, F. Forbes, S. Ba, and R. Horaud, Hyper-Spectral Image Analysis with Partially-Latent Regression and Spatial Markov Dependencies, IEEE STSP, 2015.

A. Deleforge, F. Forbes, and R. Horaud, High-dimensional regression with gaussian mixtures and partially-latent response variables, Statistics and Computing, vol.19, issue.11, 2015.
DOI : 10.1109/TNN.2008.2003467

URL : https://hal.archives-ouvertes.fr/hal-01107604

M. Demirkus, D. Precup, and J. J. Clark, Hierarchical temporal graphical model for head pose estimation and subsequent attribute classification in real-world videos, Computer Vision and Image Understanding, vol.136, 2015.
DOI : 10.1016/j.cviu.2015.03.005

Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan et al., Deep Structured Models For Group Activity Recognition, Proceedings of the British Machine Vision Conference 2015, 2015.
DOI : 10.5244/C.29.179

URL : http://arxiv.org/pdf/1506.04191

J. Derrac, S. García, D. Molina, and F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm and Evolutionary Computation, vol.1, issue.1, pp.3-18, 2011.
DOI : 10.1016/j.swevo.2011.02.002

V. Drouard, S. Ba, G. Evangelidis, A. Deleforge, and R. Horaud, Head pose estimation via probabilistic high-dimensional regression, 2015 IEEE International Conference on Image Processing (ICIP), 2015.
DOI : 10.1109/ICIP.2015.7351683

URL : https://hal.archives-ouvertes.fr/hal-01163663

V. Drouard, R. Horaud, A. Deleforge, S. Ba, and G. Evangelidis, Robust head-pose estimation based on partially-latent mixture of linear regressions, IEEE TIP, 2016.

V. Drouard, R. Horaud, A. Deleforge, S. Ba, and G. Evangelidis, Robust head-pose estimation based on partially-latent mixture of linear regressions, IEEE TIP, vol.26, issue.3, pp.1428-1440, 2017.
DOI : 10.1109/tip.2017.2654165

URL : http://arxiv.org/pdf/1603.09732

V. Drouard, R. Horaud, A. Deleforge, S. Ba, and G. Evangelidis, Robust head-pose estimation based on partially-latent mixture of linear regressions, 2017.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, JMLR, vol.12, issue.7, pp.2121-2159, 2011.

O. J. Dunn, Multiple Comparisons among Means, Journal of the American Statistical Association, vol.25, issue.293, pp.52-64, 1961.
DOI : 10.1214/aoms/1177728724

D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent et al., Why Does Unsupervised Pre-training Help Deep Learning?, pp.625-660, 2010.

G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool, Random Forests for Real Time 3D Face Analysis, International Journal of Computer Vision, vol.41, issue.5, 2013.
DOI : 10.1109/TSMCB.2011.2148711

URL : http://files.is.tue.mpg.de/jgall/download/jgall_RFdepthFace_ijcv12.pdf

G. Fanelli, J. Gall, and L. Van Gool, Real time head pose estimation with random regression forests, CVPR 2011, pp.617-624, 2011.
DOI : 10.1109/CVPR.2011.5995458

R. A. Fisher, Statistical methods for research workers, 1925.

F. Forbes and D. Wraith, A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering, Statistics and Computing, vol.94, issue.1, 2014.
DOI : 10.1016/S0378-3758(00)00208-1

A. Galimzianova, F. Pernuš, B. Likar, and Ž. Špiclin, Robust estimation of unbalanced mixture models on samples with outliers, IEEE TPAMI, 2015.

J. Gan, L. Li, Y. Zhai, and Y. Liu, Deep self-taught learning for facial beauty prediction, Neurocomputing, vol.144, p.129, 2014.
DOI : 10.1016/j.neucom.2014.05.028

C. Gaskett, L. Fletcher, and A. Zelinsky, Reinforcement learning for visual servoing of a mobile robot, Australian Conference on Robotics and Automation, 2000.

I. Gebru, S. Ba, X. Li, and R. Horaud, Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.40, issue.5, 2017.
DOI : 10.1109/TPAMI.2017.2648793

URL : https://hal.archives-ouvertes.fr/hal-01413403

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science, 2003.

A. Ghadirzadeh, J. Bütepage, A. Maki, D. Kragic, and M. Björkman, A sensorimotor reinforcement learning framework for physical Human-Robot Interaction, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.
DOI : 10.1109/IROS.2016.7759417

URL : http://arxiv.org/pdf/1607.07939

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

URL : http://arxiv.org/pdf/1311.2524

I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, 2016.

I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. Book in preparation for, 2016.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, LSTM: A Search Space Odyssey, IEEE Transactions on Neural Networks and Learning Systems, vol.28, issue.10, pp.2222-2232, 2017.
DOI : 10.1109/TNNLS.2016.2582924

URL : http://arxiv.org/pdf/1503.04069

G. Guo, Y. Fu, C. R. Dyer, and T. S. Huang, Image-based human age estimation by manifold learning and locally adjusted robust regression, IEEE TIP, vol.17, issue.7, pp.1178-1188, 2008.

A. Gupta, A. Kembhavi, and L. S. Davis, Observing human-object interactions: Using spatial and functional compatibility for recognition, IEEE TPAMI, 2009.
DOI : 10.1109/tpami.2009.83

H. Hajimirsadeghi, W. Yan, A. Vahdat, and G. Mori, Visual recognition by counting instances: A multi-instance cardinality potential kernel, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298875

URL : http://arxiv.org/pdf/1502.02063

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90

URL : http://arxiv.org/pdf/1512.03385

G. Hinton and R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, vol.313, issue.5786, pp.504-507, 2006.
DOI : 10.1126/science.1127647

Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance, Biometrika, vol.75, issue.4, pp.800-802, 1988.
DOI : 10.1093/biomet/75.4.800

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, 1997.
DOI : 10.1162/neco.1997.9.8.1735

D. Hoiem, A. A. Efros, and M. Hebert, Putting objects in perspective, 2008.
DOI : 10.1109/cvpr.2006.232

URL : http://www.cs.cmu.edu/~dhoiem/publications/ijcv2008ObjectsInPerspective.pdf

B. Holland, An Improved Sequentially Rejective Bonferroni Test Procedure, Biometrics, vol.43, issue.2, pp.417-423, 1987.
DOI : 10.2307/2531823

URL : http://sci2s.ugr.es/keel/pdf/algorithm/articulo/1987-Holland-BIO.pdf

S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, vol.6, issue.2, pp.65-70, 1979.

P. J. Huber, Robust Statistics, 2004.
DOI : 10.1002/0471725250

M. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, A Hierarchical Deep Temporal Model for Group Activity Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.217

URL : http://arxiv.org/pdf/1511.06040

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint, 2015.

V. K. Ithapu, S. N. Ravi, and V. Singh, On architectural choices in deep learning: From network structure to gradient convergence and parameter estimation, 1702.

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, Reading Text in the Wild with Convolutional Neural Networks, International Journal of Computer Vision, vol.20, issue.9, 2016.
DOI : 10.1109/TIP.2011.2126586

URL : http://arxiv.org/pdf/1412.1842

M. Jain, J. C. van Gemert, and C. G. Snoek, What do 15,000 object categories tell us about classifying and localizing actions?, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298599

URL : https://pure.uva.nl/ws/files/2493740/167605_JainCVPR2015.pdf

S. Johnson and M. Everingham, Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation, Proceedings of the British Machine Vision Conference 2010, 2010.
DOI : 10.5244/C.24.12

URL : http://www.bmva.org/bmvc/2010/conference/paper12/paper12.pdf

S. Khamis, V. I. Morariu, and L. S. Davis, Combining Per-frame and Per-track Cues for Multi-person Action Recognition, ECCV, 2012.
DOI : 10.1007/978-3-642-33718-5_9

URL : http://www.umiacs.umd.edu/%7Esameh/khamis-eccv2012.pdf

H. J. Kim, B. M. Smith, N. Adluru, C. R. Dyer, S. C. Johnson et al., Abundant Inverse Regression Using Sufficient Reduction and Its Applications, ECCV, 2016.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2014.

H. Kjellström, J. Romero, D. Martínez, and D. Kragić, Simultaneous Visual Recognition of Manipulation Actions and Manipulated Objects, ECCV, 2008.
DOI : 10.1109/CVPR.2007.383299

J. Kober, J. A. Bagnell, and J. Peters, Reinforcement learning in robotics: A survey, p.131, 2013.
DOI : 10.1007/978-3-319-03194-1_2

URL : http://www.ri.cmu.edu/pub_files/2013/7/Kober_IJRR_2013.pdf

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.
DOI : 10.1162/neco.2009.10-08-881

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

T. Lan, Y. Wang, G. Mori, and S. N. Robinovitch, Retrieving Actions in Group Contexts, Trends and Topics in Computer Vision, 2010.
DOI : 10.1007/978-3-642-35749-7_14

URL : http://www.cs.sfu.ca/%7Emori/research/papers/lan_sga10.pdf

T. Lan, Y. Wang, W. Yang, and G. Mori, Beyond actions: Discriminative models for contextual group activities, NIPS, 2010.

T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori, Discriminative latent models for recognizing contextual group activities, IEEE TPAMI, 2012.

Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition, CVPR, 2015.

S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, A comprehensive analysis of deep regression, 2018.

S. Lathuilière, G. Evangelidis, and R. Horaud, Recognition of group activities in videos based on single- and two-person descriptors, IEEE WACV, 2017.

S. Lathuilière, R. Juge, P. Mesejo, R. Muñoz-Salinas, and R. Horaud, Deep Mixture of Linear Inverse Regressions Applied to Head-Pose Estimation, CVPR, 2017.

S. Lathuilière, R. Juge, P. Mesejo, R. Muñoz-Salinas, and R. Horaud, Deep mixture of linear inverse regressions applied to head-pose estimation, IEEE CVPR, 2017.

S. Lathuilière, B. Massé, P. Mesejo, and R. Horaud, Deep reinforcement learning for audio-visual servoing in human-robot interaction, 2017.

S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, DeepGUM: Deep robust regression with Gaussian-uniform mixtures, 2018.

Y. LeCun, L. Bottou, G. B. Orr, and K. Müller, Efficient backprop, Neural Networks: Tricks of the Trade, pp.9-50, 1998.

K. Li, Sliced Inverse Regression for Dimension Reduction, Journal of the American Statistical Association, vol.13, issue.414, 1991.
DOI : 10.1214/aos/1176345514

URL : http://www.unc.edu/~chongz/Spring2012/SIR.pdf

R. Li, R. Chellappa, and K. Zhou, Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition, CVPR, 2009.

X. Li, L. Zhao, L. Wei, M. Yang, F. Wu et al., DeepSaliency: Multi-Task Deep Neural Network Model for Salient Object Detection, IEEE Transactions on Image Processing, vol.25, issue.8, 2016.
DOI : 10.1109/TIP.2016.2579306

URL : http://arxiv.org/pdf/1510.05484

X. Li, L. Girin, F. Badeig, and R. Horaud, Reverberant sound localization with a robot head based on direct-path relative transfer function, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.
DOI : 10.1109/IROS.2016.7759437

URL : https://hal.archives-ouvertes.fr/hal-01349771

X. Li, L. Girin, R. Horaud, and S. Gannot, Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization With Spatial Sparsity Regularization, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue.10, 2017.
DOI : 10.1109/TASLP.2017.2740001

URL : https://hal.archives-ouvertes.fr/hal-01413417

Y. Li, J. Yang, Y. Song, L. Cao, J. Luo et al., Learning from Noisy Labels with Distillation. arXiv preprint, 2017.
DOI : 10.1109/iccv.2017.211

URL : http://arxiv.org/pdf/1703.02391

X. Liu, W. Liang, Y. Wang, S. Li, and M. Pei, 3D head pose estimation with convolutional neural network trained on synthetic images, 2016 IEEE International Conference on Image Processing (ICIP), 2016.
DOI : 10.1109/ICIP.2016.7532566

Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.124

Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang, Fashion Landmark Detection in the Wild, ECCV, 2016.
DOI : 10.5244/C.24.12

URL : http://arxiv.org/pdf/1608.03049

A. Magassouba, N. Bertin, and F. Chaumette, Aural Servo: Sensor-Based Control From Robot Audition, IEEE Transactions on Robotics, 2018.
DOI : 10.1109/TRO.2018.2805310

URL : https://hal.archives-ouvertes.fr/hal-01694366

R. A. Maronna, R. D. Martin, and V. J. Yohai, Robust statistics, 2006.

M. Marszalek, I. Laptev, and C. Schmid, Actions in context, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206557

URL : https://hal.archives-ouvertes.fr/inria-00548645

B. Massé, S. Ba, and R. Horaud, Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
DOI : 10.1109/TPAMI.2017.2782819

P. Meer, D. Mintz, A. Rosenfeld, and D. Kim, Robust regression methods for computer vision: A review, International Journal of Computer Vision, vol.53, issue.1, 1991.
DOI : 10.1002/0471725250

P. Mesejo, O. Ibáñez, E. Fernández-Blanco, F. Cedrón, A. Pazos et al., Artificial Neuron-Glia Networks Learning Approach Based on Cooperative Coevolution, International Journal of Neural Systems, vol.25, issue.04, 2015.
DOI : 10.1142/S0129065714400061

URL : https://hal.archives-ouvertes.fr/hal-01221226

S. Miao, Z. J. Wang, and R. Liao, A CNN Regression Approach for Real-Time 2D/3D Registration, IEEE Transactions on Medical Imaging, vol.35, issue.5, 2016.
DOI : 10.1109/TMI.2016.2521800

D. Mishkin, N. Sergievskiy, and J. Matas, Systematic evaluation of convolution neural network advances on the Imagenet, Computer Vision and Image Understanding, vol.161, pp.11-19, 2017.
DOI : 10.1016/j.cviu.2017.05.007

URL : http://arxiv.org/pdf/1606.02228

N. Mitsunaga, C. Smith, T. Kanda, H. Ishiguro, and N. Hagita, Robot behavior adaptation for human-robot interaction based on policy gradient reinforcement learning, JRSJ, 2006.
DOI : 10.1109/iros.2005.1545206

URL : http://kth.diva-portal.org/smash/get/diva2:436245/FULLTEXT01

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing Atari With Deep Reinforcement Learning, NIPS Deep Learning Workshop, 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness et al., Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, 2015.
DOI : 10.1038/nature14236

S. S. Mukherjee and N. M. Robertson, Deep Head Pose: Gaze-Direction Estimation in Multimodal Video, IEEE Transactions on Multimedia, vol.17, issue.11, 2015.
DOI : 10.1109/TMM.2015.2482819

URL : http://ieeexplore.ieee.org:80/stamp/stamp.jsp?tp=&arnumber=7279167

K. Murphy, A. Torralba, and W. Freeman, Using the forest to see the trees: a graphical model relating features, objects and scenes, NIPS, 2003.

K. P. Murphy, Machine learning: a probabilistic perspective, 2012.

E. Murphy-chutorian and M. Trivedi, Head Pose Estimation in Computer Vision: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, issue.4, 2009.
DOI : 10.1109/TPAMI.2008.106

URL : http://cvrr.ucsd.edu/publications/2008/MurphyChutorian_Trivedi_PAMI08.pdf

M. Nabi, A. Del Bue, and V. Murino, Temporal Poselets for Collective Activity Detection and Recognition, 2013 IEEE International Conference on Computer Vision Workshops, 2013.
DOI : 10.1109/ICCVW.2013.71

URL : http://haci2013.umiacs.umd.edu/papers/NabiHACI2013.pdf

C. Nebauer, Evaluation of convolutional neural networks for visual recognition, IEEE Transactions on Neural Networks, vol.9, issue.4, pp.685-696, 1998.
DOI : 10.1109/72.701181

P. Nemenyi, Distribution-free multiple comparisons, 1963.

R. Neuneier and H. G. Zimmermann, How to train neural networks, Neural Networks: Tricks of the Trade, 1998.
DOI : 10.1007/978-3-642-35289-8_23

N. Neykov, P. Filzmoser, R. Dimova, and P. Neytchev, Robust fitting of mixtures using the trimmed likelihood estimator, CSDA, 2007.
DOI : 10.1016/j.csda.2006.12.024

R. Nuzzo, Scientific method: Statistical errors, Nature, vol.506, issue.7487, pp.150-152, 2014.
DOI : 10.1038/506150a

URL : http://www.nature.com:80/polopoly_fs/1.14700!/menu/main/topColumns/topLeftColumn/pdf/506150a.pdf

S. Odashima, M. Shimosaka, T. Kaneko, R. Fukui, and T. Sato, Collective Activity Localization with Contextual Spatial Pyramid, ECCV, 2012.
DOI : 10.1007/978-3-642-33885-4_25

M. Osadchy, Y. LeCun, and M. L. Miller, Synergistic face detection and pose estimation with energy-based models, JMLR, 2007.
DOI : 10.1007/11957959_10

W. Ouyang, X. Chu, and X. Wang, Multi-source Deep Learning for Human Pose Estimation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.2329-2336, 2014.
DOI : 10.1109/CVPR.2014.299

E. Perthame, F. Forbes, and A. Deleforge, Inverse regression approach to robust non-linear high-to-low dimensional mapping
DOI : 10.1016/j.jmva.2017.09.009

URL : https://hal.archives-ouvertes.fr/hal-01347455

L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, Articulated people detection and pose estimation: Reshaping the future, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.3178-3185, 2012.
DOI : 10.1109/CVPR.2012.6248052

URL : http://www.informatik.uni-marburg.de/~thormae/paper/CVPR12.pdf

R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing, vol.28, issue.6, 2010.
DOI : 10.1016/j.imavis.2009.11.014

A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro, Robot gains social intelligence through multimodal deep reinforcement learning, 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), 2016.
DOI : 10.1109/HUMANOIDS.2016.7803357

URL : http://arxiv.org/pdf/1702.07492

A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro, Show, attend and interact: Perceivable human-robot social interaction through neural attention Q-network, IEEE ICRA, 2017.

A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, Objects in Context, 2007 IEEE 11th International Conference on Computer Vision, 2007.
DOI : 10.1109/ICCV.2007.4408986

D. Ramanan, Learning to parse images of articulated bodies, NIPS, 2007.

R. Ranjan, V. M. Patel, and R. Chellappa, HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
DOI : 10.1109/TPAMI.2017.2781233

URL : http://arxiv.org/pdf/1603.01249

S. Ren, K. He, R. B. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE TPAMI, vol.39, pp.1137-1149, 2015.

G. Riegler, D. Ferstl, M. Ruther, and H. Bischof, Hough Networks for Head Pose Estimation and Facial Feature Localization, Proceedings of the British Machine Vision Conference 2014, 2014.
DOI : 10.5244/C.28.66

URL : http://www.bmva.org/bmvc/2014/files/abstract039.pdf

D. Rom, A sequentially rejective test procedure based on a modified Bonferroni inequality, Biometrika, vol.77, issue.3, pp.663-665, 1990.
DOI : 10.1093/biomet/77.3.663

M. Rothbucher, C. Denk, and K. Diepold, Robotic gaze control using reinforcement learning, 2012 IEEE International Workshop on Haptic Audio Visual Environments and Games (HAVE 2012) Proceedings, 2012.
DOI : 10.1109/HAVE.2012.6374444

R. Rothe, R. Timofte, and L. Van Gool, Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks, International Journal of Computer Vision, vol.30, issue.6, 2016.
DOI : 10.1109/ICCVW.2015.43

P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection, 2005.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation, 1985.

M. S. Ryoo and J. K. Aggarwal, Recognition of composite human activities through context-free grammar based representation, CVPR, 2006.

M. S. Ryoo and J. K. Aggarwal, Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities, ICCV, 2009.

M. Ryoo and J. Aggarwal, Stochastic representation and recognition of high-level group activities, IJCV, 2011.
DOI : 10.1007/s11263-010-0355-5

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image Classification with the Fisher Vector: Theory and Practice, International Journal of Computer Vision, vol.73, issue.2, 2013.
DOI : 10.1007/s11263-006-9794-4

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition (ICPR), 2004.
DOI : 10.1109/ICPR.2004.1334462

URL : http://www.nada.kth.se/%7Ecaputo/publik/icpr04actions.pdf

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., Overfeat: Integrated recognition, localization and detection using convolutional networks, ICLR, 2014.

A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe, Deformable GANs for pose-based human image generation, IEEE CVPR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01761539

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.

L. N. Smith and N. Topin, Deep Convolutional Neural Network Design Patterns, CoRR, abs, 1611.

A. J. Smola and B. Schölkopf, A tutorial on support vector regression, Stat Comput, 2004.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, JMLR, vol.15, pp.1929-1958, 2014.

E. Stepanov, S. Lathuilière, S. A. Chowdhury, A. Ghosh, R.-L. Vieriu et al., Depression severity estimation from multiple modalities, 2018.

J. A. C. Sterne, D. R. Cox, and G. Davey Smith, Sifting the evidence: what's wrong with significance tests? Another comment on the role of statistical methods, BMJ, vol.322, issue.7280, pp.226-231, 2001.

C. V. Stewart, Robust parameter estimation in computer vision, SIAM Review, 1999.

L. Sun, H. Ai, and S. Lao, Activity Group Localization by Modeling the Relations among Participants, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_48

URL : http://media.cs.tsinghua.edu.cn/%7Eimagevision/papers/eccv14-sunlei-86890741.pdf

Y. Sun, X. Wang, and X. Tang, Deep Convolutional Network Cascade for Facial Point Detection, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.446

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, ICML, pp.1139-1147, 2013.

R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1998.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298594

URL : http://arxiv.org/pdf/1409.4842

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2818-2826, 2016.
DOI : 10.1109/CVPR.2016.308

URL : http://arxiv.org/pdf/1512.00567

N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall et al., Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning?, IEEE Transactions on Medical Imaging, vol.35, issue.5, pp.1299-1312, 2016.
DOI : 10.1109/TMI.2016.2535302

URL : http://arxiv.org/pdf/1706.00712

A. L. Thomaz, G. Hoffman, and C. Breazeal, Reinforcement learning with human teachers: Understanding how people want to teach robots, IEEE RO-MAN, 2006.

T. Tieleman and G. Hinton, RMSProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning, pp.26-31, 2012.

A. Toshev and C. Szegedy, DeepPose: Human Pose Estimation via Deep Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.214

URL : http://arxiv.org/pdf/1312.4659

P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, Machine Recognition of Human Activities: A Survey, IEEE Transactions on Circuits and Systems for Video Technology, vol.18, issue.11, 2008.
DOI : 10.1109/TCSVT.2008.2005594

URL : http://www.cfar.umd.edu/%7Erama/Publications/Turaga_CSVT_2008.pdf

M. Vázquez, A. Steinfeld, and S. E. Hudson, Maintaining awareness of the focus of attention of a conversation: A robot-centric reinforcement learning approach, 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), p.137, 2016.
DOI : 10.1109/ROMAN.2016.7745088

B. Vidgen and T. Yasseri, P-Values: Misunderstood and Misused, Frontiers in Physics, vol.4, 2016.
DOI : 10.3389/fphy.2016.00006

URL : http://journal.frontiersin.org/article/10.3389/fphy.2016.00006/pdf

B. Wang, W. Liang, Y. Wang, and Y. Liang, Head Pose Estimation with Combined 2D SIFT and 3D HOG Features, 2013 Seventh International Conference on Image and Graphics, 2013.
DOI : 10.1109/ICIG.2013.133

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin, pp.80-83, 1945.
DOI : 10.2307/3001968

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 1992.
DOI : 10.1007/978-1-4615-3618-5_2

URL : http://www.cs.ualberta.ca/~sutton/williams-92.pdf

T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, Learning from massive noisy labeled data for image classification, CVPR, 2015.

X. Xiong and F. De la Torre, Supervised Descent Method and Its Applications to Face Alignment, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.75

URL : http://www.ri.cmu.edu/pub_files/2013/5/main.pdf

S. Yan, H. Wang, X. Tang, and T. S. Huang, Learning Auto-Structured Regressor from Uncertain Nonnegative Labels, 2007 IEEE 11th International Conference on Computer Vision, pp.1-8, 2007.
DOI : 10.1109/ICCV.2007.4409050

URL : http://www.lv-nus.org/papers/2007/2007_c_2.pdf

H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson, Face alignment assisted by head pose estimation, BMVC, 2015.

Y. Yang and D. Ramanan, Articulated Human Detection with Flexible Mixtures of Parts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.12, pp.2878-2890, 2013.
DOI : 10.1109/TPAMI.2012.261

URL : http://www.ics.uci.edu/~dramanan/papers/pose_pami.pdf

B. Yao and L. Fei-Fei, Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses, IEEE TPAMI, 2012.

X. Yao, Evolving artificial neural networks, Proceedings of the IEEE, vol.87, issue.9, pp.1423-1447, 1999.

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How Transferable Are Features in Deep Neural Networks?, NIPS, pp.3320-3328, 2014.

S. X. Yu and J. Shi, Multiclass spectral clustering, ICCV, 2003.

S. Yun, SUMMARY, Robotica, vol.5, issue.11, pp.2122-2138, 2017.
DOI : 10.1016/j.patrec.2010.09.011

M. D. Zeiler, Adadelta: an adaptive learning rate method, arXiv preprint, 2012.

X. Zhen, Z. Wang, A. Islam, M. Bhaduri, I. Chan et al., Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation, Medical Image Analysis, vol.30, 2016.
DOI : 10.1016/j.media.2015.07.003

X. Zhu and D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, CVPR, pp.2879-2886, 2012.