In the case of VGG-16, the 2D feature maps of CB5 are converted via a flattening layer Fl that does not reduce the dimension. Alternatively, we could replace Fl by global max pooling or global average pooling, denoted GMP and GAP respectively. ResNet-50 already uses GAP, hence we can only compare it to GMP. The models are denoted GAP and GMP.
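The difference between flattening and global pooling can be illustrated with a minimal sketch in plain Python. The toy feature map and the function names below are hypothetical and serve only to show the output sizes; this is not the implementation used in the experiments.

```python
# Toy feature map with C = 2 channels, each an H x W = 2 x 2 grid (hypothetical values).
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, -1.0], [5.0, 0.0]],
]

def flatten(fmap):
    """Flattening keeps every activation: the output has C * H * W entries."""
    return [v for channel in fmap for row in channel for v in row]

def global_average_pooling(fmap):
    """GAP: one value per channel, the mean over all spatial positions."""
    pooled = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def global_max_pooling(fmap):
    """GMP: one value per channel, the maximum over all spatial positions."""
    return [max(v for row in channel for v in row) for channel in fmap]

flat = flatten(feature_map)                # 8 values, no reduction
gap = global_average_pooling(feature_map)  # 2 values, one per channel
gmp = global_max_pooling(feature_map)      # 2 values, one per channel
```

With flattening, the number of inputs to the next layer grows with the spatial size of the feature map, whereas GAP and GMP always yield one value per channel.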

show results obtained with various fine-tuning depth values, as described in section 6.5, both for VGG-16 and for ResNet-50. In the case of VGG-16, we

CHAPTER 6. A COMPREHENSIVE ANALYSIS OF DEEP REGRESSION

Table 6.6: Impact of the regressed layer (RL) when using VGG-16.

Table 6.10: Impact of the data pre-processing on VGG-16 and ResNet-50

Biwi: The baseline for the Biwi data set is inspired from

S. Lathuilière, G. Evangelidis, and R. Horaud, Recognition of group activities in videos based on single- and two-person descriptors, IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.

S. Lathuilière, R. Juge, P. Mesejo, R. Muñoz-Salinas, and R. Horaud, Deep mixture of linear inverse regressions applied to head-pose estimation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

S. Lathuilière, B. Massé, P. Mesejo, and R. Horaud, Deep Reinforcement Learning for Audio-Visual Servoing in Human-Robot Interaction, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.

S. Lathuilière, B. Massé, P. Mesejo, and R. Horaud, Neural Network-based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction, Pattern Recognition Letters, 2018.

S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, DeepGUM: Deep Robust Regression with Gaussian-Uniform Mixtures, IEEE European Conference on Computer Vision (ECCV), 2018.

S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, A Comprehensive Analysis of Deep Regression

A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe, Deformable GANs for pose-based human image generation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01761539

E. Stepanov, S. Lathuilière, S. A. Chowdhury, A. Ghosh et al., Depression severity estimation from multiple modalities, IEEE International Conference on E-health Networking

B. Ahn, J. Park, and I. S. Kweon, Real-time head orientation from a monocular camera using deep neural network, ACCV, 2014.

M. R. Amer and S. Todorovic, A chains model for localizing participants of group activities in videos, ICCV, 2011.

M. R. Amer, P. Lei, and S. Todorovic, HIRF: Hierarchical random field for collective activity recognition in videos, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_37
URL : http://web.engr.oregonstate.edu/~sinisa/research/publications/eccv14_HiRF.pdf

M. Andriluka, S. Roth, and B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, CVPR, pp.1014-1021, 2009.
DOI : 10.1109/cvprw.2009.5206754
URL : http://www.gris.informatik.tu-darmstadt.de/~sroth/pubs/cvpr09andriluka.pdf

F. Badeig, Q. Pelorson, S. Arias, V. Drouard, I. Gebru et al., A distributed architecture for interacting with NAO, ACM ICMI, 2015.

Y. Ban, X. Alameda-pineda, F. Badeig, S. Ba, and R. Horaud, Tracking a varying number of people with a visually-controlled robotic head, IEEE/RSJ IROS, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01542987

J. D. Banfield and A. E. Raftery, Model-based Gaussian and non-Gaussian clustering, Biometrics, 1993.

A. J. Bekker and J. Goldberger, Training deep neural-networks based on unreliable labels, ICASSP, 2016.
DOI : 10.1109/icassp.2016.7472164

V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, Robust optimization for deep regression, ICCV, 2015.

G. Beliakov, A. V. Kelarev, and J. Yearwood, Robust artificial neural networks and outlier detection, 2011.

Y. Bengio, Practical recommendations for gradient-based training of deep architectures, Neural networks: Tricks of the trade, pp.437-478, 2012.
DOI : 10.1007/978-3-642-35289-8_26
URL : http://arxiv.org/pdf/1206.5533.pdf

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw, 1994.
DOI : 10.1109/72.279181
URL : http://www.research.microsoft.com/~patrice/PDF/long_term.pdf

M. Bennewitz, F. Faber, D. Joho, M. Schreiber, and S. Behnke, Towards a humanoid museum guide robot that interacts with multiple persons, IEEE-RAS, pp.418-423, 2005.
DOI : 10.1109/ichr.2005.1573603
URL : http://www.informatik.uni-freiburg.de/~maren/papers/bennewitz_humanoids05.pdf

M. J. Black and A. Rangarajan, On the unification of line processes, outlier rejection, and robust statistics with applications in early vision, IJCV, 1996.

A. Bulat and G. Tzimiropoulos, How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks), ICCV, 2017.

X. P. Burgos-Artizzu, P. Perona, and P. Dollár, Robust face landmark estimation under occlusion, ICCV, pp.1513-1520, 2013.

G. Bustamante, P. Danés, T. Forgue, and A. Podlubne, Towards information-based feedback control for binaural active localization, IEEE ICASSP, 2016.
DOI : 10.1109/icassp.2016.7472894

Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, IEEE CVPR, 2017.
DOI : 10.1109/cvpr.2017.143
URL : http://arxiv.org/pdf/1611.08050

V. Chandrasekhar, J. Lin, O. Morère, H. Goh, and A. Veillard, A practical guide to CNNs and Fisher Vectors for image instance retrieval, Signal Processing, vol.128, pp.426-439, 2016.
DOI : 10.1016/j.sigpro.2016.05.021
URL : http://arxiv.org/pdf/1508.02496

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, 2002.
DOI : 10.1613/jair.953
URL : https://jair.org/index.php/jair/article/download/10302/24590

B.-C. Chen, C.-S. Chen, and W. H. Hsu, Cross-age reference coding for age-invariant face recognition and retrieval, ECCV, 2014.

W. Choi and S. Savarese, A unified framework for multi-target tracking and collective activity recognition, ECCV, 2012.
DOI : 10.1007/978-3-642-33765-9_16
URL : http://www.eecs.umich.edu/vision/papers/choi_eccv_12.pdf

W. Choi, Y. Chao, C. Pantofaru, and S. Savarese, Discovering groups of people in images, ECCV, 2014.
DOI : 10.1007/978-3-319-10593-2_28
URL : http://cvgl.stanford.edu/projects/groupdiscovery/eccv2014choi.pdf

W. Choi and S. Savarese, Understanding collective activities of people from videos, IEEE TPAMI, 2013.
DOI : 10.1109/tpami.2013.220

W. Choi, K. Shahid, and S. Savarese, What are they doing?: Collective activity classification using spatio-temporal relationship among people, ICCV Workshops, 2009.

F. Chollet,

W. Conover, Practical Nonparametric Statistics, John Wiley and Sons, 1998.

P. Coretto and C. Hennig, Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering, JASA, 2016.

A. Cretual and F. Chaumette, Application of motion-based visual servoing to target tracking, IJRR, 2001.

F. Cruz, G. I. Parisi, J. Twiefel, and S. Wermter, Multimodal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario, IEEE/RSJ IROS, 2016.

F. Cupillard, F. Brémond, and M. Thonnat, Group behavior recognition with multiple cameras, WACV, 2002.

N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

M. Dantone, J. Gall, G. Fanelli, and L. Van Gool, Real-time facial feature detection using conditional regression forests, CVPR, pp.2578-2585, 2012.

A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, Co-localization of audio sources in images using binaural features and locally-linear regression, IEEE TASLP, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01112834

A. Deleforge and F. Forbes, Hyper-Spectral Image Analysis with Partially-Latent Regression and Spatial Markov Dependencies, IEEE STSP, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01136465

A. Deleforge, F. Forbes, and R. Horaud, High-Dimensional Regression with Gaussian Mixtures and Partially-Latent Response Variables, Stat Comput, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01107604

M. Demirkus, D. Precup, J. J. Clark, and T. Arbel, Hierarchical temporal graphical model for head pose estimation and subsequent attribute classification in real-world videos, 2015.

Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan et al., Deep structured models for group activity recognition, BMVC, 2015.

J. Derrac, S. García, D. Molina, and F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm and Evolutionary Computation, vol.1, issue.1, pp.3-18, 2011.

V. Drouard, S. Ba, G. Evangelidis, A. Deleforge, and R. Horaud, Head pose estimation via probabilistic high-dimensional regression, ICIP, 2015.
DOI : 10.1109/icip.2015.7351683
URL : https://hal.archives-ouvertes.fr/hal-01163663

V. Drouard, R. Horaud, A. Deleforge, S. Ba, and G. Evangelidis, Robust head-pose estimation based on partially-latent mixture of linear regressions, IEEE TIP, vol.26, issue.3, pp.1428-1440, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01413406

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, JMLR, vol.12, issue.7, pp.2121-2159, 2011.

O. J. Dunn, Multiple comparisons among means, Journal of the American Statistical Association, vol.56, pp.52-64, 1961.
DOI : 10.2307/2282330

D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, Why Does Unsupervised Pre-training Help Deep Learning?, JMLR, vol.11, pp.625-660, 2010.

G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool, Random Forests for Real Time 3D Face Analysis, IJCV, 2013.

G. Fanelli, J. Gall, and L. Van Gool, Real time head pose estimation with random regression forests, CVPR, pp.617-624, 2011.

R. A. Fisher, Statistical methods for research workers, 1925.

F. Forbes and D. Wraith, A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering, Statistics and Computing, 2014.

A. Galimzianova, F. Pernus, B. Likar, and Z. Spiclin, Robust estimation of unbalanced mixture models on samples with outliers, 2015.

J. Gan, L. Li, Y. Zhai, and Y. Liu, Deep self-taught learning for facial beauty prediction, Neurocomputing, 2014.

C. Gaskett, L. Fletcher, and A. Zelinsky, Reinforcement learning for visual servoing of a mobile robot, Australian Conference on Robotics and Automation, 2000.

I. Gebru, S. Ba, X. Li, and R. Horaud, Audio-visual speaker diarization based on spatiotemporal bayesian fusion, IEEE TPAMI, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01413403

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science, 2003.

A. Ghadirzadeh, J. Bütepage, A. Maki, D. Kragic, and M. Björkman, A sensorimotor reinforcement learning framework for physical Human-Robot Interaction, IEEE/RSJ IROS, 2016.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR, 2014.

I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, 2016.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, IEEE TNNLS, vol.28, issue.10, pp.2222-2232, 2017.

G. Guo, Y. Fu, C. R. Dyer, and T. Huang, Image-based human age estimation by manifold learning and locally adjusted robust regression, IEEE TIP, vol.17, issue.7, pp.1178-1188, 2008.

A. Gupta, A. Kembhavi, and L. Davis, Observing human-object interactions: Using spatial and functional compatibility for recognition, IEEE TPAMI, 2009.

H. Hajimirsadeghi, W. Yan, A. Vahdat, and G. Mori, Visual recognition by counting instances: A multi-instance cardinality potential kernel, CVPR, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, pp.770-778, 2016.

G. Hinton and R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, vol.313, issue.5786, pp.504-507, 2006.

Y. Hochberg, A Sharper Bonferroni Procedure for Multiple Tests of Significance, Biometrika, vol.75, issue.4, pp.800-802, 1988.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997.

D. Hoiem, A. A. Efros, and M. Hebert, Putting objects in perspective, IJCV, 2008.

B. S. Holland, An improved sequentially rejective Bonferroni test procedure, Biometrics, vol.43, pp.417-423, 1987.

S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, vol.6, issue.2, pp.65-70, 1979.

G. Hommel, A stagewise rejective multiple test procedure based on a modified Bonferroni test, Biometrika, vol.75, issue.2, pp.383-386, 1988.

P. J. Huber, Robust Statistics, 2004.

M. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, A hierarchical deep temporal model for group activity recognition, CVPR, 2016.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

V. K. Ithapu, S. N. Ravi, and V. Singh, On architectural choices in deep learning: From network structure to gradient convergence and parameter estimation, 2017.

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, Reading text in the wild with convolutional neural networks, IJCV, 2016.

M. Jain, J. C. van Gemert, and C. G. Snoek, What do 15,000 object categories tell us about classifying and localizing actions?, CVPR, 2015.

S. Johnson and M. Everingham, Clustered pose and nonlinear appearance models for human pose estimation, BMVC, 2010.

S. Khamis, V. I. Morariu, and L. S. Davis, Combining per-frame and per-track cues for multi-person action recognition, ECCV, 2012.

H. J. Kim, B. M. Smith, N. Adluru, C. R. Dyer, S. C. Johnson et al., Abundant Inverse Regression Using Sufficient Reduction and Its Applications, ECCV, 2016.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2014.

H. Kjellström, J. Romero, D. Martínez, and D. Kragić, Simultaneous visual recognition of manipulation actions and manipulated objects, ECCV, 2008.

J. Kober, J. A. Bagnell, and J. Peters, Reinforcement learning in robotics: A survey, IJRR, 2013.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012.

T. Lan, Y. Wang, G. Mori, and S. Robinovitch, Retrieving actions in group contexts, Trends and Topics in Computer Vision, 2010.

T. Lan, Y. Wang, W. Yang, and G. Mori, Beyond actions: Discriminative models for contextual group activities, NIPS, 2010.

T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori, Discriminative latent models for recognizing contextual group activities, IEEE TPAMI, 2012.

Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition, CVPR, 2015.

S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, A comprehensive analysis of deep regression, 2018.

S. Lathuilière, G. Evangelidis, and R. Horaud, Recognition of group activities in videos based on single- and two-person descriptors, IEEE WACV, 2017.

S. Lathuilière, R. Juge, P. Mesejo, R. Muñoz-Salinas, and R. Horaud, Deep mixture of linear inverse regressions applied to head-pose estimation, IEEE CVPR, 2017.

S. Lathuilière, B. Massé, P. Mesejo, and R. Horaud, Deep reinforcement learning for audio-visual servoing in human-robot interaction, 2017.

S. Lathuilière, B. Massé, P. Mesejo, and R. Horaud, Neural network-based reinforcement learning for audio-visual gaze control in human-robot interaction, Pattern Recognition Letters, 2017.

S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud, DeepGUM: Deep robust regression with Gaussian-uniform mixtures, IEEE ECCV, 2018.

Y. LeCun, L. Bottou, G. B. Orr, and K. Müller, Efficient BackProp, Neural Networks: Tricks of the Trade, pp.9-50, 1998.

K. Li, Sliced inverse regression for dimension reduction, J Am Stat Assoc, 1991.

R. Li, R. Chellappa, and S. K. Zhou, Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition, CVPR, 2009.

X. Li, L. Zhao, L. Wei, M. Yang, F. Wu et al., Deepsaliency: Multi-task deep neural network model for salient object detection, IEEE TIP, 2016.

X. Li, L. Girin, F. Badeig, and R. Horaud, Reverberant sound localization with a robot head based on direct-path relative transfer function, IEEE/RSJ IROS, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01349771

X. Li, L. Girin, R. Horaud, and S. Gannot, Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization, IEEE/ACM TASLP, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01413417

Y. Li, J. Yang, Y. Song, L. Cao, J. Luo et al., Learning from Noisy Labels with Distillation, 2017.

X. Liu, W. Liang, Y. Wang, S. Li, and M. Pei, 3D head pose estimation with convolutional neural network trained on synthetic images, ICIP, 2016.

Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, Deepfashion: Powering robust clothes recognition and retrieval with rich annotations, CVPR, 2016.

Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang, Fashion Landmark Detection in the Wild, ECCV, 2016.

A. Magassouba, N. Bertin, and F. Chaumette, Aural servo: sensor-based control from robot audition, IEEE TRO, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01694366

R. A. Maronna, R. D. Martin, and V. J. Yohai, Robust statistics, 2006.

M. Marszalek, I. Laptev, and C. Schmid, Actions in context, CVPR, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00548645

B. Massé, S. Ba, and R. Horaud, Tracking gaze and visual focus of attention of people involved in social interaction, IEEE TPAMI, 2017.

P. Meer, D. Mintz, A. Rosenfeld, and D. Kim, Robust regression methods for computer vision: A review, IJCV, 1991.

P. Mesejo, O. Ibáñez, E. Fernández-Blanco, F. Cedrón, A. Pazos, and A. B. Porto-Pazos, Artificial neuron-glia networks learning approach based on cooperative coevolution, International Journal of Neural Systems, vol.25, issue.04, 2015.

S. Miao, Z. J. Wang, and R. Liao, A CNN Regression Approach for Real-Time 2D/3D Registration, IEEE Trans. Med. Imag, 2016.

D. Mishkin, N. Sergievskiy, and J. Matas, Systematic evaluation of convolution neural network advances on the ImageNet, Computer Vision and Image Understanding, vol.161, pp.11-19, 2017.

N. Mitsunaga, C. Smith, T. Kanda, H. Ishiguro, and N. Hagita, Robot behavior adaptation for human-robot interaction based on policy gradient reinforcement learning, JRSJ, 2006.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou et al., Playing Atari With Deep Reinforcement Learning, NIPS Deep Learning Workshop, 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness et al., Human-level control through deep reinforcement learning, Nature, 2015.

S. S. Mukherjee and N. M. Robertson, Deep Head Pose: Gaze-Direction Estimation in Multimodal Video, IEEE MM, 2015.
DOI : 10.1109/tmm.2015.2482819
URL : https://doi.org/10.1109/tmm.2015.2482819

K. Murphy, A. Torralba, and W. Freeman, Using the forest to see the trees: a graphical model relating features, objects and scenes, NIPS, 2003.

K. P. Murphy, Machine learning: a probabilistic perspective, 2012.

E. Murphy-chutorian and M. Trivedi, Head pose estimation in computer vision: A survey, IEEE Trans. Pattern Anal. Mach. Intell, 2009.
DOI : 10.1109/tpami.2008.106
URL : http://cvrr.ucsd.edu/publications/2008/MurphyChutorian_Trivedi_PAMI08.pdf

M. Nabi, A. Del Bue, and V. Murino, Temporal poselets for collective activity detection and recognition, ICCV Workshops, 2013.
DOI : 10.1109/iccvw.2013.71
URL : http://haci2013.umiacs.umd.edu/papers/NabiHACI2013.pdf

C. Nebauer, Evaluation of convolutional neural networks for visual recognition, IEEE TNN, vol.9, issue.4, pp.685-696, 1998.
DOI : 10.1109/72.701181

P. Nemenyi, Distribution-free multiple comparisons, 1963.

R. Neuneier and H. G. Zimmermann, How to train neural networks, Neural Networks: Tricks of the Trade, 1998.
DOI : 10.1007/3-540-49430-8_18

N. Neykov, P. Filzmoser, R. Dimova, and P. Neytchev, Robust fitting of mixtures using the trimmed likelihood estimator, CSDA, 2007.

R. Nuzzo, Scientific method: Statistical errors, Nature, vol.506, issue.7487, pp.150-152, 2014.
DOI : 10.1038/506150a
URL : http://www.nature.com:80/polopoly_fs/1.14700!/menu/main/topColumns/topLeftColumn/pdf/506150a.pdf

S. Odashima, M. Shimosaka, T. Kaneko, R. Fukui, and T. Sato, Collective activity localization with contextual spatial pyramid, ECCV, 2012.
DOI : 10.1007/978-3-642-33885-4_25

M. Osadchy, Y. LeCun, and M. L. Miller, Synergistic face detection and pose estimation with energy-based models, 2007.
DOI : 10.1007/11957959_10

W. Ouyang, X. Chu, and X. Wang, Multi-source deep learning for human pose estimation, CVPR, pp.2329-2336, 2014.
DOI : 10.1109/cvpr.2014.299

E. Perthame, F. Forbes, and A. Deleforge, Inverse regression approach to robust non-linear high-to-low dimensional mapping, INRIA, 2016.
DOI : 10.1016/j.jmva.2017.09.009
URL : https://hal.archives-ouvertes.fr/hal-01347455

L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, Articulated people detection and pose estimation: Reshaping the future, CVPR, pp.3178-3185, 2012.
DOI : 10.1109/cvpr.2012.6248052
URL : http://www.informatik.uni-marburg.de/~thormae/paper/CVPR12.pdf

R. Poppe, A survey on vision-based human action recognition, IVC, 2010.
DOI : 10.1016/j.imavis.2009.11.014

A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro, Robot gains social intelligence through multimodal deep reinforcement learning, IEEE Humanoids, 2016.
DOI : 10.1109/humanoids.2016.7803357
URL : http://arxiv.org/pdf/1702.07492

A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro, Show, attend and interact: Perceivable human-robot social interaction through neural attention Q-network, IEEE ICRA, 2017.

A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, Objects in context, ICCV, 2007.

D. Ramanan, Learning to parse images of articulated bodies, NIPS, 2007.

R. Ranjan, V. M. Patel, and R. Chellappa, HyperFace: A deep multitask learning framework for face detection, landmark localization, pose estimation, and gender recognition, 2016.

S. Ren, K. He, R. B. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE TPAMI, vol.39, pp.1137-1149, 2015.

G. Riegler, D. Ferstl, M. Ruther, and H. Bischof, Hough Networks for Head Pose Estimation and Facial Feature Localization, BMVC, 2014.

G. Rogez, P. Weinzaepfel, and C. Schmid, LCR-Net: Localization-classification-regression for human pose, CVPR, 2017.

D. Rom, A sequentially rejective test procedure based on a modified Bonferroni inequality, Biometrika, vol.77, pp.663-665, 1990.

M. Rothbucher, C. Denk, and K. Diepold, Robotic gaze control using reinforcement learning, IEEE HAVE, 2012.

R. Rothe, R. Timofte, and L. Van Gool, Deep expectation of real and apparent age from a single image without facial landmarks, IJCV, 2016.

P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection, 2005.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation, 1985.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, IJCV, 2015.

M. S. Ryoo and J. K. Aggarwal, Recognition of composite human activities through context-free grammar based representation, CVPR, 2006.

M. S. Ryoo and J. K. Aggarwal, Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities, ICCV, 2009.

M. S. Ryoo, Stochastic representation and recognition of high-level group activities, IJCV, 2011.

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image classification with the fisher vector: Theory and practice, 2013.

S. Saxena and J. Verbeek, Convolutional neural fabrics, NIPS, pp.4053-4061, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01359150

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, ICPR, 2004.

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., Overfeat: Integrated recognition, localization and detection using convolutional networks, ICLR, 2014.

A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe, Deformable GANs for pose-based human image generation, IEEE CVPR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01761539

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

L. N. Smith and N. Topin, Deep Convolutional Neural Network Design Patterns, 2016.

A. J. Smola and B. Schölkopf, A tutorial on support vector regression, Stat Comput, 2004.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, JMLR, vol.15, pp.1929-1958, 2014.

E. Stepanov, S. Lathuilière, S. A. Chowdhury, A. Ghosh et al., Depression severity estimation from multiple modalities, 2018.

J. Sterne, D. Cox, and G. Smith, Sifting the evidence-what's wrong with significance tests? Another comment on the role of statistical methods, BMJ, vol.322, issue.7280, pp.226-231, 2001.

C. V. Stewart, Robust parameter estimation in computer vision, SIAM Review, 1999.

L. Sun, H. Ai, and S. Lao, Activity group localization by modeling the relations among participants, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_48
URL : http://media.cs.tsinghua.edu.cn/%7Eimagevision/papers/eccv14-sunlei-86890741.pdf

Y. Sun, X. Wang, and X. Tang, Deep convolutional network cascade for facial point detection, CVPR, 2013.
DOI : 10.1109/cvpr.2013.446

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, ICML, pp.1139-1147, 2013.

R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1998.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7298594
URL : http://arxiv.org/pdf/1409.4842

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, CVPR, pp.2818-2826, 2016.

N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall et al., Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning?, IEEE TMI, vol.35, issue.5, pp.1299-1312, 2016.
DOI : 10.1109/tmi.2016.2535302
URL : http://arxiv.org/pdf/1706.00712

A. L. Thomaz, G. Hoffman, and C. Breazeal, Reinforcement learning with human teachers: Understanding how people want to teach robots, IEEE ROMAN, 2006.

T. Tieleman and G. Hinton, RMSProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, vol.4, pp.26-31, 2012.

A. Toshev and C. Szegedy, DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR, 2014.
DOI : 10.1109/cvpr.2014.214
URL : http://arxiv.org/pdf/1312.4659

P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, Machine recognition of human activities: A survey, IEEE TCSVT, p.139, 2008.
DOI : 10.1109/tcsvt.2008.2005594
URL : http://www.cfar.umd.edu/%7Erama/Publications/Turaga_CSVT_2008.pdf

M. Vázquez, A. Steinfeld, and S. E. Hudson, Maintaining awareness of the focus of attention of a conversation: A robot-centric reinforcement learning approach, 2016.

B. Vidgen and T. Yasseri, P-values: Misunderstood and misused, Frontiers in Physics, vol.4, issue.6, 2016.
DOI : 10.3389/fphy.2016.00006
URL : https://www.frontiersin.org/articles/10.3389/fphy.2016.00006/pdf

B. Wang, W. Liang, Y. Wang, and Y. Liang, Head pose estimation with combined 2D SIFT and 3D HOG features, ICIG, 2013.

H. Wang and C. Schmid, Action recognition with improved trajectories, ICCV, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873267

C. J. C. H. Watkins and P. Dayan, Q-learning, Machine Learning, 1992.

F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin, pp.80-83, 1945.

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 1992.

T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, Learning from massive noisy labeled data for image classification, CVPR, 2015.

L. Xie and A. L. Yuille, Genetic CNN, CVPR, 2017.

X. Xiong and F. De la Torre, Supervised descent method and its applications to face alignment, CVPR, 2013.

S. Yan, H. Wang, X. Tang, and T. S. Huang, Learning auto-structured regressor from uncertain nonnegative labels, CVPR, pp.1-8, 2007.

H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson, Face alignment assisted by head pose estimation, BMVC, 2015.

Y. Yang and D. Ramanan, Articulated human detection with flexible mixtures of parts, IEEE TPAMI, vol.35, issue.12, pp.2878-2890, 2013.

B. Yao and L. Fei-Fei, Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses, IEEE TPAMI, 2012.

X. Yao, Evolving artificial neural networks, Proceedings of the IEEE, vol.87, issue.9, pp.1423-1447, 1999.

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How Transferable Are Features in Deep Neural Networks?, NIPS, pp.3320-3328, 2014.

S. X. Yu and J. Shi, Multiclass spectral clustering, ICCV, 2003.

S. Yun, A gaze control of socially interactive robots in multiple-person interaction, Robotica, vol.35, issue.11, pp.2122-2138, 2017.

M. D. Zeiler, ADADELTA: an adaptive learning rate method, 2012.

X. Zhen, Z. Wang, A. Islam, M. Bhaduri, I. Chan et al., Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation, Med Image Anal, 2016.

X. Zhu and D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, CVPR, pp.2879-2886, 2012.