From images and sounds to face localization and tracking : a switching dynamical Bayesian framework

Vincent Drouard 1
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract : In this thesis, we address the well-known problem of head-pose estimation in the context of human-robot interaction (HRI). We accomplish this task with a two-step approach. First, we focus on estimating the head pose from visual features. We design features that can represent the face under different orientations and at various image resolutions. The result is a high-dimensional representation of a face from an RGB image. Inspired by [Deleforge 15], we propose to solve the head-pose estimation problem by building a link between the head-pose parameters and the high-dimensional features perceived by a camera. This link is learned using a high-to-low probabilistic regression built on a probabilistic mixture of affine transformations. Compared with classic head-pose estimation methods, we extend the head-pose parameters with additional variables that account for variability in the observations (e.g. a misaligned face bounding box), to obtain a method that is robust under realistic conditions. Evaluation shows that our approach achieves better results than classic regression methods and results similar to those of state-of-the-art head-pose methods that use additional cues (e.g. depth information). Second, we propose a temporal model that uses a tracker's ability to combine information from both the present and the past. Our aim is to produce a smoother estimate and to correct oscillations between two consecutive independent observations. The proposed approach embeds the previous regression into a temporal filtering framework. This extension belongs to the family of switching dynamic models and keeps all the advantages of the mixture of affine regressions. Overall, the proposed tracker gives a more accurate and smoother estimation of the head pose over a video sequence. In addition, the proposed switching dynamic model gives better results than standard tracking models such as the Kalman filter.
Although applied here to head-pose estimation, the methodology presented in this thesis is general and can be used to solve various regression and tracking problems; for example, we also applied it to tracking a sound source in an image.
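To make the idea of regression through a mixture of affine transformations concrete, here is a minimal sketch of the prediction step only. It is not the thesis implementation: the function name, the parameter layout, and the use of diagonal Gaussian gates in feature space are all assumptions made for illustration, and the parameters are taken as already learned. Each component k holds a local affine map from the high-dimensional feature x to the low-dimensional pose, and the prediction is the average of the local maps weighted by each component's responsibility for x.

```python
import numpy as np


def mixture_affine_predict(x, weights, centers, variances, A, b):
    """Predict a low-dimensional output from a high-dimensional input x.

    Hypothetical parameter layout (assumed already learned):
      x         : (D,)      input feature vector
      weights   : (K,)      mixture priors, summing to 1
      centers   : (K, D)    Gaussian gate means in input space
      variances : (K, D)    diagonal Gaussian gate variances
      A         : (K, L, D) local affine maps
      b         : (K, L)    local offsets
    Returns the (L,) prediction and the (K,) responsibilities.
    """
    diff = x - centers
    # Log-density of x under each diagonal Gaussian gate.
    log_gauss = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=1
    )
    log_r = np.log(weights) + log_gauss
    # Subtract the maximum before exponentiating for numerical stability.
    log_r -= log_r.max()
    r = np.exp(log_r)
    r /= r.sum()
    # Output of each local affine map, then responsibility-weighted average.
    local = np.einsum("kld,d->kl", A, x) + b
    return r @ local, r
```

When one component dominates the responsibilities, the prediction reduces to that component's affine map; near the boundary between components, the outputs are blended smoothly.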

Cited literature [51 references]

https://tel.archives-ouvertes.fr/tel-01667740
Contributor : Abes Star
Submitted on : Thursday, September 27, 2018 - 11:22:07 PM
Last modification on : Saturday, December 29, 2018 - 1:14:01 AM

File

DROUARD_2017_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01667740, version 2

Citation

Vincent Drouard. From images and sounds to face localization and tracking : a switching dynamical Bayesian framework. Computer Vision and Pattern Recognition [cs.CV]. Université Grenoble Alpes, 2017. English. ⟨NNT : 2017GREAM094⟩. ⟨tel-01667740v2⟩

Metrics

Record views : 291
Files downloads : 98