Gaze direction in the context of social human-robot interaction

Benoît Massé 1, 2
2 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract : Robots are more and more used in a social context. They are required notonly to share physical space with humans but also to interact with them. Inthis context, the robot is expected to understand some verbal and non-verbalambiguous cues, constantly used in a natural human interaction. In particular,knowing who or what people are looking at is a very valuable information tounderstand each individual mental state as well as the interaction dynamics. Itis called Visual Focus of Attention or VFOA. In this thesis, we are interestedin using the inputs from an active humanoid robot – participating in a socialinteraction – to estimate who is looking at whom or what.On the one hand, we want the robot to look at people, so it can extractmeaningful visual information from its video camera. We propose a novelreinforcement learning method for robotic gaze control. The model is basedon a recurrent neural network architecture. The robot autonomously learns astrategy for moving its head (and camera) using audio-visual inputs. It is ableto focus on groups of people in a changing environment.On the other hand, information from the video camera images are used toinfer the VFOAs of people along time. We estimate the 3D head poses (lo-cation and orientation) for each face, as it is highly correlated with the gazedirection. We use it in two tasks. First, we note that objects may be lookedat while not being visible from the robot point of view. Under the assump-tion that objects of interest are being looked at, we propose to estimate theirlocations relying solely on the gaze direction of visible people. We formulatean ad hoc spatial representation based on probability heat-maps. We designseveral convolutional neural network models and train them to perform a re-gression from the space of head poses to the space of object locations. Thisprovide a set of object locations from a sequence of head poses. Second, wesuppose that the location of objects of interest are known. In this context, weintroduce a Bayesian probabilistic model, inspired from psychophysics, thatdescribes the dependency between head poses, object locations, eye-gaze di-rections, and VFOAs, along time. The formulation is based on a switchingstate-space Markov model. A specific filtering procedure is detailed to inferthe VFOAs, as well as an adapted training algorithm.The proposed contributions use data-driven approaches, and are addressedwithin the context of machine learning. All methods have been tested on pub-licly available datasets. Some training procedures additionally require to sim-ulate synthetic scenarios; the generation process is then explicitly detailed.
Document type :
Complete list of metadatas

Cited literature [145 references]  Display  Hide  Download
Contributor : Abes Star <>
Submitted on : Monday, March 11, 2019 - 9:38:14 AM
Last modification on : Tuesday, March 26, 2019 - 5:45:56 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01936821, version 3



Benoît Massé. Gaze direction in the context of social human-robot interaction. Artificial Intelligence [cs.AI]. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM055⟩. ⟨tel-01936821v3⟩



Record views


Files downloads