
Modeling and recognizing interactions between people, objects and scenes

Abstract: In this thesis, we focus on modeling interactions between people, objects and scenes, and show the benefits of combining the corresponding cues for improving both action classification and scene understanding. In the first part, we seek to exploit the scene and object context to improve action classification in still images. We explore alternative bag-of-features models and propose a method that takes advantage of the scene context. We then propose a new model exploiting the object context for action classification based on pairs of body part and object detectors. We evaluate our methods on our newly collected still image dataset as well as three other datasets for action classification and show performance close to the state of the art. In the second part of this thesis, we address the reverse problem and aim at using the contextual information provided by people to help object localization and scene understanding. We collect a new dataset of time-lapse videos involving people interacting with indoor scenes. We develop an approach to describe image regions by the distribution of co-located human poses and use this pose-based representation to improve object localization. We further demonstrate that people cues can improve several steps of existing pipelines for indoor scene understanding. Finally, we extend the annotation of our time-lapse dataset to 3D and show how to infer object labels for occupied 3D volumes of a scene.
To summarize, the contributions of this thesis are the following: (i) we design action classification models for still images that take advantage of the scene and object context, and we gather a new dataset to evaluate their performance; (ii) we develop a new model to improve object localization thanks to observations of people interacting with an indoor scene, and test it on a new dataset centered on person, object and scene interactions; (iii) we propose the first method to evaluate the volumes occupied by different object classes in a room, which allows us to analyze the current 3D scene understanding pipeline and identify its main source of errors.

Cited literature: 186 references
Contributor: Abes Star
Submitted on: Friday, February 15, 2019 - 11:59:26 AM
Last modification on: Thursday, October 29, 2020 - 3:01:17 PM
Long-term archiving on: Friday, May 17, 2019 - 12:55:47 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01256076, version 2



Vincent Delaitre. Modeling and recognizing interactions between people, objects and scenes. Computer Vision and Pattern Recognition [cs.CV]. Ecole normale supérieure - ENS PARIS, 2015. English. ⟨NNT : 2015ENSU0003⟩. ⟨tel-01256076v2⟩


