Skip to Main content Skip to Navigation

Structured Models for Action Recognition in Real-word Videos

Adrien Gaidon 1 
1 LEAR - Learning and recognition in vision
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology
Abstract : This dissertation introduces novel models to recognize broad action categories --- like "opening a door" and "running" --- in real-world video data such as movies and internet videos. In particular, we investigate how an action can be decomposed, what is its discriminative structure, and how to use this information to accurately represent video content. The main challenge we address lies in how to build models of actions that are simultaneously information-rich --- in order to correctly differentiate between different action categories --- and robust to the large variations in actors, actions, and videos present in real-world data. We design three robust models capturing both the content of and the relations between action parts. Our approach consists in structuring collections of robust local features --- such as spatio-temporal interest points and short-term point trajectories. We also propose efficient kernels to compare our structured action representations. Even if they share the same principles, our methods differ in terms of the type of problem they address and the structure information they rely on. We, first, propose to model a simple action as a sequence of meaningful atomic temporal parts. We show how to learn a flexible model of the temporal structure and how to use it for the problem of action localization in long unsegmented videos. Extending our ideas to the spatio-temporal structure of more complex activities, we, then, describe a large-scale unsupervised learning algorithm used to hierarchically decompose the motion content of videos. We leverage the resulting tree-structured decompositions to build hierarchical action models and provide an action kernel between unordered binary trees of arbitrary sizes. Instead of structuring action models, we, finally, explore another route: directly comparing models of the structure. We view short-duration actions as high-dimensional time-series and investigate how an action's temporal dynamics can complement the state-of-the-art unstructured models for action classification. We propose an efficient kernel to compare the temporal dependencies between two actions and show that it provides useful complementary information to the traditional bag-of-features approach. In all three cases, we conducted thorough experiments on some of the most challenging benchmarks used by the action recognition community. We show that each of our methods significantly outperforms the related state of the art, thus highlighting the importance of structure information for accurate and robust action recognition in real-world videos.
Document type :
Complete list of metadata

Cited literature [171 references]  Display  Hide  Download
Contributor : ABES STAR :  Contact
Submitted on : Thursday, January 24, 2013 - 3:52:08 PM
Last modification on : Friday, March 25, 2022 - 9:43:22 AM
Long-term archiving on: : Thursday, April 25, 2013 - 3:54:52 AM


Version validated by the jury (STAR)


  • HAL Id : tel-00780679, version 1



Adrien Gaidon. Structured Models for Action Recognition in Real-word Videos. General Mathematics [math.GM]. Université de Grenoble, 2012. English. ⟨NNT : 2012GRENM076⟩. ⟨tel-00780679⟩



Record views


Files downloads