Human action recognition in videos

Piotr Tadeusz Biliński 1
1 STARS - Spatio-Temporal Activity Recognition Systems
CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : This thesis targets the automatic recognition of human actions in videos. Human action recognition is defined as a requirement to determine what human actions occur in videos. This problem is particularly hard due to enormous variations in visual and motion appearance of people and actions, camera viewpoint changes, moving background, occlusions, noise, and enormous amount of video data. Firstly, we review, evaluate, and compare the most popular and the most prominent state-of-the-art techniques, and we propose our action recognition framework based on local features, which we use throughout this thesis work embedding the novel algorithms. Moreover, we introduce a new dataset (CHU Nice Hospital) with daily self care actions of elder patients in a hospital. Then, we propose two local spatio-temporal descriptors for action recognition in videos. The first descriptor is based on a covariance matrix representation, and it models linear relations between low-level features. The second descriptor is based on a Brownian covariance, and it models all kinds of possible relations between low-level features. Then, we propose three higher-level feature representations to go beyond the limitations of the local feature encoding techniques. The first representation is based on the idea of relative dense trajectories. We propose an object-centric local feature representation of motion trajectories, which allows to use the spatial information by a local feature encoding technique. The second representation encodes relations among local features as pairwise features. The main idea is to capture the appearance relations among features (both visual and motion), and use geometric information to describe how these appearance relations are mutually arranged in the spatio-temporal space. The third representation captures statistics of pairwise co-occurring visual words within multi-scale feature-centric neighbourhoods. The proposed contextual features based representation encodes information about local density of features, local pairwise relations among the features, and spatio-temporal order among features. Finally, we show that the proposed techniques obtain better or similar performance in comparison to the state-of-the-art on various, real, and challenging human action recognition datasets (Weizmann, KTH, URADL, MSR Daily Activity 3D, HMDB51, and CHU Nice Hospital).
Document type :
Complete list of metadatas

Cited literature [149 references]  Display  Hide  Download
Contributor : Abes Star <>
Submitted on : Monday, March 23, 2015 - 3:52:05 PM
Last modification on : Thursday, September 27, 2018 - 1:18:27 AM
Long-term archiving on: Thursday, July 2, 2015 - 6:30:27 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01134481, version 1



Piotr Tadeusz Biliński. Human action recognition in videos. Other [cs.OH]. Université Nice Sophia Antipolis, 2014. English. ⟨NNT : 2014NICE4125⟩. ⟨tel-01134481⟩



Record views


Files downloads