Skip to Main content Skip to Navigation

Spatio-temporal descriptors for human action recognition

Abstract : Due to increasing demand for video analysis systems in recent years, human action de-tection/recognition is being targeted by the research community in order to make video description more accurate and faster, especially for big datasets. The ultimate purpose of human action recognition is to discern automatically what is happening in any given video. This thesis aims to achieve this purpose by contributing to both action detection and recognition tasks. We thus have developed new description methods for human action recognition.For the action detection component we introduce two novel approaches for human action detection. The first proposition is a simple yet effective method that aims at detecting human movements. First, video sequences are segmented into Frame Packets (FPs) and Group of Interest Points (GIP). In this method we track the movements of Interest Points in simple controlled video datasets and then in videos of gradually increasing complexity. The controlled datasets generally contain videos with a static background and simple ac-tions performed by one actor. The more complex realistic datasets are collected from social networks.The second approach for action detection attempts to address the problem of human ac-tion recognition in realistic videos captured by moving cameras. This approach works by segmenting human motion, thus investigating the optimal sufficient frame number to per-form action recognition. Using this approach, we detect object edges using the canny edge detector. Next, we apply all the steps of the motion segmentation process to each frame. Densely distributed interest points are detected and extracted based on dense SURF points with a temporal step of N frames. Then, optical flows of the detected key points between two frames are computed by the iterative Lucas and Kanade optical flow technique, using pyramids. Since we are dealing with scenes captured by moving cameras, the motion of objects necessarily involves the background and/or the camera motion. Hence, we propose to compensate for the camera motion. To do so, we must first assume that camera motion exists if most points move in the same direction. Then, we cluster optical flow vectors using a KNN clustering algorithm in order to determine if the camera motion exists. If it does, we compensate for it by applying the affine transformation to each frame in which camera motion is detected, using as input parameters the camera flow magnitude and deviation. Finally, after camera motion compensation, moving objects are segmented using temporal differencing and a bounding box is drawn around each detected moving object. The action recognition framework is applied to moving persons in the bounding box. Our goal is to reduce the amount of data involved in motion analysis while preserving the most important structural features. We believe that we have performed action detection in the spatial and temporal domain in order to obtain better action detection and recognition while at the same time considerably reducing the processing time...
Document type :
Complete list of metadatas

Cited literature [153 references]  Display  Hide  Download
Contributor : Abes Star :  Contact
Submitted on : Tuesday, February 20, 2018 - 11:16:05 AM
Last modification on : Saturday, February 15, 2020 - 2:05:07 AM
Long-term archiving on: : Monday, May 21, 2018 - 12:41:46 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01713128, version 1



Sameh Megrhi. Spatio-temporal descriptors for human action recognition. Computers and Society [cs.CY]. Université Paris-Nord - Paris XIII, 2014. English. ⟨NNT : 2014PA131046⟩. ⟨tel-01713128⟩



Record views


Files downloads