Learning human actions in video

Alexander Klaser

Abstract

This dissertation targets the recognition of human actions in realistic video data, such as movies. To this end, we develop state-of-the-art feature extraction algorithms that robustly encode video information for both action classification and action localization.

In the first part, we study bag-of-features approaches for action classification. Recent approaches that use a bag-of-features representation have shown excellent results on realistic video data. We therefore conduct an extensive comparison of existing methods for local feature detection and description. We then propose two new approaches to describe local features in videos. The first method extends the concept of histograms over gradient orientations to the spatio-temporal domain. The second method describes trajectories of local interest points detected spatially. Both descriptors are evaluated in a bag-of-features setup and show an improvement over the state of the art for action classification.

In the second part, we investigate how human detection can help action recognition. First, we develop an approach that combines human detection with a bag-of-features model. The performance is evaluated for action classification with varying resolutions of spatial layout information. Next, we explore the spatio-temporal localization of human actions in Hollywood movies. We extend a human tracking approach to work robustly on realistic video data. Furthermore, we develop an action representation that is adapted to human tracks. Our experiments suggest that action localization benefits significantly from human detection. In addition, our system shows a large improvement over current state-of-the-art approaches.

Cette thèse s'intéresse à la reconnaissance des actions humaines dans des données vidéo réalistes, telles que les films. À cette fin, nous développons des algorithmes d'extraction de caractéristiques visuelles pour la classification et la localisation d'actions.

Dans une première partie, nous étudions des approches basées sur les sacs-de-mots pour la classification d'actions. Dans le cas de vidéos réalistes, certains travaux récents qui utilisent le modèle sac-de-mots pour la représentation d'actions ont montré des résultats prometteurs. Par conséquent, nous effectuons une comparaison approfondie des méthodes existantes pour la détection et la description des caractéristiques locales. Ensuite, nous proposons deux nouvelles approches pour la description des caractéristiques locales en vidéo. La première méthode étend le concept d'histogrammes sur les orientations de gradient dans le domaine spatio-temporel. La seconde méthode est basée sur des trajectoires de points d'intérêt détectés spatialement. Les deux descripteurs sont évalués avec une représentation par sac-de-mots et montrent une amélioration par rapport à l'état de l'art pour la classification d'actions.

Dans une seconde partie, nous examinons comment la détection de personnes peut contribuer à la reconnaissance d'actions. Tout d'abord, nous développons une approche qui combine la détection de personnes avec une représentation sac-de-mots. La performance est évaluée pour la classification d'actions à plusieurs niveaux d'échelle spatiale. Ensuite, nous explorons la localisation spatio-temporelle des actions humaines dans les films. Nous étendons une approche de suivi de personnes pour des vidéos réalistes. En outre, nous développons une représentation d'actions qui est adaptée aux détections de personnes. Nos expériences suggèrent que la détection de personnes améliore significativement la localisation d'actions. De plus, notre système montre une grande amélioration par rapport à l'état de l'art actuel.

Apprentissage pour la reconnaissance d'actions humaines en vidéo