# Learning to Recognize Actions with Weak Supervision

Abstract: With the rapid growth of digital video content, automatic video understanding has become an increasingly important task. Video understanding spans several applications, such as web-video content analysis, autonomous vehicles, and human-machine interfaces (e.g., Kinect). This thesis makes contributions addressing two major problems in video understanding: webly-supervised action detection and human action localization.

Webly-supervised action recognition aims to learn actions from video content on the internet, with no additional supervision. We propose a novel approach in this context, which leverages the synergy between visual video data and the associated textual metadata to learn event classifiers with no manual annotations. Specifically, we first collect a video dataset with queries constructed automatically from the textual description of events, prune irrelevant videos using text and video data, and then learn the corresponding event classifiers. We show the importance of both main steps of our method, i.e., query generation and data pruning, with quantitative results. We evaluate this approach in the challenging setting where no manually annotated training set is available, i.e., the EK0 condition in the TRECVID challenge, and show state-of-the-art results on the MED 2011 and 2013 datasets.

In the second part of the thesis, we focus on human action localization, which involves recognizing actions that occur in a video, such as "drinking" or "phoning", as well as their spatial and temporal extent. We propose a new person-centric framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. The motivation is two-fold. First, it allows us to handle occlusions and camera viewpoint changes when localizing people, as it infers full-body localization.
Second, it provides a better reference grid for extracting action information than standard human tubes, i.e., tubes which frame visible parts only. This is achieved by training a novel human part detector that scores visible parts while regressing full-body bounding boxes, even when they lie outside the frame. The core of our method is a convolutional neural network which learns part proposals specific to certain body parts. These are then combined to detect people robustly in each frame. Our tracking algorithm connects the image detections temporally to extract full-body human tubes. We evaluate our new tube extraction method on a recent challenging dataset, DALY, showing state-of-the-art results.
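The webly-supervised pipeline above proceeds in two stages: generate search queries from an event's textual description, then prune retrieved videos whose metadata does not match. A minimal illustrative sketch is given below; the function names, the keyword heuristics, and the overlap-based pruning score are all hypothetical simplifications, not the learned text/video models used in the thesis.

```python
def generate_queries(event_description,
                     stopwords=frozenset({"a", "the", "of", "with"})):
    """Build simple search queries from an event's textual description."""
    words = [w.lower().strip(".,") for w in event_description.split()]
    keywords = [w for w in words if w not in stopwords]
    # Group keywords in pairs to form slightly richer queries.
    return [" ".join(keywords[i:i + 2]) for i in range(0, len(keywords), 2)]

def prune_videos(videos, queries, min_overlap=1):
    """Keep videos whose textual metadata shares enough terms with the queries."""
    query_terms = {t for q in queries for t in q.split()}
    kept = []
    for title, metadata in videos:
        overlap = len(query_terms & set(metadata.lower().split()))
        if overlap >= min_overlap:
            kept.append(title)
    return kept

# Hypothetical usage: retrieve, then prune, before training a classifier.
queries = generate_queries("Birthday party with cake")
videos = [("v1", "birthday cake smash"), ("v2", "unboxing a phone")]
kept = prune_videos(videos, queries)  # -> ["v1"]
```

In the actual method, both stages operate on learned representations of the video content and metadata rather than raw keyword overlap, but the control flow (query, retrieve, prune, train) is the same.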
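The tracking step that connects per-frame detections into tubes can be sketched with a standard greedy association on bounding-box overlap. This is a common simplification for illustration only; the linking criterion, thresholds, and full-body regression used in the thesis differ.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tubes(frames, thr=0.3):
    """Greedily extend each tube with the best-overlapping detection per frame.

    `frames` is a list of per-frame detection lists; each detection is a
    (x1, y1, x2, y2) box. Unmatched detections start new tubes.
    """
    tubes = [[box] for box in frames[0]]
    for dets in frames[1:]:
        unused = list(dets)
        for tube in tubes:
            if not unused:
                break
            best = max(unused, key=lambda b: iou(tube[-1], b))
            if iou(tube[-1], best) >= thr:
                tube.append(best)
                unused.remove(best)
        tubes.extend([b] for b in unused)  # new tubes for unmatched boxes
    return tubes

# A single person drifting across three frames yields one tube of length 3.
tubes = link_tubes([[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(2, 2, 12, 12)]])
```

The full-body aspect of the thesis enters before this step: because the detector regresses complete boxes even for truncated people, the boxes being linked remain comparable across frames where visibility changes.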
Document type: Theses

Cited literature [187 references]

https://tel.archives-ouvertes.fr/tel-01893147
Contributor: Abes Star
Submitted on: Thursday, October 11, 2018 - 10:31:20 AM
Last modification on: Wednesday, November 4, 2020 - 3:22:37 PM
Long-term archiving on: Saturday, January 12, 2019 - 1:21:17 PM

### File

CHESNEAU_2018_archivage.pdf
Version validated by the jury (STAR)

### Identifiers

• HAL Id: tel-01893147, version 1

### Citation

Nicolas Chesneau. Learning to Recognize Actions with Weak Supervision. Modeling and Simulation. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM007⟩. ⟨tel-01893147⟩
