
Weakly supervised methods for learning actions and objects

Alessandro Prest 1, 2 
1 LEAR - Learning and recognition in vision
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology
Abstract : Modern computer vision systems learn visual concepts from examples (i.e. images) that have been manually annotated by humans. While this paradigm has allowed the field to progress tremendously in the last decade, it has now become one of its major bottlenecks. Teaching a new visual concept requires an expensive human annotation effort, which prevents systems from scaling from the few dozen visual concepts that work today to thousands. The exponential growth of visual data available on the net represents an invaluable resource for visual learning algorithms and calls for new methods able to exploit this information to learn visual concepts without major human annotation effort.

As a first contribution, we introduce an approach for learning human actions as interactions between persons and objects in realistic images. By exploiting the spatial structure of human-object interactions, we are able to learn action models automatically from a set of still images annotated only with the action label (weakly supervised). Extensive experimental evaluation demonstrates that our weakly supervised approach achieves the same performance as popular fully supervised methods despite using substantially less supervision.

In the second part of this thesis we extend this reasoning to human-object interactions in realistic video and feature-length movies. Popular methods represent actions with low-level features such as image gradients or optical flow. In our approach, instead, interactions are modeled as the trajectory of the object with respect to the person position, providing a rich and natural description of actions. Our interaction descriptor is an informative cue on its own and is complementary to traditional low-level features.

Finally, in the third part we propose an approach for learning object detectors from real-world web videos (e.g. YouTube). As opposed to the standard paradigm of learning from still images annotated with bounding boxes, we propose a technique to learn from videos known only to contain objects of a target class. We demonstrate that learning detectors from video alone already delivers good performance while requiring much less supervision than training from images annotated with bounding boxes. We additionally show that training from a combination of weakly annotated videos and fully annotated still images improves over training from still images alone.
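The second contribution describes interactions as the trajectory of the object relative to the person position. The following is only a minimal sketch of that general idea, not the descriptor used in the thesis: the box format `(x_min, y_min, width, height)`, the function names, and the normalization by person height are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format (an assumption): (x_min, y_min, width, height).
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point of a box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def relative_trajectory(
    person_boxes: List[Box], object_boxes: List[Box]
) -> List[Tuple[float, float]]:
    """Per-frame offset of the object center from the person center.

    Offsets are divided by the person's height (an illustrative choice)
    so the result does not depend on where or how large the person
    appears in the frame.
    """
    if len(person_boxes) != len(object_boxes):
        raise ValueError("expected one person box and one object box per frame")

    trajectory = []
    for person, obj in zip(person_boxes, object_boxes):
        person_height = person[3]
        if person_height <= 0:
            raise ValueError("person box must have positive height")
        px, py = box_center(person)
        ox, oy = box_center(obj)
        trajectory.append(((ox - px) / person_height, (oy - py) / person_height))
    return trajectory


# Two frames in which person and object shift together across the image.
person_track = [(0.0, 0.0, 2.0, 4.0), (10.0, 0.0, 2.0, 4.0)]
object_track = [(2.0, 1.0, 2.0, 2.0), (12.0, 1.0, 2.0, 2.0)]
trajectory = relative_trajectory(person_track, object_track)
```

In this toy input the person and the object move together, so the relative offsets are identical in both frames even though the absolute positions differ. That invariance is the motivation for describing the object in the person's frame of reference.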

Cited literature : 116 references
Contributor : Alessandro Prest
Submitted on : Thursday, November 29, 2012 - 12:52:27 PM
Last modification on : Thursday, January 20, 2022 - 5:30:56 PM
Long-term archiving on : Saturday, December 17, 2016 - 5:44:44 PM


  • HAL Id : tel-00758797, version 1



Alessandro Prest. Weakly supervised methods for learning actions and objects. Computer science. Eidgenössische Technische Hochschule Zürich (ETHZ), 2012. English. ⟨tel-00758797⟩


