Learning from motion

Pavel Tokmakov 1, 2
2 Thoth - Apprentissage de modèles à partir de données massives
LJK - Laboratoire Jean Kuntzmann, Inria Grenoble - Rhône-Alpes
Abstract : Weakly-supervised learning studies the problem of minimizing the amount of human effort required for training state-of-the-art models. This allows to leverage a large amount of data. However, in practice weakly-supervised methods perform significantly worse than their fully-supervised counterparts. This is also the case in deep learning, where the top-performing computer vision approaches remain fully-supervised, which limits their usage in real world applications. This thesis attempts to bridge the gap between weakly-supervised and fully-supervised methods by utilizing motion information. It also studies the problem of moving object segmentation itself, proposing one of the first learning-based methods for this task.We focus on the problem of weakly-supervised semantic segmentation. This is especially challenging due to the need to precisely capture object boundaries and avoid local optima, as for example segmenting the most discriminative parts. In contrast to most of the state-of-the-art approaches, which rely on static images, we leverage video data with object motion as a strong cue. In particular, our method uses a state-of-the-art video segmentation approach to segment moving objects in videos. The approximate object masks produced by this method are then fused with the semantic segmentation model learned in an EM-like framework to infer pixel-level semantic labels for video frames. Thus, as learning progresses, the quality of the labels improves automatically. We then integrate this architecture with our learning-based approach for video segmentation to obtain a fully trainable framework for weakly-supervised learning from videos.In the second part of the thesis we study unsupervised video segmentation, the task of segmenting all the objects in a video that move independently from the camera. This task presents challenges such as strong camera motion, inaccuracies in optical flow estimation and motion discontinuity. We address the camera motion problem by proposing a learning-based method for motion segmentation: a convolutional neural network that takes optical flow as input and is trained to segment objects that move independently from the camera. It is then extended with an appearance stream and a visual memory module to improve temporal continuity. The appearance stream capitalizes on the semantic information which is complementary to the motion information. The visual memory module is the key component of our approach: it combines the outputs of the motion and appearance streams and aggregates a spatio-temporal representation of the moving objects. The final segmentation is then produced based on this aggregated representation. The resulting approach obtains state-of-the-art performance on several benchmark datasets, outperforming the concurrent deep learning and heuristic-based methods.
Complete list of metadatas

Contributor : Abes Star <>
Submitted on : Tuesday, October 30, 2018 - 3:34:09 PM
Last modification on : Wednesday, May 15, 2019 - 10:39:19 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01908817, version 1



Pavel Tokmakov. Learning from motion. Computer Vision and Pattern Recognition [cs.CV]. Université Grenoble Alpes, 2018. English. ⟨NNT : 2018GREAM031⟩. ⟨tel-01908817⟩



Record views


Files downloads