Towards real-time image understanding with convolutional networks

Abstract: One of the open questions in computer vision is how to produce good internal representations of the visual world. What sort of internal representation would allow an artificial vision system to detect and classify objects into categories, independently of pose, scale, illumination, conformation, and clutter? More interestingly, how could an artificial vision system learn appropriate internal representations automatically, the way animals and humans seem to learn by simply looking at the world? A related question is that of computational tractability, and more precisely computational efficiency: given a good visual representation, how efficiently can it be trained and used to encode new sensory data? Efficiency has several dimensions: power requirements, processing speed, and memory usage.

In this thesis I present three new contributions to the field of computer vision: (1) a multiscale deep convolutional network architecture to easily capture long-distance relationships between input variables in image data; (2) a tree-based algorithm to efficiently explore multiple segmentation candidates and produce maximally confident semantic segmentations of images; (3) a custom dataflow computer architecture optimized for the computation of convolutional networks and similarly dense image processing models. All three contributions were produced with the common goal of getting us closer to real-time image understanding.

Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. In the first part of this thesis, I propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In parallel to feature extraction, a tree of segments is computed from a graph of pixel dissimilarities. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier, which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that covers the image is then selected so as to maximize the average "purity" of the class distributions, hence maximizing the overall likelihood that each segment contains a single object (...)
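The node-selection step described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes that "purity" is measured as low entropy of a node's predicted class distribution, that each node's cost is weighted by the number of pixels it covers, and that the cover is a cut of the tree (every leaf falls under exactly one selected node). The names `Node`, `entropy`, and `best_cover` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    dist: List[float]   # predicted class distribution for this segment (assumed given)
    size: int           # number of pixels covered by this segment
    children: List["Node"] = field(default_factory=list)


def entropy(dist: List[float]) -> float:
    """Shannon entropy in nats; lower entropy means a purer segment."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def best_cover(node: Node) -> Tuple[float, List[Node]]:
    """Return (cost, nodes) for the cheapest cut of the subtree rooted at node.

    The cost of a cut is the sum of size * entropy over its nodes, so
    minimizing it maximizes the pixel-weighted average purity.
    """
    own_cost = node.size * entropy(node.dist)
    if not node.children:
        return own_cost, [node]

    child_cost = 0.0
    child_nodes: List[Node] = []
    for child in node.children:
        cost, nodes = best_cover(child)
        child_cost += cost
        child_nodes.extend(nodes)

    # Keep this node whole if it is at least as pure as its best subdivision.
    if own_cost <= child_cost:
        return own_cost, [node]
    return child_cost, child_nodes


# Toy tree with two classes: the root is mixed, "a" and "b" are fairly pure.
a1 = Node("a1", [0.6, 0.4], 1)
a2 = Node("a2", [0.7, 0.3], 1)
a = Node("a", [0.9, 0.1], 2, [a1, a2])
b = Node("b", [0.1, 0.9], 2)
root = Node("root", [0.5, 0.5], 4, [a, b])

cost, cover = best_cover(root)
print([n.name for n in cover])
```

In this toy tree the mixed root is split, while segment "a" is kept whole because its distribution is purer than those of its two leaves. A single bottom-up pass visits each node once, which is what makes exploring many segmentation candidates cheap under these assumptions.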
Document type:
Computation and Language [cs.CL]. Université Paris-Est, 2013. English. <NNT : 2013PEST1083>
Contributor: Abes Star
Submitted on: Monday, 6 June 2016 - 15:07:07
Last modified on: Tuesday, 28 June 2016 - 14:14:23


Version validated by the jury (STAR)


  • HAL Id: tel-01327221, version 1



Clément Farabet. Towards real-time image understanding with convolutional networks. Computation and Language [cs.CL]. Université Paris-Est, 2013. English. <NNT : 2013PEST1083>. <tel-01327221>


