Spatial configurations and segmentation for scene understanding, application to re-identification

Robin Deléarde

Abstract

Modeling the spatial configuration of objects in an image remains a little-studied subject, even in the most modern computer vision approaches such as convolutional neural networks (CNN). Yet it is an essential aspect of scene perception, and integrating it into models should benefit many tasks in the field by helping to bridge the “semantic gap” between the digital image and the interpretation of its content. This thesis therefore aims to improve spatial configuration modeling techniques so that they can be exploited in description and recognition systems. First, we studied the spatial configuration between two objects and proposed an improvement of an existing descriptor. This new descriptor, called the “force banner”, extends the force histogram to a whole range of forces, which makes it possible to better describe complex configurations. We showed its value for scene description by learning to automatically classify relations expressed in natural language from pairs of segmented objects. We then addressed the transition to scenes containing several objects and proposed a per-object approach that compares each object with all the others, rather than using one descriptor per pair. Second, the industrial (CIFRE) context of this thesis led us to address the re-identification of scenes or objects, a task similar to fine-grained recognition from few examples. To do so, we rely on a traditional approach that describes scene components with descriptors dedicated to specific characteristics, such as color or shape, to which we add the spatial configuration. Two scenes are then compared by matching their components on these characteristics, using the Hungarian algorithm for instance.
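
The component matching mentioned above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the feature vectors, the Euclidean cost and the function names (`hungarian`, `match_scenes`) are assumptions made for the example. Each scene is assumed to be a list of per-component feature vectors, and the first scene is assumed to have no more components than the second.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Classic O(n^2 * m) Hungarian algorithm with row/column potentials.
    Returns a list of (row, column) pairs, one per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row (1-based) currently assigned to column j
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Walk back along the augmenting path and flip the assignments.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


def match_scenes(scene_a, scene_b):
    """Match components of two scenes and return (pairs, mean matching cost).

    The cost between two components is the Euclidean distance between their
    (hypothetical) feature vectors, e.g. color, shape, spatial configuration.
    """
    cost = [[math.dist(a, b) for b in scene_b] for a in scene_a]
    pairs = hungarian(cost)
    score = sum(cost[i][j] for i, j in pairs) / len(pairs)
    return pairs, score
```

A lower score indicates more similar scenes under the chosen features. Which features enter the matching cost and which enter the final score is a design choice that this sketch does not try to settle.
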
Different combinations of characteristics can be considered for the matching and for the final score, depending on the invariances that are present and those that are desired. For both topics, we had to cope with the problems of data and segmentation. We therefore generated and annotated a synthetic dataset, and exploited two existing datasets by segmenting them, in two different settings. The first approach concerns object-background segmentation, in the case where a detection is available to help the segmentation. It uses an existing global segmentation model and exploits the detection to select the right segment according to several geometric and semantic criteria. The second approach concerns the decomposition of a scene or an object into parts, in the unsupervised case. It relies on pixel color, using a clustering method in a suitable color space, such as the HSV cone that we used. Together, this work shows that the spatial configuration can be used to describe real scenes containing several objects, as well as within a complex processing chain such as the one we used for re-identification. In particular, the force histogram could be used for this purpose, which makes it possible to benefit from its good performance, by using a segmentation method suited to the use case when processing natural images.
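
The color-based decomposition into parts can be sketched as follows. The abstract does not name the clustering method, so k-means with a deterministic farthest-point initialization is an assumption here, as are the function names and the usual cone mapping (x, y, z) = (s·v·cos 2πh, s·v·sin 2πh, v). Pixels are assumed to be RGB triples in [0, 1].

```python
import colorsys
import math


def rgb_to_hsv_cone(rgb):
    """Map an RGB triple in [0, 1] to Cartesian coordinates in the HSV cone.

    In the cone, hue is an angle, so reds on both sides of the hue
    wrap-around stay close, and dark pixels collapse toward the apex.
    """
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    angle = 2.0 * math.pi * h
    return (s * v * math.cos(angle), s * v * math.sin(angle), v)


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point.

    Initialization is deterministic: the first point, then repeatedly the
    point farthest from the centers chosen so far.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels


def segment_by_color(pixels, k):
    """Cluster pixels by color in the HSV cone; returns one label per pixel."""
    return kmeans([rgb_to_hsv_cone(p) for p in pixels], k)
```

For instance, two reds whose hues fall on either side of the wrap-around (hue near 0 and near 1) end up in the same cluster, which would not be guaranteed when clustering raw hue values.
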


