Model-based 3D hand pose estimation from monocular video

Martin De de La Gorce La Gorce

Résumé

In this thesis we propose two methods that allow to recover automatically a full description of the 3d motion of a hand given a monocular video sequence of this hand. Using the information provided by the video, our aimto is to determine the full set of kinematic parameters that are required to describe the pose of the skeleton of the hand. This set of parameters is composed of the angles associate to each joint/articulation and the global position and orientation of the wrist. This problem is extremely challenging. The hand as many degrees of freedom and auto-occlusion are ubiquitous, which makes difficult the estimation of occluded or partially ocluded hand parts.In this thesis, we introduce two novel methods of increasing complexity that improve to certain extend the state-of-the-art for monocular hand tracking problem. Both are model-based methods and are based on a hand model that is fitted to the image. This process is guided by an objective function that defines some image-based measure of the hand projection given the model parameters. The fitting process is achieved through an iterative refinement technique that is based on gradient-descent and aims a minimizing the objective function. The two methos differ mainly by the choice of the hand model and of the cost function.The first method relies on a hand model made of ellipsoids and a simple discrepancy measure based on global color distributions of the hand and the background. The second method uses a triangulated surface model with texture and shading and exploits a robust distance between the synthetic and observed image as discrepancy measure.While computing the gradient of the discrepancy measure, a particular attention is given to terms related to the changes of visibility of the surface near self occlusion boundaries that are neglected in existing formulations. Our hand tracking method is not real-time, which makes interactive applications not yet possible. Increase of computation power of computers and improvement of our method might make real-time attainable.

Dans cette thèse sont présentées deux méthodes visant à obtenir automatiquement une description tridimensionnelle des mouvements d'une main étant donnée une séquence vidéo monoculaire de cette main. En utilisant l'information fournie par la vidéo, l'objectif est de déterminer l'ensemble des paramètres cinématiques nécessaires à la description de la configuration spatiale des différentes parties de la main. Cet ensemble de paramètres est composé des angles de chaque articulation ainsi que de la position et de l'orientation globale du poignet. Ce problème est un problème difficile. La main a de nombreux degrés de liberté et les auto-occultations sont omniprésentes, ce qui rend difficile l'estimation de la configuration des parties partiellement ou totalement cachées. Dans cette thèse sont proposées deux nouvelles méthodes qui améliorent par certains aspects l'état de l'art pour ce problème. Ces deux méthodes sont basées sur un modèle de la main dont la configuration spatiale est ajustée pour que sa projection dans l'image corresponde au mieux à l'image de main observée. Ce processus est guidé par une fonction de coût qui définit une mesure quantitative de la qualité de l'alignement de la projection du modèle avec l'image observée. La procédure d'ajustement du modèle est réalisée grâce à un raffinement itératif de type descente de gradient quasi-newton qui vise à minimiser cette fonction de coût.Les deux méthodes proposées diffèrent principalement par le choix du modèle et de la fonction du coût. La première méthode repose sur un modèle de la main composé d'ellipsoïdes et d'une fonction coût utilisant un modèle de la distribution statistique de la couleur la main et du fond de l'image.La seconde méthode repose sur un modèle triangulé de la surface de la main qui est texturé est ombragé. La fonction de coût mesure directement, pixel par pixel, la différence entre l'image observée et l'image synthétique obtenue par projection du modèle de la main dans l'image. Lors du calcul du gradient de la fonction de coût, une attention particulière a été portée aux termes dûs aux changements de visibilité de la surface au voisinage des auto-occultations, termes qui ont été négligés dans les méthodes préexistantes.Ces deux méthodes ne fonctionnement malheureusement pas en temps réel, ce qui rend leur utilisation pour l'instant impossible dans un contexte d'interaction homme-machine. L'amélioration de la performance des ordinateur combinée avec une amélioration de ces méthodes pourrait éventuellement permettre d'obtenir un résultat en temps réel.

Model-based 3D hand pose estimation from monocular video

Suivi automatique de la main à partir de séquences vidéo monoculaires

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager