Méthodes d'apprentissage statistique pour le ranking théorie, algorithmes et applications

Sylvain Robbiano

Résumé

Multipartite ranking is a statistical learning problem that consists in ordering observations that belong to a high dimensional feature space in the same order as the labels, so that the observations with the highest label appear at the top on the list. This work aims to understand the probabilistic nature of the multipartite ranking problem in order to obtain theoretical guaranties for ranking algorithms. In that framework, the output of a ranking algorithm takes the form of a scoring function, a function that maps the space of the observations to the real line which order is induced using the values on the real line. The contributions of this manuscript are the following : first we focus on the characterization of the optimal solutions of multipartite ranking. A new condition on the likelihood ratios is introduced and shown to be necessary and sufficient to make the multipartite ranking well-posed. Then, we look at the criteria to assess the scoring function and propose to use a generalization of the ROC curve named the ROC surface. To be used in applications, the empirical counterpart of the ROC surface is studied and results on its consistency are stated. The second topic of research is the design of algorithms to produce scoring functions. The first procedure is based on the aggregation of scoring functions learnt from bipartite sub-problems. To the aim of aggregating the orders induced by the scoring function, we use a metric approach based on the Kendall- to find a median scoring function. The second procedure is a tree-based recursive method inspired by the TreeRank algorithm that can be viewed as a weighted version of CART. A simple modification is proposed to obtain an approximation of the optimal ROC surface using a piecewise constant scoring function. These procedures are compared to the state of the art algorithms for multipartite ranking using simulated and real data sets. The performances highlight the cases where our procedures are well-adapted, specifically when the dimension of the features space is much larger than the number of labels. Last but not least, we come back to the bipartite ranking problem in order to derive adaptive minimax rates of convergence. These rates are established for classes of distributions controlled by the complexity of the posterior distribution and a low noise condition. The procedure that achieves these rates is based on plug-in estimators of the posterior distribution and an aggregation using exponential weights.

Le ranking multipartite est un problème d'apprentissage statistique qui consiste à ordonner les observations qui appartiennent à un espace de grande dimension dans le même ordre que les labels, de sorte que les observations avec le label le plus élevé apparaissent en haut de la liste. Cette thèse vise à comprendre la nature probabiliste du problème de ranking multipartite afin d'obtenir des garanties théoriques pour les algorithmes de ranking. Dans ce cadre, la sortie d'un algorithme de ranking prend la forme d'une fonction de scoring, une fonction qui envoie l'espace des observations sur la droite réelle et l'ordre final est construit en utilisant l'ordre induit par la droite réelle. Les contributions de ce manuscrit sont les suivantes : d'abord, nous nous concentrons sur la caractérisation des solutions optimales de ranking multipartite. Une nouvelle condition sur les rapports de vraisemblance est introduite et jugée nécessaire et suffisante pour rendre le problème de ranking multipartite bien posé. Ensuite, nous examinons les critères pour évaluer la fonction de scoring et on propose d'utiliser une généralisation de la courbe ROC nommée la surface ROC pour cela ainsi que le volume induit par cette surface. Pour être utilisée dans les applications, la contrepartie empirique de la surface ROC est étudiée et les résultats sur sa consistance sont établis. Le deuxième thème de recherche est la conception d'algorithmes pour produire des fonctions de scoring. La première procédure est basée sur l'agrégation des fonctions de scoring apprises sur des sous-problèmes de ranking binaire. Dans le but d'agréger les ordres induits par les fonctions de scoring, nous utilisons une approche métrique basée sur le de Kendall pour trouver une fonction de scoring médiane. La deuxième procédure est une méthode récursive, inspirée par l'algorithme TreeRank qui peut être considéré comme une version pondérée de CART. Une simple modification est proposée pour obtenir une approximation de la surface ROC optimale en utilisant une fonction de scoring constante par morceaux. Ces procédures sont comparées aux algorithmes de l'état de l'art pour le ranking multipartite en utilisant des jeux de données réelles et simulées. Les performances mettent en évidence les cas où nos procédures sont bien adaptées, en particulier lorsque la dimension de l'espace des caractéristiques est beaucoup plus grand que le nombre d'étiquettes. Enfin, nous revenons au problème de ranking binaire afin d'établir des vitesses minimax adaptatives de convergence. Ces vitesses sont montrées pour des classes de distributions contrôlées par la complexité de la distribution a posteriori et une condition de faible bruit. La procédure qui permet d'atteindre ces taux est basée sur des estimateurs de type plug-in de la distribution a posteriori et une méthode d'agrégation utilisant des poids exponentiels.

Méthodes d'apprentissage statistique pour le ranking théorie, algorithmes et applications

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager