Exploration-exploitation dilemma in Reinforcement Learning under various form of prior knowledge

Ronan Fruit

Abstract

In combination with Deep Neural Networks (DNNs), several Reinforcement Learning (RL) algorithms such as "Q-learning" or "Policy Gradient" now achieve superhuman performance on most Atari games as well as the game of Go. Despite these outstanding and promising achievements, such Deep Reinforcement Learning (DRL) algorithms require millions of samples to perform well, which limits their deployment in applications where data acquisition is costly. The lack of sample efficiency of DRL can partly be attributed to the use of DNNs, which are known to be data-intensive in the training phase. More importantly, it can be attributed to the type of RL algorithm used, which performs only a very inefficient, undirected exploration of the environment. For instance, Q-learning and Policy Gradient rely on randomization for exploration. In most cases, this strategy fails to properly balance the exploration needed to discover unknown and potentially highly rewarding regions of the environment against the exploitation of regions already identified as rewarding. Other RL approaches with theoretical guarantees on the exploration-exploitation trade-off have been investigated, and it is sometimes possible to formally prove that their performance almost matches the theoretical optimum. This line of research is inspired by the Multi-Armed Bandit literature, with many algorithms relying on the same underlying principle, often referred to as "optimism in the face of uncertainty". Although significant effort has gone into understanding the exploration-exploitation dilemma in general, many questions remain open.
In this thesis, we generalize existing work on exploration-exploitation to different contexts with different amounts of prior knowledge about the learning problem. We introduce several algorithmic improvements to current state-of-the-art approaches and derive a new theoretical analysis that allows us to answer several open questions in the literature. We then relax the very common, although not very realistic, assumption that a path always exists between any two distinct regions of the environment. Relaxing this assumption highlights the impact of prior knowledge on the intrinsic limitations of the exploration-exploitation dilemma. Finally, we show how prior knowledge such as the range of the value function or a set of macro-actions can be efficiently exploited to speed up learning. Throughout this thesis, we strive to take the computational complexity of the proposed algorithms into account. Although all these algorithms are reasonably efficient computationally, they all require a planning phase and therefore suffer from the well-known "curse of dimensionality", which limits their applicability to real-world problems. Nevertheless, the main focus of this work is to derive general principles that may be combined with more heuristic approaches to help overcome current DRL flaws.
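
The contrast drawn in the abstract between exploration by randomization and "optimism in the face of uncertainty" can be illustrated in the simplest setting the abstract mentions, a multi-armed bandit. The sketch below is not one of the algorithms of the thesis. It assumes a Bernoulli bandit and compares two textbook strategies, epsilon-greedy (undirected, randomized exploration) and UCB1 (an optimistic index equal to the empirical mean plus a confidence bonus). All function names, the arm means, and the horizon are hypothetical choices made for illustration.

```python
import math
import random


def run_bandit(choose_arm, means, horizon, rng):
    """Play a Bernoulli bandit for `horizon` steps; return pull counts per arm."""
    counts = [0] * len(means)   # number of pulls of each arm
    sums = [0.0] * len(means)   # cumulative reward of each arm
    for t in range(1, horizon + 1):
        arm = choose_arm(t, counts, sums, rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


def epsilon_greedy(epsilon):
    """Undirected exploration: pick a uniformly random arm with probability epsilon."""
    def choose(t, counts, sums, rng):
        if 0 in counts:                      # pull every arm once first
            return counts.index(0)
        if rng.random() < epsilon:
            return rng.randrange(len(counts))
        averages = [s / n for s, n in zip(sums, counts)]
        return averages.index(max(averages))
    return choose


def ucb1(t, counts, sums, rng):
    """Optimism in the face of uncertainty: play the arm with the highest upper bound."""
    if 0 in counts:                          # pull every arm once first
        return counts.index(0)
    indices = [s / n + math.sqrt(2.0 * math.log(t) / n)
               for s, n in zip(sums, counts)]
    return indices.index(max(indices))


def regret(counts, means):
    """Expected (pseudo-)regret: reward lost relative to always playing the best arm."""
    best = max(means)
    return sum(n * (best - m) for n, m in zip(counts, means))


if __name__ == "__main__":
    means = [0.2, 0.8]  # hypothetical arm means, unknown to the learner
    for name, policy in [("epsilon-greedy", epsilon_greedy(0.1)), ("UCB1", ucb1)]:
        counts = run_bandit(policy, means, 2000, random.Random(0))
        print(name, counts, regret(counts, means))
```

With a fixed epsilon, epsilon-greedy keeps sampling the worse arm at a constant rate, so its regret grows linearly with the horizon. The bonus in UCB1 shrinks as an arm is pulled, which directs exploration toward arms that are still uncertain. The thesis itself concerns the harder RL setting, where such optimistic strategies additionally require a planning phase.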

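French title: Impact des connaissances a priori sur le compromis exploration-exploitation en apprentissage par renforcement (Impact of prior knowledge on the exploration-exploitation trade-off in reinforcement learning)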