Skip to Main content Skip to Navigation

Exploration-exploitation dilemma in Reinforcement Learning under various form of prior knowledge

Ronan Fruit 1 
1 SEQUEL - Sequential Learning
Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189
Abstract : In combination with Deep Neural Networks (DNNs), several Reinforcement Learning (RL) algorithms such as "Q-learning" or "Policy Gradient" are now able to achieve super-human performances on most Atari Games as well as the game of Go. Despite these outstanding and promising achievements, such Deep Reinforcement Learning (DRL) algorithms require millions of samples to perform well, thus limiting their deployment to all applications where data acquisition is costly. The lack of sample efficiency of DRL can partly be attributed to the use of DNNs, which are known to be data-intensive in the training phase. But more importantly, it can be attributed to the type of Reinforcement Learning algorithm used, which only perform a very inefficient undirected exploration of the environment. For instance, Q-learning and Policy Gradient rely on randomization for exploration. In most cases, this strategy turns out to be very ineffective to properly balance the exploration needed to discover unknown and potentially highly rewarding regions of the environment, with the exploitation of rewarding regions already identified as such. Other RL approaches with theoretical guarantees on the exploration-exploitation trade-off have been investigated. It is sometimes possible to formally prove that the performances almost match the theoretical optimum. This line of research is inspired by the Multi-Armed Bandit literature, with many algorithms relying on the same underlying principle often referred as "optimism in the face of uncertainty". Even if a significant effort has been made towards understanding the exploration-exploitation dilemma generally, many questions still remain open. In this thesis, we generalize existing work on exploration-exploitation to different contexts with different amounts of prior knowledge on the learning problem. We introduce several algorithmic improvements to current state-of-the-art approaches and derive a new theoretical analysis which allows us to answer several open questions of the literature. We then relax the (very common although not very realistic) assumption that a path between any two distinct regions of the environment should always exist. Relaxing this assumption highlights the impact of prior knowledge on the intrinsic limitations of the exploration-exploitation dilemma. Finally, we show how some prior knowledge such as the range of the value function or a set of macro-actions can be efficiently exploited to speed-up learning. In this thesis, we always strive to take the algorithmic complexity of the proposed algorithms into account. Although all these algorithms are somehow computationally "efficient", they all require a planning phase and therefore suffer from the well-known "curse of dimensionality" which limits their applicability to real-world problems. Nevertheless, the main focus of this work is to derive general principles that may be combined with more heuristic approaches to help overcome current DRL flaws.
Complete list of metadata

Cited literature [109 references]  Display  Hide  Download
Contributor : Ronan Fruit Connect in order to contact the contributor
Submitted on : Monday, January 27, 2020 - 10:47:55 PM
Last modification on : Thursday, March 24, 2022 - 3:42:41 AM
Long-term archiving on: : Tuesday, April 28, 2020 - 10:46:28 PM


Files produced by the author(s)


  • HAL Id : tel-02388395, version 2


Ronan Fruit. Exploration-exploitation dilemma in Reinforcement Learning under various form of prior knowledge. Artificial Intelligence [cs.AI]. Université de Lille 1, Sciences et Technologies; CRIStAL UMR 9189, 2019. English. ⟨tel-02388395v2⟩



Record views


Files downloads