Skip to Main content Skip to Navigation

Budgeted Classification-based Policy Iteration

Victor Gabillon 1, 2
2 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms, called classification-based policy iteration (CBPI). Contrary to the standard RL methods, CBPI do not use an explicit representation for value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm, is no longer defined by a (approximated) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements are based on the following two shortcomings of the existing CBPI algorithms: 1) The rollouts that are used to estimate the action-value functions should be truncated and their number is limited, and thus, we have to deal with bias-variance tradeoff in estimating the rollouts, and 2) The rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs.
Document type :
Complete list of metadata

Cited literature [103 references]  Display  Hide  Download
Contributor : Preux Philippe Connect in order to contact the contributor
Submitted on : Monday, April 4, 2016 - 11:42:33 AM
Last modification on : Thursday, January 20, 2022 - 4:17:20 PM
Long-term archiving on: : Tuesday, July 5, 2016 - 12:41:44 PM


  • HAL Id : tel-01297386, version 1


Victor Gabillon. Budgeted Classification-based Policy Iteration. Machine Learning [stat.ML]. Universite Lille 1, 2014. English. ⟨tel-01297386⟩



Les métriques sont temporairement indisponibles