
On the notion of optimality in the stochastic multi-armed bandit problems

Abstract: The topics addressed in this thesis lie in statistical machine learning and sequential statistics. Our main framework is that of stochastic multi-armed bandit problems. In this work we revisit lower bounds on the regret. We obtain non-asymptotic, distribution-dependent bounds and provide simple proofs based only on well-known properties of the Kullback-Leibler divergence. These bounds show in particular that in the initial phase the regret grows almost linearly, and that the well-known logarithmic growth of the regret only holds in a final phase. Then, we propose algorithms for regret minimization in stochastic bandit models with exponential families of distributions, or with distributions only assumed to be supported on the unit interval, that are simultaneously asymptotically optimal (in the sense of the Lai and Robbins lower bound) and minimax optimal. We also analyze the sample complexity of sequentially identifying the distribution whose expectation is closest to some given threshold, with and without the assumption that the mean values of the distributions are increasing. This work is motivated by phase I clinical trials, a practically important setting where the arm means are increasing by nature. Finally, we extend Fano's inequality, which controls the average probability of (disjoint) events in terms of the average of some Kullback-Leibler divergences, to work with arbitrary unit-valued random variables. Several novel applications are provided, in which the consideration of random variables is particularly handy. The most important applications deal with Bayesian posterior concentration rates (minimax or distribution-dependent) and with a lower bound on the regret in non-stochastic sequential learning.
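For reference, a hedged sketch of the two classical results the abstract builds on, in standard notation not taken from the thesis itself (arm distributions $\nu_a$ with means $\mu_a$, optimal mean $\mu^\star$, gaps $\Delta_a = \mu^\star - \mu_a$, regret $R_T$ after $T$ rounds):

```latex
% Lai-Robbins asymptotic lower bound: for any uniformly efficient
% strategy, the regret must eventually grow logarithmically, at a rate
% governed by the Kullback-Leibler divergences to the optimal arm.
\liminf_{T \to \infty} \frac{R_T}{\log T}
  \;\geq\; \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}(\nu_a, \nu^\star)}

% Classical Fano inequality (average form): for probability measures
% P_1, \dots, P_N, pairwise disjoint events A_1, \dots, A_N, and any
% reference measure Q,
\frac{1}{N} \sum_{i=1}^{N} P_i(A_i)
  \;\leq\; \frac{\frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}(P_i, Q) + \log 2}{\log N}
```

The thesis's contribution, per the abstract, is to replace the indicator functions $\mathbf{1}_{A_i}$ in the second display by arbitrary $[0,1]$-valued random variables; the sketch above only states the classical forms being generalized.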

Cited literature: 140 references
Submitted on : Monday, May 6, 2019 - 4:33:06 PM
Last modification on : Saturday, August 15, 2020 - 3:58:10 AM
Long-term archiving on: Tuesday, October 1, 2019 - 11:58:14 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02121614, version 1


Pierre Ménard. On the notion of optimality in the stochastic multi-armed bandit problems. Statistics [math.ST]. Université Paul Sabatier - Toulouse III, 2018. English. ⟨NNT : 2018TOU30087⟩. ⟨tel-02121614⟩


