
Multi-armed Bandits with Unconventional Feedback (Bandits Multi-bras avec retour d'information non-conventionnelle)

Pratik Gajane 1
1 SEQUEL - Sequential Learning
Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189
Abstract: The multi-armed bandit (MAB) problem is a mathematical formulation of the exploration-exploitation trade-off inherent to reinforcement learning, in which the learner chooses an action (symbolized by an arm) from a set of available actions in a sequence of trials in order to maximize its reward. In the classical MAB problem, the learner receives absolute bandit feedback, i.e. it receives as feedback the reward of the arm it selects. In many practical situations, however, a different kind of feedback is more readily available. In this thesis, we study two such kinds of feedback, namely relative feedback and corrupt feedback.

The main practical motivation behind relative feedback arises from the task of online ranker evaluation. This task involves choosing the optimal ranker from a finite set of rankers using only pairwise comparisons, while minimizing the comparisons between sub-optimal rankers. This is formalized by the MAB problem with relative feedback, in which the learner selects two arms instead of one and receives preference feedback. We consider the adversarial formulation of this problem, which circumvents the stationarity assumption over the mean rewards for the arms. We provide a lower bound on the performance measure for any algorithm for this problem. We also provide an algorithm called "Relative Exponential-weight algorithm for Exploration and Exploitation" with performance guarantees. We present a thorough empirical study on several information retrieval datasets that confirms the validity of these theoretical results.

The motivating theme behind corrupt feedback is that the feedback the learner receives is a corrupted form of the corresponding reward of the selected arm. In practice, such feedback is available in tasks such as online advertising and recommender systems. We consider two goals for the MAB problem with corrupt feedback: best arm identification and exploration-exploitation. For both goals, we provide lower bounds on the performance measures for any algorithm. We also provide various algorithms for these settings. The main contributions of this part are the algorithms "KLUCB-CF" and "Thompson Sampling-CF", which asymptotically attain the best possible performance. We present experimental results to demonstrate the performance of these algorithms. We also show how this problem setting can be used for the practical application of enforcing differential privacy.
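The three kinds of feedback described in the abstract can be contrasted in a minimal Python sketch. Everything in it is an illustrative assumption rather than material from the thesis: the function names, the Bernoulli reward model, and the choice of randomized response (flipping the reward with a fixed probability) as the corruption. In particular, the thesis studies the adversarial formulation of the relative-feedback problem, whereas this sketch draws stochastic rewards for simplicity.

```python
import random


def absolute_feedback(means, arm, rng):
    """Classical bandit feedback: a Bernoulli reward of the selected arm.

    Assumption: rewards are Bernoulli with the given means.
    """
    return 1 if rng.random() < means[arm] else 0


def relative_feedback(means, arm_a, arm_b, rng):
    """Preference feedback for a pair of arms.

    Returns +1 if arm_a's reward is higher, -1 if arm_b's is higher,
    and 0 on a tie. The individual rewards are not revealed.
    """
    reward_a = absolute_feedback(means, arm_a, rng)
    reward_b = absolute_feedback(means, arm_b, rng)
    return (reward_a > reward_b) - (reward_a < reward_b)


def corrupt_feedback(means, arm, flip_prob, rng):
    """Corrupted feedback: the reward is flipped with probability flip_prob.

    Assumption: randomized response is used here as one possible
    corruption; the abstract does not specify the corruption model.
    """
    reward = absolute_feedback(means, arm, rng)
    return 1 - reward if rng.random() < flip_prob else reward


def corrupted_mean(mean, flip_prob):
    """Mean of the corrupted feedback under the flip model above."""
    return mean * (1 - flip_prob) + (1 - mean) * flip_prob


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.8, 0.5]  # hypothetical arm means
    samples = [corrupt_feedback(means, 0, 0.25, rng) for _ in range(10000)]
    print("empirical corrupted mean:", sum(samples) / len(samples))
    print("predicted corrupted mean:", corrupted_mean(means[0], 0.25))
```

Under this assumed flip model with flip probability below 1/2, the mean of the corrupted feedback is an increasing function of the mean reward, so the ordering of the arms is preserved even though individual rewards are hidden. That is the intuition behind using such corruption as a privacy mechanism.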

Cited literature: 114 references
Contributor : Preux Philippe
Submitted on : Thursday, February 22, 2018 - 8:10:41 AM
Last modification on : Friday, May 17, 2019 - 11:39:17 AM




  • HAL Id : tel-01706640, version 2


Pratik Gajane. Bandits Multi-bras avec retour d'information non-conventionnelle. Artificial Intelligence [cs.AI]. Université de Lille, 2017. English. ⟨tel-01706640v2⟩




