Sequential decision problems in non-stationary environments

Yoan Russac

Abstract

The vanilla bandit model assumes that the rewards are independent and identically distributed. This assumption is restrictive: it cannot model the evolving behaviors that are common in real-world applications. In the medical domain, the efficacy of a treatment is likely to diminish over time. The opening rate of a news article fades as the article ages. Fashion trends and consumer preferences evolve rapidly. Any recommender system that ignores the non-stationarity of the reward distributions is likely to make suboptimal choices. The objective of this thesis is the study of stochastic bandit algorithms in non-stationary environments. There are several ways to include non-stationarity in bandit models. We first study a variant of the best arm identification problem in which the learner seeks to identify the set of arms that are better than a control arm in the presence of subpopulations. These subpopulations can encode temporal information (e.g. the day of the week), and using them properly makes it possible to include non-stationarity in the pure exploration setting. We characterize the complexity of this learning task and propose optimal algorithms for solving it. We then propose theoretically grounded algorithms for minimizing the regret and discuss the exploration-exploitation trade-off that the learner faces in dynamically changing environments. Our findings concern three settings: the well-known multi-armed bandit, the more general linear bandit, and the generalized linear bandit. For each of these settings, we identify the technical challenges brought by non-stationarity.

The classical version of the bandit model assumes that the rewards are independent and identically distributed. This assumption is restrictive in many cases, however, since it cannot account for possible changes in behavior. In the medical domain, the efficacy of a treatment may diminish over time. For a real-time news website, the view rate of a page decreases with its age. Fashion trends and consumer preferences evolve rapidly. A recommendation algorithm that ignores these forms of non-stationarity is therefore likely to make suboptimal suggestions. The object of this thesis is thus the study of stochastic bandit algorithms in non-stationary environments. Non-stationarity can be incorporated into bandit models in several ways. We first study a variant of the best arm identification problem. This variant corresponds to a learner that seeks to identify the set of options that are better than a control arm, in the presence of subpopulations. Among other things, the use of subpopulations makes it possible to model the temporal evolution of the different arms. We then propose algorithms with strong theoretical guarantees for regret minimization and study the exploration-exploitation trade-off in such environments. Our research covers three different models: the classical multi-armed bandit, the linear bandit, and the generalized linear bandit. We examine the specifics of each of these three models, as well as the technical challenges associated with non-stationarity.
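
As a rough illustration of the exploration-exploitation trade-off in a changing environment, the following Python sketch implements a discounted UCB rule for the multi-armed bandit, a standard idea from the non-stationary bandit literature. It is not presented as one of the algorithms of this thesis: the function name `discounted_ucb`, the parameters `gamma` and `bonus`, and the toy environment `shifting_env` are all assumptions made for this example.

```python
import math
import random


def discounted_ucb(reward_fn, n_arms, horizon, gamma=0.95, bonus=1.0, seed=0):
    """Illustrative discounted UCB; returns the list of arms played.

    reward_fn(arm, t, rng) is a hypothetical callback returning the reward
    of `arm` at round `t`. Past observations are down-weighted by `gamma`
    at every round, so old rewards are gradually forgotten.
    """
    rng = random.Random(seed)
    counts = [0.0] * n_arms  # discounted number of pulls per arm
    sums = [0.0] * n_arms    # discounted sum of rewards per arm
    choices = []
    for t in range(horizon):
        if t < n_arms:
            arm = t  # play each arm once to initialise the statistics
        else:
            total = max(1.0, sum(counts))  # keeps the logarithm non-negative

            def index(a):
                mean = sums[a] / counts[a]
                width = bonus * math.sqrt(math.log(total) / counts[a])
                return mean + width  # optimistic estimate of the arm's value

            arm = max(range(n_arms), key=index)
        reward = reward_fn(arm, t, rng)
        # Discount every arm's statistics, then record the new observation.
        for a in range(n_arms):
            counts[a] *= gamma
            sums[a] *= gamma
        counts[arm] += 1.0
        sums[arm] += reward
        choices.append(arm)
    return choices


def shifting_env(arm, t, rng):
    """Toy environment (an assumption): arm 0's mean drops at round 500."""
    means = (0.9, 0.5) if t < 500 else (0.1, 0.5)
    return 1.0 if rng.random() < means[arm] else 0.0


if __name__ == "__main__":
    plays = discounted_ucb(shifting_env, n_arms=2, horizon=1000)
    print("share of arm 1 before the change:", plays[:500].count(1) / 500)
    print("share of arm 1 after the change:", plays[500:].count(1) / 500)
```

The discount factor `gamma` sets how quickly old observations are forgotten. A value close to 1 suits slowly changing rewards, while a smaller value reacts faster to abrupt changes at the cost of noisier estimates, and therefore more exploration.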

