Contents

Motivations

The piece-wise stationary bandit model

Review of related works

The Bernoulli GLRT change-point detector

A new algorithm for piece-wise stationary bandits

Finite-time upper bounds on the regret of GLR-klUCB

Proof of the regret upper bounds

Experimental results for piece-wise stationary bandits

Conclusion

The first optimization uses a parameter Δn ∈ ℕ*. In practice, instead of sub-sampling with respect to the time t, we propose to sub-sample with respect to the number of samples of arm i before calling the GLR test to check for a change on arm i, that is, n_i(t) in Algorithm 7.1. Note that this first heuristic, based on Δn, can be applied to M-UCB as well as to CUSUM-UCB and PHT-UCB.

The second optimization is in the same spirit, and uses a parameter Δs ∈ ℕ*. When running the GLR test on data Z_1, ..., Z_t, instead of considering every split point s ∈ [t], we can skip some of them and run the test not at all time steps s but only every Δs time steps. A sketch combining the two optimizations is given below.
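To make these two optimizations concrete, here is a minimal Python sketch of a Bernoulli GLR detector with the Δs skip, together with the Δn guard around it. It is an illustration under our own naming, not the SMPyBandits implementation: the threshold value beta, standing for β(t, δ), is assumed to be computed elsewhere.

import numpy as np

def kl_bernoulli(p, q, eps=1e-9):
    # Binary Kullback-Leibler divergence kl(p, q), clipped for numerical safety.
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def glr_change_detected(Z, beta, delta_s=1):
    # Bernoulli GLR test on Z = [Z_1, ..., Z_t]: scan the split points s
    # (only every delta_s steps) and report a change as soon as
    #   s * kl(mu_{1:s}, mu_{1:t}) + (t - s) * kl(mu_{s+1:t}, mu_{1:t})
    # exceeds the threshold beta = beta(t, delta).
    t = len(Z)
    sums = np.cumsum(Z)                   # sums[s - 1] = Z_1 + ... + Z_s
    mu_all = sums[-1] / t
    for s in range(delta_s, t, delta_s):  # Delta_s optimization: skip split points
        mu_left = sums[s - 1] / s
        mu_right = (sums[-1] - sums[s - 1]) / (t - s)
        glr = s * kl_bernoulli(mu_left, mu_all) + (t - s) * kl_bernoulli(mu_right, mu_all)
        if glr > beta:
            return True
    return False

# Delta_n optimization: in the bandit loop, only run the detector on arm i
# when its sample count n_i(t) is a multiple of delta_n, e.g.:
#   if n_i % delta_n == 0 and glr_change_detected(rewards_i, beta, delta_s=20):
#       ...  # restart the history of arm i

With Δn = Δs = 20, each detector call scans about t/20 split points and the detector is triggered 20 times less often, which is consistent with the roughly 50-fold speed-up, at almost no cost in regret, reported in the experiments.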

List of Figures

A chart representing the allocation of radio spectrum in the United States of America in 2016

Reinforcement learning cycle: a learner iteratively interacts with its environment through actions, and observes a reward

Organization of the thesis: a reading map

Reinforcement learning cycle in a MAB model, for time steps t = 1, ..., T

Screenshot of the demonstration, at the end of the game after T = 100 steps

Average of the cumulated rewards, as a function of t, for T = 10000 and N = 1000

Mean regret R_t as a function of t, for T = 10000 and N = 1000. The three most efficient algorithms, UCB_1, kl-UCB and Thompson Sampling, achieve logarithmic regret

Histogram of 10000 i.i.d. rewards obtained from three arms with truncated Gaussian distributions, of respective means 0.1, 0.5 and 0.9

… Thompson Sampling is very efficient on average, while UCB shows a larger variance

Regret vs. different values of K

Regret vs. different values of T

Normalized mean regret vs. normalized running time (in micro-seconds)

Normalized running time vs. different values of K

Normalized running time vs. different values of T

Normalized mean regret vs. normalized memory costs (in bytes)

Normalized memory cost vs. different values of K

Normalized memory cost vs. different values of T

Bernoulli problem (semilog-y scale)

On a Bernoulli problem, all algorithms have similar performances, except LearnExp, and our proposal Aggregator outperforms its competitors

On an "easy" Gaussian problem, only Aggregator shows reasonable performances, thanks to Bayes-UCB and Thompson Sampling

On a harder problem, mixing Bernoulli, Gaussian and Exponential arms, with 3 arms of each type with the same mean

The semilog-x scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator

Our system model, with some dynamic devices

The considered time-frequency slotted protocol: each frame is composed of a fixed-duration up-link slot in which the end-devices transmit their (up-link) packets

Performance of two MAB algorithms, UCB and Thompson Sampling (in red), compared to extreme reference policies without learning or with oracle knowledge, when the proportion of dynamic end-devices in the network increases from 10% to 100%

Learning with UCB and Thompson Sampling, with many dynamic devices

Performance of the UCB bandit algorithm in the special case of a uniform distribution of the static devices, when the proportion of intelligent devices in the network increases from 10% to 100%

Schematic of our implementation, presenting the role of each USRP platform

Two pictures showing the SCEE test-bed (2018)

User interface of our demonstration

… less than 100 trials in each channel are sufficient for the two learning devices (UCB and Thompson Sampling) to reach a successful communication rate close to 80%, which is twice as much as the non-learning (uniform) device, which stays around a 40% success rate. Similar performance gains were obtained in other scenarios

Screenshot of the video of our demonstration: youtu.be/HospLNQhcMk

The Markov model of the behavior of all devices paired to the considered IoT network using the ALOHA protocol

Our approximation of the probability of collision at the second transmission

First comparison between the exposed heuristics for the retransmission

Second comparison between the exposed heuristics for the retransmission

The random traffic generator flow-graph

The IoT base station flow-graph

The IoT dynamic device flow-graph

Regret for M = 6 players, K = 9 arms, horizon T = 5000, for 1000 repetitions on a fixed problem

Regret for M = 2 and M = 9 players, K = 9 arms, horizon T = 5000, for a fixed problem

Regret for M = 3 players, K = 9 arms, horizon T = 123456, for 100 repetitions

Regret for M = 6 players, K = 9 arms, horizon T = 5000, against 500 problems µ uniformly sampled

Regret for M = 2 players, K = 9 arms, horizon T = 5000, against 500 problems µ uniformly sampled

Regret for M = K = 9, horizon T = 5000

Problem 1: K = 3 arms with T = 5000, and Υ = 4 break-points, with a change on only one arm at each break-point (i.e., C = 4). The means are in [0, 1], and there are Υ + 1 = 5 stationary intervals of equal lengths. Some changes do not modify the optimal arm (e.g., at t = 1000 and t = 4000) and others do

Problem 2: K = 3 arms with T = 5000, and Υ = 4 break-points where changes occur on all arms (i.e., C = 12). The means are again in [0, 1], and there are also Υ + 1 = 5 stationary intervals of equal lengths

Locations of the detected change-points for four algorithms on Problem 1

Locations of the detected change-points for four algorithms on Problem 2

Problem 3: K = 6, T = 20000, C = 19 changes occur on most arms at Υ = 8 break-points

Problem 4: K = 3, T = 5000, C = 12 changes occur on all arms at Υ = 4 break-points

Problem 5: K = 5, T = 100000, C = 179 changes occur on some arms at Υ = 81 break-points

Mean regret as a function of time, R_t, for horizon T = 5000, for Problem 1

Mean regret as a function of time, R_t, for horizon T = 5000, for Problem 2

Histograms of the distributions of regret R_T (T = 5000) for Problem 1

Mean regret as a function of time, R_t, for horizon T = 5000, for Problem 4. We see that after a "long enough" stationary interval, the algorithms designed for stationary problems lose track of the best arm

Mean regret as a function of time, R_t, for horizon T = 20000

Mean regret as a function of time, R_t, for horizon T = 100000

List of Algorithms

A generic index policy A, using indexes U_k(t) (e.g., UCB_1, kl-UCB, etc.)

Thompson Sampling for Bernoulli rewards, with Beta prior/posteriors

The Aggregator algorithm, aggregating N MAB algorithms A_1, ..., A_N

First-stage UCB and retransmission in the same channel

The MCTopM decentralized learning policy (for an index policy U^j)

List of Code Examples

Example of Python code to create Bernoulli and Gaussian arms, a MAB problem with K = 3 arms, and to plot a histogram of rewards, with SMPyBandits

Code defining the UCB_1 algorithm, as a simple example of an index policy

Example of Bash code to download and install the dependencies of SMPyBandits

Example of Bash code to run a simple experiment with SMPyBandits

List of Tables

Cumulated rewards and regret, for horizons T = 100 and T = 10000

Cumulated rewards and regret, for horizons T = 100 and T = 10000, averaged over N = 1000 independent simulations (j = 1, ..., N)

Using kl-UCB is much more efficient than using UCB, for multi-player bandits (here on a simple problem with K = 9 arms)

Mean regret ± 1 std-dev, for different algorithms on the same problem with M = 3, 6, 9, comparing algorithms which know M against algorithms which estimate M on the fly. All use kl-UCB

Comparison of the mean regret ± 1 std-dev, for different algorithms, on the same problem with M = 3, 6, 9 players, for the "no sensing" case. More work is needed on our implementation of Improved Musical Chair. The results on SIC-MMAB confirm the numerical experiments of its authors

Comparing RhoRand and RhoLearn on a simple MP-MAB problem with K = 9 arms

Mean regret ± 1 std-dev, on Problems 1 and 2 with T = 5000. We conclude that using kl-UCB is much more efficient than using UCB, for non-stationary bandits

Mean regret ± 1 std-dev

Mean regret ± 1 std-dev. Problem 4 uses K = 3 arms, and a first long stationary sequence. Problem 5 uses K = 5, T = 100000, and is much harder, with Υ = 82 break-points and C = 179 changes

Effects of the two optimization parameters Δn and Δs on the mean regret R_T (top) and the mean computation time (bottom), for GLR-klUCB on a simple problem. Using the optimizations with Δn = Δs = 20 barely affects the regret but speeds up the computations by about a factor of 50

Mean regret ± 1 standard deviation, for different choices of the threshold function β(n, δ), on three problems of horizon T = 5000, for GLR-klUCB

Mean regret ± 1 standard deviation, for different choices of exploration mechanisms, on three problems of horizon T = 5000, for GLR-klUCB, with local or global restarts

List of References

J. Audibert and S. Bubeck, Minimax Policies for Adversarial and Stochastic Bandits, Conference on Learning Theory, pp.217-226, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00834882

J. Audibert and S. Bubeck, Regret Bounds And Minimax Policies Under Partial Monitoring, Journal of Machine Learning Research, vol.11, pp.2785-2836, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654356

J. Audibert, S. Bubeck, and R. Munos, Best Arm Identification in Multi-Armed Bandits, Conference on Learning Theory, p.13, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654404

N. Abramson, The ALOHA System: Another Alternative for Computer Communications, Fall Joint Computer Conference, AFIPS '70 (Fall), pp.281-285, 1970.

A. Azari and C. Cavdar, Self-organized Low-power IoT Networks: A Distributed Learning Approach, Global Communications Conference, 2018.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time Analysis of the Multi-armed Bandit Problem, Machine Learning, vol.47, pp.235-256, 2002.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, Gambling in a Rigged Casino: The Adversarial Multi-Armed Bandit Problem, Annual Symposium on Foundations of Computer Science, pp.322-331, 1995.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, The Non-Stochastic Multi-Armed Bandit Problem, SIAM Journal on Computing, vol.32, issue.1, pp.48-77, 2002.

R. Allesiardo and R. Féraud, Exp3 with Drift Detection for the Switching Bandit Problem, International Conference on Data Science and Advanced Analytics, pp.1-7, 2015.

R. Allesiardo, R. Féraud, and O. Maillard, The Non-Stationary Stochastic Multi-Armed Bandit Problem, International Journal of Data Science and Analytics, vol.3, issue.4, pp.267-283, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01575000

S. Agrawal and N. Goyal, Analysis of Thompson sampling for the Multi-Armed Bandit problem, Conference On Learning Theory, pp.36-65, 2012.

P. Auer, P. Gajane, and R. Ortner, Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes, European Workshop on Reinforcement Learning, 2018.

R. Agrawal, Sample mean based index policies with O(log n) regret for the Multi-Armed Bandit problem, Advances in Applied Probability, vol.27, issue.4, pp.1054-1078, 1995.

S. Arora, E. Hazan, and S. Kale, The multiplicative weights update method: a meta-algorithm and applications, Theory of Computing, vol.8, pp.121-164, 2012.

A. Singla, H. Hassani, and A. Krause, Learning to Use Learners' Advice, 2017.

P. Alatur, K. Y. Levy, and A. Krause, Multi-Player Bandits: The Adversarial Case, 2019.

A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire, Corralling a Band of Bandit Algorithms, Conference on Learning Theory, pp.12-38, 2017.

I. F. Akyildiz, W. Lee, M. C. Vuran, and S. Mohanty, NeXt Generation / Dynamic Spectrum Access / Cognitive Radio Wireless Networks: A Survey, Computer Networks, vol.50, issue.13, pp.2127-2159, 2006.

O. Avner and S. Mannor, Learning to Coordinate Without Communication in Multi-User Multi-Armed Bandit Problems, 2015.

O. Avner and S. Mannor, Multi-User Lax Communications: a Multi-Armed Bandit Approach, International Conference on Computer Communications, 2016.

O. Avner and S. Mannor, Multi-User Communication Networks: A Coordinated Multi-Armed Bandit Approach, 2018.

R. Alami, O. Maillard, and R. Féraud, Memory Bandits: Towards the Switching Bandit Problem Best Resolution, Conference on Neural Information Processing Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01879251

A. Anandkumar, N. Michael, and A. K. Tang, Opportunistic Spectrum Access with multiple users: Learning under competition, International Conference on Computer Communications, 2010.

A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret, Journal on Selected Areas in Communications, vol.29, issue.4, pp.731-745, 2011.

P. Auer and R. Ortner, UCB Revisited: Improved Regret Bounds For The Stochastic Multi-Armed Bandit Problem, Periodica Mathematica Hungarica, vol.61, issue.1-2, pp.55-65, 2010.

V. Anantharam, P. Varaiya, and J. Walrand, Asymptotically efficient allocation rules for the Multi-Armed Bandit problem with multiple plays -Part I: IID rewards, Transactions on Automatic Control, vol.32, issue.11, pp.968-976, 1987.

V. Anantharam, P. Varaiya, and J. Walrand, Asymptotically efficient allocation rules for the Multi-Armed Bandit problem with multiple plays -Part II: Markovian rewards, Transactions on Automatic Control, vol.32, issue.11, pp.977-982, 1987.

G. Brandl et al., Sphinx: Python documentation generator, 2018.

G. A. Barnard, Control charts and stochastic processes, Journal of the Royal Statistical Society. Series B (Methodological), pp.239-271, 1959.

R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot, Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-Stationary Settings, 12th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01575419

L. Besson, R. Bonnefoi, and C. Moy, Multi-Arm Bandit Algorithms for Internet of Things Networks: A TestBed Implementation and Demonstration that Learning Helps, Demonstration presented at International Conference on Telecommunications, 2018.

L. Besson, R. Bonnefoi, and C. Moy, Multi-Armed Bandits Learning for Internet-of-Things Networks, IEEE, following a demonstration presented at the International Conference on Telecommunications (ICT), 2018.
URL : https://hal.archives-ouvertes.fr/hal-02006825

R. Bonnefoi, L. Besson, J. C. Manco-Vasquez, and C. Moy, Upper-Confidence Bound for Channel Selection in LPWA Networks with Retransmissions, MOTIoN Workshop, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02049824

S. Behnel, R. Bradshaw, D. S. Seljebotn, G. Ewing, W. Stein et al., Cython: C-extensions for python, 2019.

S. Bubeck and N. Cesa-Bianchi, Regret Analysis of Stochastic and Non-Stochastic Multi-Armed Bandit Problems, Foundations and Trends® in Machine Learning, vol.5, pp.1-122, 2012.

L. Besson, SMPyBandits: an Experimental Framework for Single and Multi-Players Multi-Arms Bandits Algorithms in Python, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01840022

L. Besson, SMPyBandits: an Open-Source Research Framework for Single and Multi-Players Multi-Arms Bandits (MAB) Algorithms in Python, 2016-2019.

O. Besbes, Y. Gur, and A. Zeevi, Stochastic Multi-Armed Bandit Problem with Non-Stationary Rewards, Advances in Neural Information Processing Systems, pp.199-207, 2014.

L. Besson and E. Kaufmann, Multi-Player Bandits Revisited, Algorithmic Learning Theory, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01629733

L. Besson and E. Kaufmann, What Doubling Trick Can and Can't Do for Multi-Armed Bandits, February 2018.

L. Besson and E. Kaufmann, Analyse non asymptotique d'un test séquentiel de détection de ruptures et application aux bandits non stationnaires. GRETSI, 2019.

L. Besson and E. Kaufmann, Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits, 2019.

L. Besson, E. Kaufmann, and C. Moy, Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Wireless Communications and Networking Conference, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01705292

I. Bistritz and A. Leshem, Distributed Multi-Player Bandits: a Game Of Thrones Approach, Advances in Neural Information Processing Systems, pp.7222-7232, 2018.

S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00794821

A. Baransi, O. Maillard, and S. Mannor, Sub-sampling for Multi-armed Bandits, Proceedings of the European Conference on Machine Learning, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01025651

M. Basseville and I. Nikiforov, Detection of Abrupt Changes: Theory And Application, vol.104, 1993.
URL : https://hal.archives-ouvertes.fr/hal-00008518

Q. Bodinier, Coexistence of Communication Systems Based on Enhanced Multi-Carrier Waveforms with Legacy OFDM Networks, 2017.
URL : https://hal.archives-ouvertes.fr/tel-01731022

I. Bicking and PyPA, Virtualenv: a tool to create isolated Python environments, 2016.

E. Boursier and V. Perchet, SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02371008

D. Bouneffouf and I. Rish, A Survey on Practical Applications of Multi-Armed and Contextual Bandits, under review by IJCAI 2019 Survey, 2019.

S. Bubeck and A. Slivkins, The Best of Both Worlds: Stochastic and Adversarial Bandits, Conference on Learning Theory, pp.42-43, 2012.

S. Boyd and L. Vandenberghe, Convex Optimization, 2004.

M. Bande and V. V. Veeravalli, Adversarial Multi-User Bandits for Uncoordinated Spectrum Access, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4514-4518, 2019.

H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, The Annals of Mathematical Statistics, vol.23, issue.4, pp.493-507, 1952.

A. Collette, HDF5 for Python, 2018.

N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, 2006.

R. Corless, G. Gonnet, D. Hare, D. Jeffrey, and D. Knuth, On the Lambert W Function, Advances in Computational Mathematics, pp.329-359, 1996.

O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz, Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Annals of Statistics, vol.41, issue.3, pp.1516-1541, 2013.

H. Chernoff, A note on an inequality involving the normal distribution. The Annals of Probability, pp.533-535, 1981.

Y. Chen, C. Lee, H. Luo, and C. Wei, A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free, Conference on Learning Theory, vol.99, pp.1-30, 2019.

R. Combes, S. Magureanu, and A. Proutiere, Minimal Exploration in Structured Stochastic Bandits, Advances in Neural Information Processing Systems, pp.1761-1769, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02395029

M. Centenaro, L. Vangelista, A. Zanella, and M. Zorzi, Long-range communications in unlicensed bands: the rising stars in the IoT and smart city scenarios, Wireless Communications, vol.23, issue.5, pp.60-67, 2016.

Y. Cao, W. Zheng, B. Kveton, and Y. Xie, Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach, International Conference on Artificial Intelligence and Statistics, 2019.

S. J. Darak and M. K. Hanawal, Distributed Learning and Stable Orthogonalization in Ad-Hoc Networks with Heterogeneous Channels, 2018.

S. J. Darak, N. Modi, A. Nafkha, and C. Moy, Spectrum Utilization and Reconfiguration Cost Comparison of Various Decision Making Policies for Opportunistic Spectrum Access Using Real Radio Signals, 11th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01451466

S. J. Darak, C. Moy, and J. Palicot, Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, EAI Endorsed Transactions on Cognitive Communications, vol.2, pp.1-10, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01458815

S. J. Darak, A. Nafkha, C. Moy, and J. Palicot, Is Bayesian Multi Armed Bandit Algorithm Superior? Proof of Concept for Opportunistic Spectrum Access in Decentralized Networks, 11th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication, 2016.
URL : https://hal.archives-ouvertes.fr/hal-02420128

R. Degenne and V. Perchet, Anytime Optimal Algorithms In Stochastic Multi Armed Bandits, International Conference on Machine Learning, pp.1587-1595, 2016.

Ettus Research, USRP Hardware Driver (UHD), 2018.

Python Software Foundation, 2017.

A. Garhwal and P. P. Bhattacharya, A Survey on Dynamic Spectrum Access Techniques for Cognitive Radio, International Journal of Next-Generation Networks (IJNGN), vol.3, issue.4, 2012.

G. Gautier, R. Bardenet, and M. Valko, DPPy: Sampling Determinantal Point Processes with Python, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01879424

A. Garivier and O. Cappé, The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, Conference on Learning Theory, pp.359-376, 2011.

N. Gupta, O.-C. Granmo, and A. Agrawala, Thompson Sampling for Dynamic Multi-Armed Bandits, International Conference on Machine Learning and Applications Workshops, pp.484-489, 2011.

A. Garivier, H. Hadiji, P. Menard, and G. Stoltz, KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01785705

A. Garivier and E. Kaufmann, Optimal Best Arm Identification with Fixed Confidence, Conference on Learning Theory, vol.49, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01273838

A. Garivier, E. Kaufmann, and T. Lattimore, On Explore-Then-Commit Strategies, Advances in Neural Information Processing Systems, vol.29, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01322906

A. Garivier and E. Moulines, On Upper-Confidence Bound Policies For Switching Bandit Problems, Algorithmic Learning Theory, pp.174-188, 2011.

, GNU Radio Companion Documentation and Website, 2018.

E. Hazan, Introduction to online convex optimization, Foundations and Trends® in Optimization, vol.2, issue.3-4, pp.157-325, 2016.

S. Haykin, Cognitive Radio: Brain-Empowered Wireless Communications, Journal on Selected Areas in Communications, vol.23, issue.2, pp.201-220, 2005.

C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag, Multi-Armed Bandit, Dynamic Environments and Meta-Bandits, NeurIPS 2006 Workshop on Online Trading Between Exploration And Exploitation, 2006.

W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American statistical association, vol.58, issue.301, pp.13-30, 1963.

J. Honda, A Note on KL-UCB+ Policy for the Stochastic Bandit, 2019.

J. D. Hunter, Matplotlib: a 2D Graphics Environment, Computing In Science & Engineering, vol.9, issue.3, pp.90-95, 2007.

Anaconda Inc., Numba: NumPy aware dynamic Python compiler using LLVM, 2017.

W. Jouini, D. Ernst, C. Moy, and J. Palicot, Multi-Armed Bandit Based Policies for Cognitive Radio's Decision Making Issues, International Conference Signals, Circuits and Systems, 2009.

W. Jouini, D. Ernst, C. Moy, and J. Palicot, Upper Confidence Bound Based Decision Making Strategies and Dynamic Spectrum Access, International Conference on Communications, pp.1-5, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00489331

H. Joshi, R. Kumar, A. Yadav, and S. J. Darak, Distributed Algorithm for Dynamic Spectrum Access in Infrastructure-Less Cognitive Radio Network, 2018 IEEE Wireless Communications and Networking Conference (WCNC), pp.1-6, 2018.

W. Jouini, C. Moy, and J. Palicot, Decision Making for Cognitive Radio Equipment: Analysis of the First 10 Years of Exploration, EURASIP Journal on Wireless Communications and Networking, vol.2012, issue.1, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00682511

E. Jones, T. E. Oliphant, and P. Peterson, SciPy: Open source scientific tools for Python, JOP, 2001.

M. Jordan, Stat 260/CS 294, 2010.

W. Jouini, Contribution to Learning and Decision Making under Uncertainty for Cognitive Radio, 2017.
URL : https://hal.archives-ouvertes.fr/tel-00765437

T. Kluyver, Jupyter Notebooks -a publishing format for reproducible computational workflows, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pp.87-90, 2016.

R. Kerkouche, R. Alami, R. Féraud, N. Varsier, and P. Maillé, Node-based optimization of LoRa transmissions with Multi-Armed Bandit algorithms, International Conference on Telecommunications, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01946456

E. Kaufmann, O. Cappé, and A. Garivier, On Bayesian Upper Confidence Bounds for Bandit Problems, International Conference on Artificial Intelligence and Statistics, pp.592-600, 2012.
URL : https://hal.archives-ouvertes.fr/hal-02286440

R. Kumar, S. J. Darak, M. K. Hanawal, A. K. Sharma, and R. K. Tripathi, Distributed Algorithm for Learning to Coordinate in Infrastructure-Less Network, IEEE Communications Letters, vol.23, issue.2, pp.362-365, 2019.

R. Kumar, S. J. Darak, A. Yadav, A. K. Sharma, and R. K. Tripathi, Two-stage Decision Making Policy for Opportunistic Spectrum Access and Validation on USRP Testbed, Wireless Networks, pp.1-15, 2016.

R. Kumar, S. J. Darak, A. Yadav, A. K. Sharma, and R. K. Tripathi, Channel Selection for Secondary Users in Decentralized Network of Unknown Size, IEEE Communications Letters, vol.21, issue.10, pp.2186-2189, 2017.

J. Komiyama, J. Honda, and H. Nakagawa, Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-Armed Bandit Problem with Multiple Plays, International Conference on Machine Learning, vol.37, pp.1152-1161, 2015.

E. Kaufmann and W. M. Koolen, Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01886612

E. Kaufmann, N. Korda, and R. Munos, Thompson Sampling: an Asymptotically Optimal Finite-Time Analysis, Algorithmic Learning Theory, pp.199-213, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00830033

S. Kullback and R. A. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics, vol.22, issue.1, pp.79-86, 1951.

E. Kaufmann and A. Mehrabian, New Algorithms for Multiplayer Bandits when Arm Means Vary Among Players, 2019.

D. Kalathil, N. Nayyar, and R. Jain, Decentralized Learning for Multi-Player Multi-Armed Bandits, Conference on Decision and Control, 2012.

L. Kocsis and C. Szepesvári, Discounted UCB, 2nd PASCAL Challenges Workshop, 2006.

B. Kveton, C. Szepesvari, M. Ghavamzadeh, and C. Boutilier, Perturbed-History Exploration in Stochastic Multi-Armed Bandits, 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), 2019.

B. Kim and A. Tewari, On the Optimality of Perturbations in Stochastic and Adversarial Multi-Armed Bandit Problems, 2019.

R. Kumar, A. Yadav, S. J. Darak, and M. K. Hanawal, Trekking Based Distributed Algorithm for Opportunistic Spectrum Access in Infrastructure-Less Network, 16th International Symposium on Modeling and Optimization in Mobile, Ad-Hoc, and Wireless Networks (WiOpt), pp.1-8, 2018.

T. Lattimore, Library for Multi-Armed Bandit Algorithms, 2016.

T. Lattimore, Regret Analysis Of The Finite Horizon Gittins Index Strategy For Multi Armed Bandits, Conference on Learning Theory, pp.1214-1245, 2016.

T. Lattimore, Refining the confidence level for optimistic bandit strategies, The Journal of Machine Learning Research, vol.19, issue.1, pp.765-796, 2018.

M. López-Benítez, F. Casadevall, A. Umbert, J. Pérez-Romero, R. Hachemani et al., Spectral Occupation Measurements and Blind Standard Recognition Sensor for Cognitive Radio Networks, 4th International Conference on Cognitive Radio Oriented Wireless Networks and Communications, pp.1-9, 2009.

L. Li, W. Chu, J. Langford, and R. E. Schapire, A Contextual-Bandit Approach to Personalized News Article Recommendation, International Conference on World Wide Web, pp.661-670, 2010.

A. Luedtke, E. Kaufmann, and A. Chambaz, Asymptotically Optimal Algorithms for Budgeted Multiple Play Bandits, Machine Learning, pp.1-31, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01338733

H. Li, J. Luo, and C. Liu, Selfish Bandit based Cognitive Anti-jamming Strategy for Aeronautic Swarm Network in Presence of Multiple Jammers, 2019.

F. Liu, J. Lee, and N. Shroff, A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem, The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), 2018.

G. Lugosi and A. Mehrabian, Multiplayer bandits without observing collision information, 2018.

T. L. Lai and H. Robbins, Asymptotically Efficient Adaptive Allocation Rules, Advances in Applied Mathematics, vol.6, issue.1, pp.4-22, 1985.

J. Louëdec, L. Rossi, M. Chevalier, A. Garivier, and J. Mothe, Algorithme de bandit et obsolescence : un modèle pour la recommandation, 18ème Conférence francophone sur l'Apprentissage Automatique, 2016.

T. Lattimore and C. Szepesvári, Bandit Algorithms, draft, 2019.

D. G. Luenberger, Quasi-Convex Programming, SIAM Journal on Applied Mathematics, vol.16, issue.5, pp.1090-1095, 1968.

H. Luo, C. Wei, A. Agarwal, and J. Langford, Efficient Contextual Bandits in Nonstationary Worlds, Proceedings of the 31st Conference On Learning Theory, vol.75, pp.1739-1776, 2018.

T. L. Lai and H. Xing, Sequential change-point detection when the pre- and post-change parameters are unknown, Sequential Analysis, vol.29, issue.2, pp.162-175, 2010.

K. Liu and Q. Zhao, A Restless Bandit Formulation of Opportunistic Access: Indexability and Index Policy, Annual Communications Society Conference on Sensor, Mesh and Ad-Hoc Communications and Networks Workshops, 2008.

K. Liu and Q. Zhao, Distributed Learning in Multi-Armed Bandit with Multiple Players, Transaction on Signal Processing, vol.58, issue.11, pp.5667-5681, 2010.

O. Maillard, Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds, Algorithmic Learning Theory, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02351665

C. Moy and L. Besson, Decentralized Spectrum Learning for IoT Wireless Networks Collision Mitigation, ISIoT workshop, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02144465

C. Moy, L. Besson, G. Delbarre, and L. Toutain, Decentralized Spectrum Learning for Radio Collision Mitigation in Ultra-Dense IoT Networks: LoRaWAN Case Study and Measurements, Machine Learning for Intelligent Wireless Communications and Networking, 2019.

P. Ménard and A. Garivier, A Minimax and Asymptotically Optimal Algorithm for Stochastic Bandits, Algorithmic Learning Theory, vol.76, pp.223-237, 2017.

L. Melián-Gutiérrez, N. Modi, C. Moy, F. Bader, I. Pérez-Álvarez, and S. Zazo, Hybrid UCB-HMM: A Machine Learning Strategy for Cognitive Radio in HF Band, IEEE Transactions on Cognitive Communications and Networking, vol.1, issue.3, pp.347-358, 2015.

J. Mitola and G. Q. Maguire, Cognitive Radio: making software radios more personal, Personal Communications, vol.6, issue.4, pp.13-18, 1999.

O. Maillard and R. Munos, Adaptive Bandits: Towards the best history-dependent strategy, International Conference on Artificial Intelligence and Statistics, pp.570-578, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00574999

J. Marinho and E. Monteiro, Cognitive Radio: Survey on Communication Protocols, Spectrum Decision Issues, and Future Research Directions, Wireless Networks, vol.18, pp.147-164, 2012.

J. Mourtada and O. Maillard, Efficient Tracking of a Growing Number of Experts, Algorithmic Learning Theory, vol.76, pp.1-23, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01615424

N. Modi, Machine Learning and Statistical Decision Making for Green Radio, 2017.
URL : https://hal.archives-ouvertes.fr/tel-01668536

C. Moy, Reinforcement Learning Real Experiments for Opportunistic Spectrum Access, WSR'14, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00994975

J. Mellor and J. Shapiro, Thompson Sampling in Switching Environments with Bayesian Online Change Detection, International Conference on Artificial Intelligence and Statistics, pp.442-450, 2013.

A. Maskooki, V. Toldov, L. Clavier, V. Loscrí, and N. Mitton, Competition: Channel Exploration/Exploitation Based on a Thompson Sampling Approach in a Radio Cognitive Environment, International Conference on Embedded Wireless Systems and Networks, 2016.

O. Naparstek and K. Cohen, Deep Multi-User Reinforcement Learning for Dynamic Spectrum Access in Multichannel Wireless Networks, GLOBECOM 2017 -2017 IEEE Global Communications Conference, pp.1-7, 2017.

J. R. Norris, Markov Chains, of Cambridge Series in Statistical and Probabilistic Mathematics, vol.2, 1998.

Ettus Research, OctoClock Clock Distribution Module with GPSDO, 2018.

F. Pérez and B. E. Granger, IPython: a System for Interactive Scientific Computing, Computing in Science and Engineering, vol.9, issue.3, pp.21-29, 2007.

V. Patil, G. Ghalme, V. Nair, and Y. Narahari, Stochastic Multi-Armed Bandits with Arm-specific Fairness Guarantees, 2019.

K. Patil, R. Prasad, and K. Skouby, A Survey of Worldwide Spectrum Occupancy Measurement Campaigns for Cognitive Radio, 2011 International Conference on Devices and Communications (ICDeCom), pp.1-5, 2011.

V. Raj, A Julia Package for providing Multi Armed Bandit Experiments, 2017.

V. Raj and S. Kalyani, Taming Non-Stationary Bandits: a Bayesian Approach, 2017.

U. Raza, P. Kulkarni, and M. Sooriyabandara, Low Power Wide Area Networks (LP-WAN): An Overview, Communications Surveys Tutorials, vol.19, issue.2, pp.855-873, 2017.

C. Robert, C. Moy, and H. Zhang, Opportunistic Spectrum Access Learning Proof of Concept, SDR-WinnComm'14, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00994940

H. Robbins, Some Aspects of the Sequential Design of Experiments, Bulletin of the American Mathematical Society, vol.58, issue.5, pp.527-535, 1952.

L. G. Roberts, ALOHA Packet System With and Without Slots and Capture, SIGCOMM Computer Communication Review, vol.5, issue.2, pp.28-42, 1975.

J. Rosenski, O. Shamir, and L. Szlak, Multi-Player Bandits -A Musical Chairs Approach, International Conference on Machine Learning, pp.155-163, 2016.

M. Subhedar and G. Birajdar, Spectrum Sensing Techniques in Cognitive Radio Networks: a Survey, International Journal of Next-Generation Networks, vol.3, issue.2, pp.37-51, 2011.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An introduction, 2018.

S. Sawant, R. Kumar, M. K. Hanawal, and S. J. Darak, Learning to Coordinate in a Decentralized Cognitive Radio Network in Presence of Jammers, 16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), 2018.

Y. Seldin and G. Lugosi, An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, Conference on Learning Theory, vol.65, pp.1-17, 2017.

J. Seznec, A. Locatelli, A. Carpentier, A. Lazaric, and M. Valko, Rotting Bandits Are No Harder Than Stochastic Ones, International Conference on Artificial Intelligence and Statistics, 2019.

A. Slivkins, Introduction to Multi-Armed Bandits, 2019.

D. Siegmund and E. S. Venkatraman, Using the Generalized Likelihood Ratio Statistic for Sequential Detection of a Change Point, The Annals of Statistics, pp.255-271, 1995.

V. Toldov, L. Clavier, V. Loscrí, and N. Mitton, A Thompson Sampling Approach to Channel Exploration Exploitation Problem in Multihop Cognitive Radio Networks, PIMRC, pp.1-6, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01355002

F. S. Truzzi, V. F. Silva, A. H. Costa, and F. Gagliardi Cozman, AdBandit: a New Algorithm for Multi-Armed Bandits, ENIAC, issue.1, 2013.

W. R. Thompson, On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika, vol.25, 1933.

C. Tekin and M. Liu, Online Learning in Decentralized Multi-User Spectrum Access with Synchronized Explorations, Military Communications Conference, 2012.

H. Tibrewal, S. Patchala, M. K. Hanawal, and S. J. Darak, Distributed Learning and Optimal Assignment in Multiplayer Heterogeneous Networks, IEEE Conference on Computer Communications (INFOCOM 2019), pp.1693-1701, 2019.

T. Koren, R. Livni, and Y. Mansour, Bandits with Movement Costs and Adaptive Pricing, Conference on Learning Theory, vol.65, pp.1242-1268, 2017.

C. Tao, Q. Zhang, and Y. Zhou, Collaborative Learning with Limited Interaction: Tight Bounds for Distributed Exploration in Multi-Armed Bandits, 2019.

M. Valko, Bandits on Graphs and Structures, Habilitation thesis (HDR), 2016.
URL : https://hal.archives-ouvertes.fr/tel-01359757

G. Varoquaux, Joblib: running Python functions as pipeline jobs, 2017.

S. van der Walt, S. C. Colbert, and G. Varoquaux, The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, vol.13, issue.2, pp.22-30, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00564007

V. Valenta, R. Maršálek, G. Baudoin, M. Villegas, M. Suarez et al., Survey on spectrum utilization in Europe: Measurements, analyses and observations, 5th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication, pp.1-5, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00492021

M. Waskom, Seaborn: statistical data visualization, 2017.

A. Wald, Some Generalizations of the Theory of Cumulative Sums of Random Variables, The Annals of Mathematical Statistics, vol.16, issue.3, pp.287-293, 1945.

F. Wilhelmi, S. Barrachina-Muñoz, B. Bellalta, C. Cano, A. Jonsson, and G. Neu, Potential and Pitfalls of Multi-Armed Bandits for Decentralized Spatial Reuse in WLANs, Journal of Network and Computer Applications, vol.127, pp.26-42, 2019.

F. Wilhelmi, C. Cano, G. Neu, B. Bellalta, A. Jonsson, and S. Barrachina-Muñoz, Collaborative Spatial Reuse in Wireless Networks via Selfish Multi-Armed Bandits, 2019.

Y. Wang, J. Hu, X. Chen, and L. Wang, Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret, 2019.

P. Whittle, Restless bandits: Activity allocation in a changing world, Journal of Applied Probability, vol.25, pp.287-298, 1988.

S. S. Wilks, The large-sample distribution of the likelihood ratio for testing composite hypotheses, The Annals of Mathematical Statistics, vol.9, issue.1, pp.60-62, 1938.

L. Wei and V. Srivastava, On Distributed Multi-player Multi-Armed Bandit Problems in Abruptly-Changing Environment, Conference on Decision and Control, pp.5783-5788, 2018.

L. Wei and V. Srivastava, On Abruptly-Changing and Slowly-Varying Multi-Armed Bandit Problems, American Control Conference, pp.6291-6296, 2018.

T. Yucek and H. Arslan, A Survey of Spectrum Sensing Algorithms for Cognitive Radio Applications, IEEE Communications Surveys & Tutorials, vol.11, issue.1, pp.116-130, 2009.

M. E. Yaari, A Note on Separability and Quasiconcavity, Econometrica, vol.45, issue.5, pp.1183-1186, 1977.

X. Yang, A. Fapojuwo, and E. Egbogah, Performance Analysis and Parameter Optimization of Random Access Backoff Algorithm in LTE, Vehicular Technology Conference, pp.1-5, 2012.

J. Y. Yu and S. Mannor, Piecewise-Stationary Bandit Problems with Side Observations, International Conference on Machine Learning, pp.1177-1184, 2009.

Z. Tian, J. Wang, J. Wang, and J. Song, Distributed NOMA-Based Multi-Armed Bandit Approach for Channel Access in Cognitive Radio Networks, IEEE Wireless Communications Letters, pp.1-4, 2019.

S. M. Zafaruddin, I. Bistritz, A. Leshem, and D. Niyato, Distributed Learning for Channel Allocation Over a Shared Spectrum, 20th IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2019.

Q. Zhao and B. M. Sadler, A Survey of Dynamic Spectrum Access, Signal Processing magazine, vol.24, issue.3, pp.79-89, 2007.

J. Zimmert and Y. Seldin, An Optimal Algorithm for Stochastic and Adversarial Bandits, Proceedings of Machine Learning Research, vol.89, pp.467-475, 2019.

G. R. R. Martin, A Game of Thrones, 1996.