…, to finally understand whether it is possible to scale with sp(h*) (at least in communicating MDPs) instead of the diameter D ≥ sp(h*) (the flaw in REGAL).

…, we will show that achieving a regret scaling with sp(h*) instead of D is at least possible when the value sp(h*) is known and given as input to the learning algorithm.

5.8. SCAL*: SCAL with tighter optimism

Y. Abbasi-Yadkori and C. Szepesvári, Bayesian optimal control of smoothly parameterized systems, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, p.93, 2015.

S. Agrawal and R. Jia, Optimistic posterior sampling for reinforcement learning: worst-case regret bounds, NIPS, vol.42, p.87, 2017.

J. Audibert, R. Munos, and C. Szepesvári, Tuning bandit algorithms in stochastic environments, Algorithmic Learning Theory, p.52, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00203487

J. Audibert, R. Munos, and C. Szepesvári, Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Theor. Comput. Sci, vol.410, issue.19, pp.1876-1902, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00711069

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, Gambling in a rigged casino: The adversarial multi-armed bandit problem, Proceedings of IEEE 36th Annual Foundations of Computer Science, vol.92, pp.322-331, 1995.

P. Auer and R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, Advances in Neural Information Processing Systems, vol.19, p.43, 2007.

M. G. Azar, I. Osband, and R. Munos, Minimax regret bounds for reinforcement learning, Proceedings of the 34th International Conference on Machine Learning, vol.70, p.121, 2017.

P. Bacon and D. Precup, The option-critic architecture, NIPS'15 Deep Reinforcement Learning Workshop, p.157, 2015.

P. L. Bartlett and A. Tewari, REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs, UAI, vol.17, p.119, 2009.

M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton et al., Unifying count-based exploration and intrinsic motivation, NIPS, vol.120, pp.1471-1479, 2016.

R. Bellman, The theory of dynamic programming, Bull. Amer. Math. Soc, vol.60, issue.6, p.17, 1954.

D. P. Bertsekas, Dynamic programming and optimal control, vol.34, p.127, 1995.

D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, vol.II, 2007.

S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, OUP Oxford, vol.52, p.53, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00794821

P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, vol.21, p.184, 1999.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman et al., OpenAI Gym, CoRR, 2016.

E. Brunskill and L. Li, PAC-inspired Option Discovery in Lifelong Reinforcement Learning, Proceedings of the 31st International Conference on Machine Learning, vol.32, p.217, 2014.

A. N. Burnetas and M. N. Katehakis, Optimal adaptive policies for markov decision processes, Mathematics of Operations Research, vol.22, issue.1, p.47, 1997.

P. S. Castro and D. Precup, Automatic construction of temporally extended actions for mdps using bisimulation metrics, Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning, EWRL'11, vol.157, pp.140-152, 2012.

N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, p.92, 2006.

Ö. Şimşek and A. G. Barto, Using relative novelty to identify useful temporal abstractions in reinforcement learning, Proceedings of the Twenty-first International Conference on Machine Learning, 2004.

F. Dai and M. Walter, Maximum expected hitting cost of a Markov decision process and informativeness of rewards, Advances in Neural Information Processing Systems, vol.32, p.64, 2019.

C. Dann and E. Brunskill, Sample complexity of episodic fixed-horizon reinforcement learning, Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 15, vol.53, p.70, 2015.

C. Dann, T. Lattimore, and E. Brunskill, Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning, NIPS, vol.59, pp.5717-5727, 2017.

T. G. Dietterich, Hierarchical reinforcement learning with the maxq value function decomposition, Journal of Artificial Intelligence Research, vol.13, pp.227-303, 2000.

S. Filippi, O. Cappé, and A. Garivier, Optimism in Reinforcement Learning and Kullback-Leibler Divergence. This work has been accepted and presented at ALLERTON 2010, p.48, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00476116

D. A. Freedman, On tail probabilities for martingales, Ann. Probab, vol.3, issue.1, pp.100-118, 1975.

R. Fruit and A. Lazaric, Exploration-Exploitation in MDPs with Options, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol.54, p.197, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01493567

R. Fruit, M. Pirotta, and A. Lazaric, Near optimal exploration-exploitation in non-communicating Markov decision processes, Advances in Neural Information Processing Systems, vol.31, p.93, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01941220

R. Fruit, M. Pirotta, A. Lazaric, and E. Brunskill, Regret minimization in mdps with options without prior knowledge, NIPS, vol.158, pp.3169-3179, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01649082

R. Fruit, M. Pirotta, A. Lazaric, and R. Ortner, Efficient bias-span-constrained exploration-exploitation in reinforcement learning, Proceedings of the 35th International Conference on Machine Learning, vol.121, p.213, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01941206

A. Garivier, P. Ménard, and G. Stoltz, Explore first, exploit next: The true shape of regret in bandit problems, Mathematics of Operations Research, p.41, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01276324

R. Givan, S. Leach, and T. Dean, Bounded-parameter markov decision processes, Artificial Intelligence, vol.122, issue.1, pp.71-109, 2000.

A. Gopalan and S. Mannor, Thompson sampling for learning parameterized markov decision processes, COLT, volume 40 of JMLR Workshop and Conference Proceedings, pp.861-898, 2015.

C. M. Grinstead and J. L. Snell, Introduction to Probability, vol.21, p.164, 2003.

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, vol.11, p.207, 2010.

C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, Is Q-learning provably efficient? CoRR, vol.120, p.121, 2018.

N. K. Jong, T. Hester, and P. Stone, The utility of temporal abstraction in reinforcement learning, The Seventh International Joint Conference on Autonomous Agents and Multiagent Systems, p.158, 2008.

S. Kakade, M. Wang, and L. F. Yang, Variance Reduction Methods for Sublinear Reinforcement Learning, vol.79, p.86, 2018.

E. Kaufmann, O. Cappé, and A. Garivier, On bayesian upper confidence bounds for bandit problems, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, vol.22, pp.592-600, 2012.
URL : https://hal.archives-ouvertes.fr/hal-02286440

S. J. Kirkland, M. Neumann, and N. Sze, On optimal condition numbers for markov chains, Numerische Mathematik, vol.110, issue.4, pp.521-537, 2008.

T. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Adv. Appl. Math, vol.6, issue.1, pp.4-22, 1985.

K. Lakshmanan, R. Ortner, and D. Ryabko, Improved regret bounds for undiscounted continuous reinforcement learning, Proceedings of the 32nd International Conference on Machine Learning, vol.37, pp.524-532, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01165966

T. Lattimore and M. Hutter, Pac bounds for discounted mdps, Proc. 23rd International Conf. on Algorithmic Learning Theory (ALT'12), vol.7568, p.79, 2012.

T. Lattimore and M. Hutter, Near-optimal pac bounds for discounted mdps, Theoretical Computer Science, vol.558, p.86, 2014.

T. Lattimore and C. Szepesvári, Bandit algorithms. Pre-publication version, vol.54, p.59, 2018.

K. Y. Levy and N. Shimkin, Unified inter and intra options learning using policy gradient methods, Lecture Notes in Computer Science, vol.7188, p.157, 2011.

M. E. Lewis, H. Ayhan, and R. D. Foley, Bias optimality in a queue with admission control, Probability in the Engineering and Informational Sciences, vol.13, p.38, 1999.

M. E. Lewis and M. L. Puterman, Bias Optimality, vol.31, issue.2, p.38, 2002.

O. Maillard, T. A. Mann, and S. Mannor, "How hard is my MDP?" The distribution-norm to the rescue, Advances in Neural Information Processing Systems, vol.27, p.79, 2014.

T. A. Mann, D. J. Mankowitz, and S. Mannor, Time-regularized interrupting options (TRIO), Proceedings of the 31th International Conference on Machine Learning, vol.32, pp.1350-1358, 2014.

T. A. Mann and S. Mannor, Scaling up approximate value iteration with options: Better policies with fewer iterations, Proceedings of the 31th International Conference on Machine Learning, vol.32, p.158, 2014.

J. Martin, S. N. Sasikumar, T. Everitt, and M. Hutter, Count-based exploration in feature space for reinforcement learning, 2017.

A. Maurer and M. Pontil, Empirical bernstein bounds and sample-variance penalization, 2009.

A. Mcgovern and A. G. Barto, Automatic discovery of subgoals in reinforcement learning using diverse density, Proceedings of the Eighteenth International Conference on Machine Learning, vol.157, pp.361-368, 2001.

I. Menache, S. Mannor, and N. Shimkin, Q-cut-dynamic discovery of sub-goals in reinforcement learning, Proceedings of the 13th European Conference on Machine Learning, pp.295-306, 2002.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness et al., Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, pp.529-533, 2015.

A. W. Moore, Efficient memory-based learning for robot control, vol.91, p.92, 1990.

R. Munos and A. Moore, Influence and variance of a Markov chain: Application to adaptive discretization in optimal control, vol.79, pp.355-362, 1999.

B. F. Nielsen, Lecture notes on phase-type distributions for stochastic processes, p.166, 2012.

J. Ok, A. Proutière, and D. Tranos, Exploration in structured reinforcement learning, NeurIPS, vol.39, p.47, 2018.

R. Ortner, Optimism in the face of uncertainty should be refutable. Minds and Machines, vol.18, p.141, 2008.

R. Ortner, Online regret bounds for markov decision processes with deterministic transitions, Theor. Comput. Sci, vol.411, pp.2684-2695, 2010.

R. Ortner, Some open problems for average reward mdps, European Workshop on Reinforcement Learning, 2016.

R. Ortner, Regret Bounds for Reinforcement Learning via Markov Chain Concentration, vol.48, p.87, 2018.

R. Ortner and D. Ryabko, Online regret bounds for undiscounted continuous reinforcement learning, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00765441

I. Osband and B. V. Roy, Posterior sampling for reinforcement learning without episodes, 2016.

I. Osband and B. V. Roy, Why is posterior sampling better than optimism for reinforcement learning, ICML, volume 70 of Proceedings of Machine Learning Research, p.87, 2017.

I. Osband, D. Russo, and B. V. Roy, (more) efficient reinforcement learning via posterior sampling, NIPS, vol.42, p.93, 2013.

I. Osband and B. Van Roy, On Lower Bounds for Regret in Reinforcement Learning. arXiv e-prints, 2016.

G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, Count-based exploration with neural density models, ICML, vol.70, pp.2721-2730, 2017.

Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain, Learning unknown markov decision processes: A thompson sampling approach, Advances in Neural Information Processing Systems, vol.30, p.93, 2017.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, vol.212, p.213, 1994.

J. Qian, R. Fruit, M. Pirotta, and A. Lazaric, Concentration inequalities for multinoulli random variables, 2018.

J. Qian, R. Fruit, M. Pirotta, and A. Lazaric, Exploration bonus for regret minimization in undiscounted discrete and continuous markov decision processes, vol.148, p.155, 2018.

M. Sairamesh and B. Ravindran, Options with exceptions, Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning, EWRL'11, vol.157, pp.165-176, 2012.

P. J. Schweitzer, On undiscounted markovian decision processes with compact action spaces, vol.19, p.161, 1985.

P. J. Schweitzer and A. Federgruen, Geometric convergence of value-iteration in multichain markov decision problems, Advances in Applied Probability, vol.11, issue.1, pp.188-217, 1979.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre et al., Mastering the game of Go with deep neural networks and tree search, Nature, vol.529, issue.7587, pp.484-489, 2016.

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai et al., Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang et al., Mastering the game of go without human knowledge, Nature, vol.550, pp.354-368, 2017.

J. Sorg and S. P. Singh, Linear Options, AAMAS, vol.157, pp.31-38, 2010.

M. Stolle and D. Precup, Learning options in reinforcement learning, SARA, vol.2371, p.157, 2002.

A. L. Strehl and M. L. Littman, An analysis of model-based interval estimation for markov decision processes, Journal of Computer and System Sciences, vol.74, issue.8, pp.1309-1331, 2008.

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Adaptive computation and machine learning, p.116, 2018.

R. S. Sutton, D. Precup, and S. Singh, Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence, vol.112, issue.1, p.197, 1999.

M. S. Talebi and O. Maillard, Variance-aware regret bounds for undiscounted reinforcement learning in mdps, ALT, vol.83, pp.770-805, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01737142

H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen et al., #exploration: A study of count-based exploration for deep reinforcement learning, NIPS, vol.120, pp.2750-2759, 2017.

C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, A deep hierarchical approach to lifelong learning in minecraft, 2016.

A. Tewari and P. L. Bartlett, Bounded parameter markov decision processes with average reward criterion, Learning Theory, p.35, 2007.

A. Tewari and P. L. Bartlett, Optimistic linear programming gives logarithmic regret for irreducible mdps, Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, p.43, 2007.

G. Theocharous, Z. Wen, Y. Abbasi-Yadkori, and N. Vlassis, Posterior sampling for large scale reinforcement learning, 2017.

W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, issue.3/4, pp.285-294, 1933.

A. Tossou, D. Basu, and C. Dimitrakakis, Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities. arXiv e-prints, vol.48, p.87, 2019.

J. von Neumann and O. Morgenstern, Theory of games and economic behavior, vol.13, p.22, 1947.

M. Wainwright, Basic tail and concentration bounds, Course on Mathematical Statistics, vol.2, p.177, 2015.

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger, Inequalities for the L1 deviation of the empirical distribution, vol.52, p.53, 2003.