, to finally understand whether it is possible to scale with sp(h*) (at least in communicating MDPs) instead of the diameter D ≥ sp(h*) (the flaw in the analysis of REGAL). In the next section, we will show that achieving a regret scaling with sp(h*) instead of D is at least possible when the value sp(h*) is known and given as input to the learning algorithm.

5.8 SCAL*: SCAL with tighter optimism
Bayesian optimal control of smoothly parameterized systems, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, p.93, 2015. ,
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds, NIPS, vol.42, p.87, 2017. ,
Tuning bandit algorithms in stochastic environments, Algorithmic Learning Theory, p.52, 2007. ,
URL : https://hal.archives-ouvertes.fr/inria-00203487
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Theor. Comput. Sci, vol.410, issue.19, pp.1876-1902, 2009. ,
URL : https://hal.archives-ouvertes.fr/hal-00711069
Gambling in a rigged casino: The adversarial multi-armed bandit problem, Proceedings of IEEE 36th Annual Foundations of Computer Science, vol.92, pp.322-331, 1995. ,
Logarithmic online regret bounds for undiscounted reinforcement learning, Advances in Neural Information Processing Systems, vol.19, p.43, 2007. ,
Minimax regret bounds for reinforcement learning, Proceedings of the 34th International Conference on Machine Learning, vol.70, p.121, 2017. ,
The option-critic architecture, NIPS'15 Deep Reinforcement Learning Workshop, p.157, 2015. ,
REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs, UAI, vol.17, p.119, 2009. ,
Unifying count-based exploration and intrinsic motivation, NIPS, vol.120, pp.1471-1479, 2016. ,
The theory of dynamic programming, Bull. Amer. Math. Soc, vol.60, issue.6, p.17, 1954. ,
Dynamic programming and optimal control, vol.34, p.127, 1995. ,
Dynamic Programming and Optimal Control, Athena Scientific, vol.II, 2007. ,
Concentration Inequalities: A Nonasymptotic Theory of Independence, OUP Oxford, vol.52, p.53, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00794821
Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, vol.21, p.184, 1999. ,
OpenAI Gym. CoRR, 2016.
PAC-inspired Option Discovery in Lifelong Reinforcement Learning, Proceedings of the 31st International Conference on Machine Learning, vol.32, p.217, 2014. ,
Optimal adaptive policies for markov decision processes, Mathematics of Operations Research, vol.22, issue.1, p.47, 1997. ,
Automatic construction of temporally extended actions for mdps using bisimulation metrics, Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning, EWRL'11, vol.157, pp.140-152, 2012. ,
Prediction, Learning, and Games, Cambridge University Press, 2006.
Using relative novelty to identify useful temporal abstractions in reinforcement learning, Proceedings of the Twenty-first International Conference on Machine Learning, 2004. ,
Maximum expected hitting cost of a markov decision process and informativeness of rewards, Advances in Neural Information Processing Systems, vol.32, p.64, 2019. ,
Sample complexity of episodic fixed-horizon reinforcement learning, Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 15, vol.53, p.70, 2015. ,
Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning, NIPS, vol.59, pp.5717-5727, 2017. ,
Hierarchical reinforcement learning with the MAXQ value function decomposition, Journal of Artificial Intelligence Research, vol.13, pp.227-303, 2000.
Optimism in reinforcement learning and Kullback-Leibler divergence, Allerton Conference on Communication, Control, and Computing, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00476116
On tail probabilities for martingales, Ann. Probab, vol.3, issue.1, pp.100-118, 1975. ,
Exploration-Exploitation in MDPs with Options, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol.54, p.197, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01493567
Near optimal exploration-exploitation in non-communicating markov decision processes, Advances in Neural Information Processing Systems, vol.31, p.93, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01941220
Regret minimization in MDPs with options without prior knowledge, NIPS, pp.3169-3179, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01649082
Efficient bias-span-constrained exploration-exploitation in reinforcement learning, Proceedings of the 35th International Conference on Machine Learning, vol.121, p.213, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01941206
Explore first, exploit next: The true shape of regret in bandit problems, Mathematics of Operations Research, p.41, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01276324
Bounded-parameter markov decision processes, Artificial Intelligence, vol.122, issue.1, pp.71-109, 2000. ,
Thompson sampling for learning parameterized markov decision processes, COLT, volume 40 of JMLR Workshop and Conference Proceedings, pp.861-898, 2015. ,
Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, vol.11, p.207, 2010. ,
Is Q-learning provably efficient? CoRR, 2018.
The utility of temporal abstraction in reinforcement learning, The Seventh International Joint Conference on Autonomous Agents and Multiagent Systems, p.158, 2008. ,
Variance reduction methods for sublinear reinforcement learning, 2018.
On bayesian upper confidence bounds for bandit problems, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, vol.22, pp.592-600, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-02286440
On optimal condition numbers for markov chains, Numerische Mathematik, vol.110, issue.4, pp.521-537, 2008. ,
Asymptotically efficient adaptive allocation rules, Adv. Appl. Math, vol.6, issue.1, pp.4-22, 1985. ,
Improved regret bounds for undiscounted continuous reinforcement learning, Proceedings of the 32nd International Conference on Machine Learning, vol.37, pp.524-532, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01165966
PAC bounds for discounted MDPs, Proc. 23rd International Conf. on Algorithmic Learning Theory (ALT'12), vol.7568, 2012.
Near-optimal pac bounds for discounted mdps, Theoretical Computer Science, vol.558, p.86, 2014. ,
Bandit algorithms. Pre-publication version, vol.54, p.59, 2018. ,
Unified inter and intra options learning using policy gradient methods, Lecture Notes in Computer Science, vol.7188, p.157, 2011. ,
Bias optimality in a queue with admission control, Probability in the Engineering and Informational Sciences, vol.13, p.38, 1999. ,
Bias optimality, vol.31, issue.2, 2002.
"How hard is my MDP?" The distribution-norm to the rescue, Advances in Neural Information Processing Systems, vol.27, 2014.
Time-regularized interrupting options (TRIO), Proceedings of the 31th International Conference on Machine Learning, vol.32, pp.1350-1358, 2014. ,
Scaling up approximate value iteration with options: Better policies with fewer iterations, Proceedings of the 31th International Conference on Machine Learning, vol.32, p.158, 2014. ,
Count-based exploration in feature space for reinforcement learning, 2017. ,
Empirical bernstein bounds and sample-variance penalization, 2009. ,
Automatic discovery of subgoals in reinforcement learning using diverse density, Proceedings of the Eighteenth International Conference on Machine Learning, vol.157, pp.361-368, 2001. ,
Q-cut-dynamic discovery of sub-goals in reinforcement learning, Proceedings of the 13th European Conference on Machine Learning, pp.295-306, 2002. ,
Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, pp.529-533, 2015.
Efficient memory-based learning for robot control, vol.91, p.92, 1990. ,
Influence and variance of a Markov chain: Application to adaptive discretization in optimal control, Proceedings of the 38th IEEE Conference on Decision and Control, pp.355-362, 1999.
Lecture notes on phase-type distributions for stochastic processes, p.166, 2012. ,
Exploration in structured reinforcement learning, NeurIPS, vol.39, p.47, 2018. ,
Optimism in the face of uncertainty should be refutable. Minds and Machines, vol.18, p.141, 2008. ,
Online regret bounds for markov decision processes with deterministic transitions, Theor. Comput. Sci, vol.411, pp.2684-2695, 2010. ,
Some open problems for average reward mdps, European Workshop on Reinforcement Learning, 2016. ,
Regret Bounds for Reinforcement Learning via Markov Chain Concentration, vol.48, p.87, 2018. ,
Online regret bounds for undiscounted continuous reinforcement learning, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00765441
Posterior sampling for reinforcement learning without episodes, 2016. ,
Why is posterior sampling better than optimism for reinforcement learning, ICML, volume 70 of Proceedings of Machine Learning Research, p.87, 2017. ,
(more) efficient reinforcement learning via posterior sampling, NIPS, vol.42, p.93, 2013. ,
On Lower Bounds for Regret in Reinforcement Learning, arXiv e-prints, 2016.
Count-based exploration with neural density models, ICML, vol.70, pp.2721-2730, 2017. ,
Learning unknown Markov decision processes: A Thompson sampling approach, Advances in Neural Information Processing Systems, vol.30, 2017.
Markov Decision Processes: Discrete Stochastic Dynamic Programming, vol.212, p.213, 1994. ,
Concentration inequalities for multinoulli random variables, 2018. ,
Exploration bonus for regret minimization in undiscounted discrete and continuous markov decision processes, vol.148, p.155, 2018. ,
Options with exceptions, Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning, EWRL'11, pp.165-176, 2012.
On undiscounted markovian decision processes with compact action spaces, vol.19, p.161, 1985. ,
Geometric convergence of value-iteration in multichain markov decision problems, Advances in Applied Probability, vol.11, issue.1, pp.188-217, 1979. ,
Mastering the game of Go with deep neural networks and tree search, Nature, vol.529, issue.7587, pp.484-489, 2016. ,
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017. ,
Mastering the game of Go without human knowledge, Nature, vol.550, pp.354-359, 2017.
Linear Options, AAMAS, vol.157, pp.31-38, 2010. ,
Learning options in reinforcement learning, SARA, vol.2371, p.157, 2002. ,
An analysis of model-based interval estimation for Markov decision processes, Journal of Computer and System Sciences, vol.74, issue.8, pp.1309-1331, 2008.
Reinforcement learning: An introduction. Adaptive computation and machine learning, p.116, 2018. ,
Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence, vol.112, issue.1, p.197, 1999. ,
Variance-aware regret bounds for undiscounted reinforcement learning in mdps, ALT, vol.83, pp.770-805, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01737142
#exploration: A study of count-based exploration for deep reinforcement learning, NIPS, vol.120, pp.2750-2759, 2017. ,
A deep hierarchical approach to lifelong learning in minecraft, 2016. ,
Bounded parameter markov decision processes with average reward criterion, Learning Theory, p.35, 2007. ,
Optimistic linear programming gives logarithmic regret for irreducible mdps, Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, p.43, 2007. ,
Posterior sampling for large scale reinforcement learning, 2017. ,
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, issue.3-4, pp.285-294, 1933.
Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities, arXiv e-prints, 2019.
Theory of games and economic behavior, vol.13, p.22, 1947. ,
Basic tail and concentration bounds, Course on Mathematical Statistics, vol.2, p.177, 2015. ,
Inequalities for the ℓ1 deviation of the empirical distribution, 2003.