P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, pp.235-256, 2002.

L. C. Baird and A. H. Klopf, Reinforcement learning with high-dimensional continuous actions, 1993.

A. G. Barto, S. J. Bradtke, and S. P. Singh, Learning to act using real-time dynamic programming, Artificial Intelligence, vol.72, issue.1-2, pp.81-138, 1995.
DOI : 10.1016/0004-3702(94)00011-O

R. E. Bellman, Dynamic Programming, 1957.

A. Bourki, G. Chaslot, M. Coulm, V. Danjean, H. Doghmen et al., A. Rimmel, F. Teytaud, O. Teytaud, P. Vayssière, and Z. Yu, Scalability and parallelization of Monte-Carlo tree search, Computers and Games, pp.48-58, 2010.

C. Boutilier, T. Dean, and S. Hanks, Decision-theoretic planning: Structural assumptions and computational leverage, Journal of Artificial Intelligence Research, vol.11, pp.1-94, 1999.

S. J. Bradtke and M. O. Duff, Reinforcement learning methods for continuous-time Markov decision problems, Advances in Neural Information Processing Systems, pp.393-400, 1994.

R. I. Brafman and M. Tennenholtz, R-max - a general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, vol.3, pp.213-231, 2001.

S. Bubeck and R. Munos, Open loop optimistic planning, Conference on Learning Theory, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00943119

S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári, Online optimization in X-armed bandits, NIPS, pp.201-208, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00329797

L. Buşoniu, R. Munos, B. De Schutter, and R. Babuška, Optimistic planning for sparsely stochastic systems, Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp.48-55, 2011.

T. Cazenave and N. Jouandeau, On the parallelization of UCT, CGW 2007, pp.93-101, 2007.

H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, An Adaptive Sampling Algorithm for Solving Markov Decision Processes, Operations Research, vol.53, issue.1, pp.126-139, 2005.
DOI : 10.1287/opre.1040.0145

G. Chaslot, M. H. Winands, and H. J. van den Herik, Parallel Monte-Carlo tree search, Computers and Games, pp.60-71, 2008.

P. Coquelin and R. Munos, Bandit algorithms for tree search, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00150207

R. Coulom, Reinforcement Learning Using Neural Networks with Applications to Motor Control, 2002.
URL : https://hal.archives-ouvertes.fr/tel-00003985

S. Davies, A. Y. Ng, and A. Moore, Applying online search techniques to continuous-state reinforcement learning, Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp.753-760, 1998.

D. Ernst, P. Geurts, and L. Wehenkel, Tree-based batch mode reinforcement learning, Journal of Machine Learning Research, vol.6, pp.503-556, 2005.

C. Gaskett, D. Wettergreen, and A. Zelinsky, Q-Learning in Continuous State and Action Spaces, Australian Joint Conference on Artificial Intelligence, pp.417-428, 1999.
DOI : 10.1007/3-540-46695-9_35

S. Gelly and Y. Wang, Exploration exploitation in Go: UCT for Monte-Carlo Go, Twentieth Annual Conference on Neural Information Processing Systems (NIPS), 2006.
URL : https://hal.archives-ouvertes.fr/hal-00115330

S. Gelly, J. Hoock, A. Rimmel, O. Teytaud, and Y. Kalemkarian, The parallelization of Monte-Carlo planning, ICINCO, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00287867

M. Ghavamzadeh and S. Mahadevan, Continuous-time hierarchical reinforcement learning, Proceedings of the Eighteenth International Conference on Machine Learning, pp.186-193, 2001.

R. Hafner and M. Riedmiller, Reinforcement learning in feedback control: challenges and benchmarks from technical process control, Machine Learning, pp.137-169, 2011.

P. E. Hart, N. J. Nilsson, and B. Raphael, A formal basis for the heuristic determination of minimum cost paths, IEEE Transactions on Systems Science and Cybernetics, pp.100-107, 1968.

R. A. Howard, Dynamic Programming and Markov Processes, 1960.

J. Hren and R. Munos, Optimistic Planning of Deterministic Systems, pp.151-164, 2008.
DOI : 10.1007/978-3-540-89722-4_12
URL : https://hal.archives-ouvertes.fr/hal-00830182

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, J. Mach. Learn. Res., vol.11, pp.1563-1600, 2010.

D. R. Jones, C. D. Perttunen, and B. E. Stuckman, Lipschitzian optimization without the Lipschitz constant, Journal of Optimization Theory and Applications, vol.79, issue.1, pp.157-181, 1993.
DOI : 10.1007/BF00941892

M. Kearns, Y. Mansour, and A. Y. Ng, A sparse sampling algorithm for near-optimal planning in large Markov decision processes.

L. Kocsis and C. Szepesvári, Bandit Based Monte-Carlo Planning, ECML, pp.282-293, 2006.
DOI : 10.1007/11871842_29
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

S. Koenig and M. Likhachev, Real-time adaptive A*, Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, AAMAS '06, pp.281-288, 2006.
DOI : 10.1145/1160633.1160682

S. Koenig, M. Likhachev, Y. Liu, and D. Furcy, Incremental heuristic search in AI, Artificial Intelligence Magazine, vol.25, issue.2, pp.99-112, 2004.

R. E. Korf, Depth-first iterative-deepening: An optimal admissible tree search, Artif. Intell., vol.27, issue.1, pp.97-109, 1985.

T. L. Lai and H. E. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, pp.4-22, 1985.

A. Lazaric, M. Restelli, and A. Bonarini, Reinforcement learning in continuous action spaces through sequential Monte Carlo methods, NIPS, p.141, 2007.

M. Likhachev, G. Gordon, and S. Thrun, ARA*: Anytime A* with provable bounds on sub-optimality, Advances in Neural Information Processing Systems, 2004.

F. Maes, L. Wehenkel, and D. Ernst, Optimized Look-ahead Tree Search Policies, European Workshop on Reinforcement Learning (EWRL'9), 2011.
DOI : 10.1007/978-3-642-29946-9_20
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

C. R. Mansley, A. Weinstein, and M. L. Littman, Sample-based planning for continuous action Markov decision processes, ICAPS, 2011.

N. Meuleau and P. Bourgine, Exploration of multi-state environments: Local measures and back-propagation of uncertainty, Machine Learning, pp.117-154, 1999.

J. del R. Millán, D. Posenato, and E. Dedieu, Continuous-action Q-learning, Mach. Learn., vol.49, issue.2-3, pp.247-265, 2002.

R. Munos, Optimistic optimization of deterministic functions without the knowledge of its smoothness, Advances in Neural Information Processing Systems, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00830143

G. Neumann, M. Pfeiffer, and W. Maass, Efficient continuous-time reinforcement learning with adaptive state graphs, Proceedings of the 18th European conference on Machine Learning, ECML '07, pp.250-261, 2007.

J. Pazis and M. G. Lagoudakis, Binary action search for learning continuous-action control policies, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, p.100, 2009.
DOI : 10.1145/1553374.1553476

J. Pazis and M. G. Lagoudakis, Learning continuous-action control policies, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp.169-176, 2009.
DOI : 10.1109/ADPRL.2009.4927541

L. Péret and F. Garcia, On-line search for solving Markov decision processes via heuristic sampling, ECAI, pp.530-534, 2004.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

B. Sallans and G. E. Hinton, Reinforcement learning with factored states and actions, Journal of Machine Learning Research, vol.5, pp.1063-1088, 2004.

A. Stentz, Optimal and efficient path planning for partially-known environments, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '94), pp.3310-3317, 1994.

A. Stentz, The focussed D* algorithm for real-time replanning, Proceedings of the International Joint Conference on Artificial Intelligence, pp.1652-1659, 1995.

R. S. Sutton, Dyna, an integrated architecture for learning, planning, and reacting, ACM SIGART Bulletin, vol.2, issue.4, pp.160-163, 1991.
DOI : 10.1145/122344.122377

I. Szita and A. Lörincz, The many faces of optimism, Proceedings of the 25th international conference on Machine learning, ICML '08, pp.1048-1055, 2008.
DOI : 10.1145/1390156.1390288

G. Tesauro and G. R. Galperin, On-line policy improvement using Monte-Carlo search, Neural Information Processing Systems, pp.1068-1074, 1996.

C. Watkins, Learning from Delayed Rewards, 1989.