B. Arneson, R. Hayward, and P. Henderson, MoHex wins Hex tournament, ICGA Journal, pp.114-116, 2009.
DOI : 10.3233/icg-2009-32218

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

K. J. Åström, Optimal control of Markov processes with incomplete state information, Journal of Mathematical Analysis and Applications, vol.10, issue.1, pp.174-205, 1965.
DOI : 10.1016/0022-247X(65)90154-X

J. Audibert and S. Bubeck, Minimax policies for adversarial and stochastic bandits, proceedings of the Annual Conference on Learning Theory (COLT), 2009.
URL : https://hal.archives-ouvertes.fr/hal-00834882

J. Audibert, R. Munos, and C. Szepesvari, Use of variance estimation in the multi-armed bandit problem, NIPS 2006 Workshop on On-line Trading of Exploration and Exploitation, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00203496

J. Audibert and S. Bubeck, Best Arm Identification in Multi-Armed Bandits, COLT 2010 - Proceedings, 13 p., 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654404

P. Audouard, G. Chaslot, J. Hoock, J. Perez, A. Rimmel et al., Grid Coevolution for Adaptive Simulations: Application to the Building of Opening Books in the Game of Go, Proceedings of EvoGames, pp.323-332, 2009.
DOI : 10.1007/978-3-642-01129-0_36

URL : https://hal.archives-ouvertes.fr/inria-00369783

P. Auer, Using confidence bounds for exploitation-exploration trade-offs, The Journal of Machine Learning Research, vol.3, pp.397-422, 2003.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, Gambling in a rigged casino: The adversarial multi-armed bandit problem, Proceedings of IEEE 36th Annual Foundations of Computer Science, pp.322-331, 1995.
DOI : 10.1109/SFCS.1995.492488

P. Auer, Using confidence bounds for exploitation-exploration trade-offs, Journal of Machine Learning Research, vol.3, 2002.

P. Auer, R. Ortner, and C. Szepesvári, Improved Rates for the Stochastic Continuum-Armed Bandit Problem, Lecture Notes in Computer Science, vol.4539, pp.454-468, 2007.
DOI : 10.1007/978-3-540-72927-3_33

D. Auger, A. Couetoux, and O. Teytaud, Continuous Upper Confidence Trees with Polynomial Exploration - Consistency, ECML/PKDD 2013, 2013.
DOI : 10.1007/978-3-642-40988-2_13

URL : https://hal.archives-ouvertes.fr/hal-00835352

B. W. Ballard, The *-minimax search procedure for trees containing chance nodes, Artificial Intelligence, vol.21, issue.3, pp.327-350, 1983.

D. Barash, A genetic search in policy space for solving Markov decision processes, AAAI Spring Symposium on Search Techniques for Problem Solving under Uncertainty and Incomplete Information, 1999.

A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, vol.13, issue.5, pp.834-846, 1983.
DOI : 10.1109/tsmc.1983.6313077

R. Bellman, Dynamic Programming, 1957.

J. F. Benders, Partitioning procedures for solving mixed-variables programming problems, Numerische Mathematik, vol.4, issue.1, pp.238-252, 1962.
DOI : 10.1007/BF01386316

Y. Bengio, Using a Financial Training Criterion Rather than a Prediction Criterion, CIRANO Working Papers 98s-21, 1998.
DOI : 10.1142/S0129065797000422

D. P. Bertsekas, Dynamic Programming and Optimal Control, vols I and II, Athena Scientific, 1995.

D. Bertsimas, E. Litvinov, X. A. Sun, J. Zhao, and T. Zheng, Adaptive Robust Optimization for the Security Constrained Unit Commitment Problem, IEEE Transactions on Power Systems, vol.28, issue.1, pp.52-63, 2013.
DOI : 10.1109/TPWRS.2012.2205021

A. Bourki, M. Coulm, P. Rolet, O. Teytaud, and P. Vayssière, Parameter Tuning by Simple Regret Algorithms and Multiple Simultaneous Hypothesis Testing, ICINCO 2010, p.10, 2010.

B. Bouzy, Move-Pruning Techniques for Monte-Carlo Go, Advances in Computer Games 11, 2005.
DOI : 10.1007/11922155_8

J. A. Boyan, Technical update: Least-squares temporal difference learning, Machine Learning, pp.233-246, 2002.

S. J. Bradtke and A. G. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, pp.33-57, 1996.

S. Bubeck, R. Munos, and G. Stoltz, Pure exploration in finitely-armed and continuous-armed bandits, Theoretical Computer Science, vol.412, issue.19, pp.1832-1852, 2011.
DOI : 10.1016/j.tcs.2010.12.059

URL : https://hal.archives-ouvertes.fr/hal-00609550

S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári, Online optimization in X-armed bandits, NIPS, pp.201-208, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00329797

S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari, X-Armed Bandits, Journal of Machine Learning Research, vol.12, pp.1655-1695, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00450235

O. Buffet, C. Lee, W. Lin, and O. Teytaud, Optimistic Heuristics for MineSweeper, ICS - International Computer Symposium 2012, Smart Innovation, Systems and Technologies, pp.199-207, 2012.
DOI : 10.1007/978-3-642-35452-6_22

URL : https://hal.archives-ouvertes.fr/hal-00750577

L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement learning and dynamic programming using function approximators, CRC Press, vol.39, 2010.
DOI : 10.1201/9781439821091

URL : http://orbi.ulg.ac.be/jspui/handle/2268/27963

T. Cazenave and J. Borsboom, Golois wins phantom go tournament, ICGA Journal, vol.30, issue.3, pp.165-166, 2007.
DOI : 10.3233/icg-2009-32110

T. Cazenave and N. Jouandeau, On the parallelization of UCT, Proceedings of CGW07, pp.93-101, 2007.

T. Cazenave, A Phantom-Go Program, ACG, pp.120-125, 2006.
DOI : 10.1007/11922155_9

G. M. Chaslot, M. H. Winands, J. W. Uiterwijk, H. J. van den Herik, and B. Bouzy, Progressive Strategies for Monte-Carlo Tree Search, Proceedings of the 10th Joint Conference on Information Sciences, pp.655-661, 2007.

G. M. Chaslot, M. H. Winands, and H. J. van den Herik, Parallel Monte-Carlo Tree Search, Proceedings of the Conference on Computers and Games, 2008.
DOI : 10.1007/978-3-540-87608-3_6

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, Monte-Carlo tree search: A new framework for game AI, 2008.

B. E. Childs, J. H. Brodeur, and L. Kocsis, Transpositions and move groups in Monte Carlo tree search, 2008.

C. Chou, P. Chou, C. Lee, D. Lupien Saint-Pierre, O. Teytaud et al., Strategic Choices: Small Budgets and Simple Regret, 2012 Conference on Technologies and Applications of Artificial Intelligence, 2012.
DOI : 10.1109/TAAI.2012.35

URL : https://hal.archives-ouvertes.fr/hal-00753145

A. Couetoux, H. Doghmen, and O. Teytaud, Improving the Exploration in Upper Confidence Trees, Learning and Intelligent OptimizatioN Conference LION 6, 2012.
DOI : 10.1007/978-3-642-34413-8_29

URL : https://hal.archives-ouvertes.fr/hal-00745208

A. Couetoux, J. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, Continuous Upper Confidence Trees, LION'11: Proceedings of the 5th International Conference on Learning and Intelligent OptimizatioN, 2011.

URL : https://hal.archives-ouvertes.fr/hal-00835352

A. Couetoux, M. Milone, M. Brendel, H. Doghmen, M. Sebag et al., Continuous Rapid Action Value Estimates, The 3rd Asian Conference on Machine Learning (ACML2011) Conference Proceedings, pp.19-31, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00642459

A. Couetoux, M. Milone, and O. Teytaud, Consistent Belief State Estimation, with Application to Mines, 2011 International Conference on Technologies and Applications of Artificial Intelligence, 2011.
DOI : 10.1109/TAAI.2011.55

URL : https://hal.archives-ouvertes.fr/hal-00712388

A. Couetoux, O. Teytaud, and H. Doghmen, Learning a move-generator for upper confidence trees, Advances in Intelligent Systems and Applications, Smart Innovation, Systems and Technologies, pp.209-218, 2013.

R. Coulom, Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, Proceedings of the 5th International Conference on Computers and Games, 2006.
DOI : 10.1007/978-3-540-75538-8_7

URL : https://hal.archives-ouvertes.fr/inria-00116992

R. Coulom, Computing Elo ratings of move patterns in the game of Go, Computer Games Workshop, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00149859

R. Coulom, Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, Proceedings of the 5th international conference on Computers and games, CG'06, pp.72-83, 2007.
DOI : 10.1007/978-3-540-75538-8_7

URL : https://hal.archives-ouvertes.fr/inria-00116992

C. Dimitrakakis and M. G. Lagoudakis, Rollout sampling approximate policy iteration, Machine Learning, pp.157-171, 2008.
DOI : 10.1007/978-3-540-87479-9_6

URL : http://arxiv.org/abs/0805.2027

P. D. Drake and S. Uurtamo, Move Ordering vs Heavy Playouts: Where Should Heuristics be Applied in Monte Carlo Go, Proc. 3rd North Amer

D. Ernst, G.-B. Stan, J. Gonçalves, and L. Wehenkel, Clinical data based optimal STI strategies for HIV: a reinforcement learning approach, Proceedings of the 45th IEEE Conference on Decision and Control, pp.65-72, 2006.
DOI : 10.1109/CDC.2006.377527

URL : https://hal.archives-ouvertes.fr/hal-00121732

H. Finnsson and Y. Björnsson, Simulation-based approach to general game playing, AAAI'08: Proceedings of the 23rd national conference on Artificial intelligence, pp.259-264, 2008.

E. Gallestey, A. Stothert, M. Antoine, and S. Morton, Model predictive control and the optimization of power plant load while considering lifetime consumption, IEEE Transactions on, vol.17, issue.1, pp.186-191, 2002.

S. Gelly, J. B. Hoock, A. Rimmel, O. Teytaud, and Y. Kalemkarian, The parallelization of Monte-Carlo planning, Proceedings of the International Conference on Informatics in Control, Automation and Robotics, pp.198-203, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00287867

S. Gelly and D. Silver, Combining online and offline knowledge in UCT, Proceedings of the 24th international conference on Machine learning, ICML '07, pp.273-280, 2007.
DOI : 10.1145/1273496.1273531

URL : https://hal.archives-ouvertes.fr/inria-00164003

S. Gelly and D. Silver, Monte-Carlo tree search and rapid action value estimation in computer Go, Artificial Intelligence, vol.175, issue.11, pp.1856-1875, 2011.
DOI : 10.1016/j.artint.2011.03.007

S. Gelly and Y. Wang, Exploration exploitation in Go: UCT for Monte-Carlo Go, NIPS-2006, Online trading between exploration and exploitation Workshop, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00115330

S. Gelly, Y. Wang, R. Munos, and O. Teytaud, Modification of UCT with patterns in Monte-Carlo Go, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00117266

A. P. George and W. B. Powell, Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning, vol.65, issue.1, pp.167-198, 2006.

A. Gerevini, A. E. Howe, A. Cesta, and I. Refanidis, Proceedings of the 19th International Conference on Automated Planning and Scheduling, ICAPS 2009, 2009.

W. R. Gilks, Markov Chain Monte Carlo in Practice, 1995.

F. Gomez, J. Schmidhuber, and R. Miikkulainen, Efficient nonlinear control through neuroevolution, Proceedings of the European Conference on Machine Learning, pp.654-662, 2006.

M. D. Grigoriadis and L. G. Khachiyan, A sublinear-time randomized approximation algorithm for matrix games, Operations Research Letters, vol.18, issue.2, pp.53-58, 1995.

I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, A survey of actor-critic reinforcement learning: Standard and natural policy gradients, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol.42, issue.6, pp.1291-1307, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00756747

L. Grüne, Error estimation and adaptive discretization for the discrete stochastic Hamilton-Jacobi-Bellman equation, Numerische Mathematik, vol.1, issue.1, pp.85-112, 2004.
DOI : 10.1007/s00211-004-0555-4

C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag, Multi-armed bandits, dynamic environments and meta-bandits, NIPS Workshop on Online Trading of Exploration and Exploitation, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00113668

T. G. Hauk, Search in trees with chance nodes, 2004.

N. Hay and S. J. Russell, Metareasoning for monte carlo tree search, 2011.

V. I. Istratescu, Fixed Point Theory: An Introduction, 2002.
DOI : 10.1007/978-94-009-8177-5

G. M. J. Chaslot, J. Hoock, J. Perez, A. Rimmel, O. Teytaud et al., Meta Monte-Carlo Tree Search for Automatic Opening Book Generation, pp.7-12

H. Kato and I. Takeuchi, Parallel Monte-Carlo Tree Search with Simulation Servers, 2010 International Conference on Technologies and Applications of Artificial Intelligence, 2010.
DOI : 10.1109/TAAI.2010.83

R. D. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, NIPS, 2004.

J. Kloetzer, Monte-Carlo Opening Books for Amazons, Computers and Games, pp.124-135, 2011.
DOI : 10.1007/978-3-642-17928-0_12

L. Kocsis and C. Szepesvari, Bandit Based Monte-Carlo Planning, 15th European Conference on Machine Learning (ECML), pp.282-293, 2006.
DOI : 10.1007/11871842_29

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=

L. Kocsis, C. Szepesvári, and J. Willemson, Improved Monte-Carlo search, working paper, 2006.

J. Z. Kolter and A. Y. Ng, Regularization and feature selection in least-squares temporal difference learning, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.521-528, 2009.

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, issue.1, pp.4-22, 1985.
DOI : 10.1016/0196-8858(85)90002-8

M. Lanctot, A. Saffidine, J. Veness, C. Archibald, and M. H. Winands, Monte Carlo *-minimax search, CoRR, abs/1304, 2013.

P. Hong, The Computational Intelligence of MoGo Revealed in Taiwan's Computer Go Tournaments, IEEE Transactions on Computational Intelligence and AI in Games, 2009.

M. Legendre, K. Hollard, O. Buffet, and A. Dutech, Minesweeper: Where to probe?
URL : https://hal.archives-ouvertes.fr/hal-00723550

F. Maes, D. Ernst, and L. Wehenkel, Meta-learning of exploration/exploitation strategies: The multi-armed bandit case, CoRR, abs/1207, 2012.

S. Mannor, R. Rubinstein, and Y. Gat, The cross entropy method for fast policy search, International Conference on Machine Learning, pp.512-519, 2003.

C. R. Mansley, A. Weinstein, and M. L. Littman, Sample-based planning for continuous action Markov decision processes, ICAPS. AAAI, 2011.

P. Marbach and J. N. Tsitsiklis, Approximate gradient methods in policy-space optimization of Markov reward processes, Discrete Event Dynamic Systems, pp.111-148, 2003.

E. Martinot, C. Dienst, and L. Weiliang, Renewable Energy Futures: Targets, Scenarios, and Pathways, Annual Review of Environment and Resources, vol.32, issue.1, pp.205-239, 2007.
DOI : 10.1146/annurev.energy.32.080106.133554

URL : http://epub.wupperinst.org/frontdoor/index/index/docId/2781

P. Massé, Les Réserves et la Régulation de l'Avenir dans la vie Economique, 1946.

S. Meyer-Nieberg and H. Beyer, Self-Adaptation in Evolutionary Algorithms, Parameter Setting in Evolutionary Algorithms, 2007.
DOI : 10.1007/978-3-540-69432-8_3

D. Michie, Game-playing and game-learning automata, Advances in Programming and Non-numerical Computation, pp.183-196, 1966.

R. Munos, Policy gradient in continuous time, Journal of Machine Learning Research, vol.7, pp.771-791, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00117152

R. Munos and A. Moore, Variable resolution discretization in optimal control, Machine Learning, vol.49, issue.2/3, pp.291-323, 2002.
DOI : 10.1023/A:1017992615625

A. Nedic and D. P. Bertsekas, Least squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems, pp.79-110, 2003.

M. V. Pereira and L. M. Pinto, Stochastic Optimization of a Multireservoir Hydroelectric System: A Decomposition Approach, Water Resources Research, vol.21, issue.6, pp.779-792, 1985.
DOI : 10.1029/WR021i006p00779

M. V. Pereira and L. M. Pinto, Multi-stage stochastic optimization applied to energy planning, Mathematical Programming, vol.52, issue.1-3, pp.359-375, 1991.
DOI : 10.1007/BF01582895

J. Peters and S. Schaal, Natural actor-critic, Neurocomputing, pp.71-78, 2008.
DOI : 10.1016/j.neucom.2007.11.026

W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics), 2007.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

S. J. Qin and T. A. Badgwell, A survey of industrial model predictive control technology, Control Engineering Practice, vol.11, issue.7, 2003.
DOI : 10.1016/S0967-0661(02)00186-7

R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability, 1995.

M. Riedmiller, Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
DOI : 10.1109/ADPRL.2007.368196

A. Rimmel and F. Teytaud, Multiple Overlapping Tiles for Contextual Monte Carlo Tree Search, Evostar
DOI : 10.1007/978-3-642-12239-2_21

URL : https://hal.archives-ouvertes.fr/inria-00456422

A. Rimmel, F. Teytaud, and O. Teytaud, Biasing Monte-Carlo Simulations through RAVE Values, The International Conference on Computers and Games, 2010.
DOI : 10.1007/978-3-642-17928-0_6

URL : https://hal.archives-ouvertes.fr/inria-00485555

P. Rolet, M. Sebag, and O. Teytaud, Optimal active learning through billiards and upper confidence trees in continuous domains, Proceedings of the ECML conference, 2009.

P. Rolet, M. Sebag, and O. Teytaud, Optimal robust expensive optimization is tractable, Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO '09, 2009.
DOI : 10.1145/1569901.1570255

URL : https://hal.archives-ouvertes.fr/inria-00374910

G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, 1994.

F. Schadd, Monte-Carlo Search Techniques in the Modern Board Game Thurn and Taxis, 2010.

M. Sebag and O. Teytaud, Combining Myopic Optimization and Tree Search: Application to MineSweeper, LION6, Learning and Intelligent Optimization Proc. LION 6, pp.222-236, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00712417

D. Silver and G. Tesauro, Monte-Carlo simulation balancing, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.945-952, 2009.
DOI : 10.1145/1553374.1553495

S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, Convergence results for single-step on-policy reinforcement-learning algorithms, Machine Learning, pp.287-308, 2000.

T. G. Siqueira, M. Zambelli, M. Cicogna, M. Andrade, and S. Soares, Stochastic Dynamic Programming for Long Term Hydrothermal Scheduling Considering Different Streamflow Models, 2006 International Conference on Probabilistic Methods Applied to Power Systems, pp.1-6, 2006.
DOI : 10.1109/PMAPS.2006.360203

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, IEEE Transactions on Neural Networks, vol.9, issue.5, 1998.
DOI : 10.1109/TNN.1998.712192

C. Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol.4, issue.1, pp.1-103, 2010.
DOI : 10.2200/S00268ED1V01Y201005AIM009

F. Teytaud and O. Teytaud, Creating an Upper-Confidence-Tree Program for Havannah, ACG 12, 2009.
DOI : 10.1007/978-3-642-12993-3_7

URL : https://hal.archives-ouvertes.fr/inria-00380539

F. Teytaud and O. Teytaud, On the huge benefit of decisive moves in Monte-Carlo Tree Search algorithms, Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, 2010.
DOI : 10.1109/ITW.2010.5593334

URL : https://hal.archives-ouvertes.fr/inria-00495078

O. Teytaud, Including ontologies in Monte-Carlo tree search and applications - an open source platform, 2008.

B. Tuffin, On the use of low discrepancy sequences in Monte Carlo methods, Monte Carlo Methods and Applications, vol.2, issue.4, 1996.
DOI : 10.1515/mcma.1996.2.4.295

A. Waldock and B. Carse, Fuzzy Q-Learning with an adaptive representation, 2008 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), pp.720-725, 2008.
DOI : 10.1109/FUZZY.2008.4630449

Y. Wang, J. Audibert, and R. Munos, Algorithms for infinitely many-armed bandits, Advances in Neural Information Processing Systems, 2008.

Y. Wang and S. Gelly, Modifications of UCT and sequence-like simulations for Monte-Carlo Go, 2007 IEEE Symposium on Computational Intelligence and Games, pp.175-182, 2007.
DOI : 10.1109/CIG.2007.368095

A. Weinstein and M. L. Littman, Bandit-based planning and learning in continuous-action Markov decision processes, ICAPS. AAAI, 2012.

S. Whiteson and P. Stone, Evolutionary function approximation for reinforcement learning, Journal of Machine Learning Research, vol.7, pp.877-917, 2006.

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, pp.229-256, 1992.

S.-J. Yen, S.-Y. Chiu, and I.-C. Wu, MoDark wins Chinese dark chess tournament, ICGA Journal, vol.33, issue.4, pp.230-231, 2010.

H. Yu and D. P. Bertsekas, Basis function adaptation methods for cost approximation in MDP, to appear in the proceedings of the 2009 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2009.