Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Improved Algorithms for Linear Stochastic Bandits, Proceedings of the Advances in Neural Information Processing Systems 25, pp.2312-2320, 2011.

A. Agarwal, Selective sampling algorithms for cost-sensitive multiclass prediction, Proceedings of the Thirtieth International Conference on Machine Learning, 2013.

A. Antos, R. Munos, and C. Szepesvári, Fitted Q-iteration in continuous action-space MDPs, Proceedings of the Advances in Neural Information Processing Systems 21, pp.9-16, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00185311

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, vol.71, issue.1, pp.89-129, 2008.
DOI : 10.1007/s10994-007-5038-2

URL : https://hal.archives-ouvertes.fr/hal-00830201

A. Antos, V. Grover, and C. Szepesvári, Active learning in heteroscedastic noise, Theoretical Computer Science, vol.411, issue.29-30, pp.2712-2728, 2010.
DOI : 10.1016/j.tcs.2010.04.007

J. Audibert, R. Munos, and C. Szepesvári, Tuning Bandit Algorithms in Stochastic Environments, Proceedings of the Eighteenth International Conference on Algorithmic Learning Theory, pp.150-165, 2007.

URL : https://hal.archives-ouvertes.fr/inria-00203487

J. Audibert, S. Bubeck, and R. Munos, Best Arm Identification in Multi-Armed Bandits, Proceedings of the Twenty-Third Conference on Learning Theory, pp.41-53, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654404

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multi-armed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM Journal on Computing, vol.32, issue.1, pp.48-77, 2003.
DOI : 10.1137/S0097539701398375

L. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, Proceedings of the Twelfth International Conference on Machine Learning, pp.30-37, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

A. Barto, R. Sutton, and C. Anderson, Neuron-Like Elements that can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, vol.13, issue.5, pp.834-846, 1983.

R. Bellman, Dynamic Programming, Princeton University Press, 1957.

D. Benbouzid, R. Busa-Fekete, N. Casagrande, F. Collin, and B. Kégl, MultiBoost: A Multi-purpose Boosting Package, Journal of Machine Learning Research, vol.13, pp.549-553, 2012.

D. Bertsekas and S. Ioffe, Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, Technical Report LIDS-P-2349, MIT, 1996.

D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

S. Bradtke and A. Barto, Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning, vol.22, pp.33-57, 1996.

S. Bubeck, R. Munos, and G. Stoltz, Pure Exploration in Multi-armed Bandits Problems, Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pp.23-37, 2009.

H. Burgiel, How to Lose at Tetris, The Mathematical Gazette, vol.81, issue.491, pp.194-200, 1997.
DOI : 10.2307/3619195

Z. Cai, D. Zhang, and B. Nebel, Playing Tetris using bandit-based Monte-Carlo planning, AISB Symposium: AI and Games, 2011.

P. Canbolat and U. Rothblum, (Approximate) iterated successive approximations algorithm for sequential decision processes, Annals of Operations Research, pp.1-12, 2012.
DOI : 10.1007/s10479-012-1073-x

O. Cappé, A. Garivier, O. Maillard, R. Munos, and G. Stoltz, Kullback-Leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics, vol.41, issue.3, pp.1516-1541, 2013.
DOI : 10.1214/13-AOS1119SUPP

A. Carpentier, A. Lazaric, M. Ghavamzadeh, R. Munos, and P. Auer, Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits, Proceedings of the Twenty-Second International Conference on Algorithmic Learning Theory, pp.189-203, 2011.
DOI : 10.1007/978-3-642-24412-4_17

URL : https://hal.archives-ouvertes.fr/hal-00659696

C. Chang and C. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, vol.2, issue.3, pp.1-27, 2011.
DOI : 10.1145/1961189.1961199

O. Chapelle and L. Li, An empirical evaluation of Thompson sampling, Proceedings of the Advances in Neural Information Processing Systems 25, pp.2249-2257, 2011.

V. Dani, T. Hayes, and S. Kakade, The Price of Bandit Information for Online Optimization, Proceedings of the Advances in Neural Information Processing Systems 21, 2007.

E. Demaine, S. Hohenberger, and D. Liben-Nowell, Tetris is Hard, Even to Approximate, Proceedings of the Ninth International Computing and Combinatorics Conference, pp.351-363, 2003.
DOI : 10.1007/3-540-45071-8_36

K. Deng, J. Pineau, and S. Murphy, Active learning for personalizing treatment, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011.
DOI : 10.1109/ADPRL.2011.5967348

C. Dimitrakakis and M. Lagoudakis, Rollout sampling approximate policy iteration, Machine Learning, vol.72, issue.3, pp.157-171, 2008.
DOI : 10.1007/s10994-008-5069-3

D. Ernst, P. Geurts, and L. Wehenkel, Tree-Based Batch Mode Reinforcement Learning, Journal of Machine Learning Research, vol.6, pp.503-556, 2005.

E. Even-Dar, S. Mannor, and Y. Mansour, Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems, Journal of Machine Learning Research, vol.7, pp.1079-1105, 2006.

A. Farahmand, R. Munos, and C. Szepesvári, Error Propagation for Approximate Policy and Value Iteration, Proceedings of the Advances in Neural Information Processing Systems 24, pp.568-576, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00830154

A. Farahmand, D. Precup, and M. Ghavamzadeh, Generalized Classification-based Approximate Policy Iteration, Proceedings of the European Workshop on Reinforcement Learning (EWRL), pp.1-11, 2012.

V. Farias and B. Van Roy, Tetris: A Study of Randomized Constraint Sampling, 2006.
DOI : 10.1007/1-84628-095-8_6

A. Fern, S. Yoon, and R. Givan, Approximate policy iteration with a policy language bias, Proceedings of the Advances in Neural Information Processing Systems 18, 2004.

A. Fern, S. Yoon, and R. Givan, Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes, Journal of Artificial Intelligence Research, vol.25, pp.75-118, 2006.

T. Furmston and D. Barber, A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes, Proceedings of the Advances in Neural Information Processing Systems 26, pp.2726-2734, 2012.

V. Gabillon, A. Lazaric, and M. Ghavamzadeh, Rollout Allocation Strategies for Classification-based Policy Iteration, Workshop on Reinforcement Learning and Search in Very Large Spaces, 2010.

V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck, Multi-Bandit Best Arm Identification, Proceedings of the Advances in Neural Information Processing Systems 25, pp.2222-2230, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00632523

V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer, Classification-based Policy Iteration with a Critic, Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp.1049-1056, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00590972

V. Gabillon, M. Ghavamzadeh, and A. Lazaric, Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Proceedings of the Advances in Neural Information Processing Systems 26, pp.3221-3229, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00747005

V. Gabillon, M. Ghavamzadeh, and B. Scherrer, Approximate Dynamic Programming Finally Performs Well in the Game of Tetris, Proceedings of the Advances in Neural Information Processing Systems 27, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00921250

M. Ghavamzadeh and A. Lazaric, Conservative and Greedy Approaches to Classification-Based Policy Iteration, Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00772610

S. Goschin, A. Weinstein, and M. Littman, The Cross-Entropy Method Optimizes for Quantiles, Proceedings of the Thirtieth International Conference on Machine Learning, pp.1193-1201, 2013.

N. Hansen and A. Ostermeier, Completely Derandomized Self-Adaptation in Evolution Strategies, Evolutionary Computation, vol.9, issue.2, pp.159-195, 2001.

J. Honda and A. Takemura, An asymptotically optimal policy for finite support models in the multiarmed bandit problem, Machine Learning, vol.85, issue.3, pp.361-391, 2011.
DOI : 10.1007/s10994-011-5257-4

R. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960.

S. Kakade, A natural policy gradient, Proceedings of the Advances in Neural Information Processing Systems 15, pp.1531-1538, 2001.

S. Kakade and J. Langford, Approximately optimal approximate reinforcement learning, Proceedings of the Nineteenth International Conference on Machine Learning, pp.267-274, 2002.

S. Kalyanakrishnan, Learning Methods for Sequential Decision Making with Imperfect Representations, PhD thesis, University of Texas at Austin, 2011.

S. Kalyanakrishnan and P. Stone, Efficient Selection of Multiple Bandit Arms: Theory and Practice, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.511-518, 2010.

S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone, PAC Subset Selection in Stochastic Multi-armed Bandits, Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.

Z. Karnin, T. Koren, and O. Somekh, Almost Optimal Exploration in Multi-Armed Bandits, Proceedings of the Thirtieth International Conference on Machine Learning, 2013.

É. Kaufmann and S. Kalyanakrishnan, Information complexity in bandit subset selection, Proceedings of the Twenty-Sixth Conference on Learning Theory, pp.228-251, 2013.

É. Kaufmann, N. Korda, and R. Munos, Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, Proceedings of the Twenty-Fourth International Conference on Algorithmic Learning Theory, pp.199-213, 2012.
DOI : 10.1007/978-3-642-34106-9_18

URL : https://hal.archives-ouvertes.fr/hal-00830033

M. Kearns, Y. Mansour, and A. Ng, Approximate Planning in Large POMDPs via Reusable Trajectories, Proceedings of the Advances in Neural Information Processing Systems 14, pp.1001-1007, 2000.

M. Keller, J. Mccullough, D. Klein, B. Arnow, D. Dunner et al., A Comparison of Nefazodone, the Cognitive Behavioral-Analysis System of Psychotherapy, and Their Combination for the Treatment of Chronic Depression, New England Journal of Medicine, vol.342, issue.20, pp.1462-1470, 2000.
DOI : 10.1056/NEJM200005183422001

M. Lagoudakis and R. Parr, Least-Squares Policy Iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

M. Lagoudakis and R. Parr, Reinforcement Learning as Classification: Leveraging Modern Classifiers, Proceedings of the Twentieth International Conference on Machine Learning, pp.424-431, 2003.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Analysis of a Classification-based Policy Iteration Algorithm, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.607-614, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482065

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-Sample Analysis of LSTD, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.615-622, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482189

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-Sample Analysis of Least-Squares Policy Iteration, Journal of Machine Learning Research, vol.13, pp.3041-3074, 2012.
URL : https://hal.archives-ouvertes.fr/inria-00528596

O. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh, Finite-Sample Analysis of Bellman Residual Minimization, Proceedings of the Second Asian Conference on Machine Learning, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00830212

S. Mannor and J. Tsitsiklis, The Sample Complexity of Exploration in the Multi- Armed Bandit Problem, Journal of Machine Learning Research, vol.5, pp.623-648, 2004.

S. Mannor, R. Rubinstein, and Y. Gat, The Cross Entropy method for Fast Policy Search, Proceedings of the Twentieth International Conference on Machine Learning, pp.512-519, 2003.

P. Marbach and J. Tsitsiklis, A Neuro-Dynamic Programming Approach to Call Admission Control in Integrated Service Networks: The Single Link Case, 1997.

O. Maron and A. Moore, Hoeffding races: Accelerating model selection search for classification and function approximation, Proceedings of the Advances in Neural Information Processing Systems 7, 1993.

H. Masnadi-Shirazi, N. Vasconcelos, and A. Iranmehr, Cost-Sensitive Support Vector Machines, 2012.

A. Maurer and M. Pontil, Empirical Bernstein Bounds and Sample-Variance Penalization, Proceedings of the Twenty-Second Conference on Learning Theory, 2009.

V. Mnih, C. Szepesvári, and J. Audibert, Empirical Bernstein stopping, Proceedings of the Twenty-Fifth International Conference on Machine Learning, pp.672-679, 2008.
DOI : 10.1145/1390156.1390241

URL : https://hal.archives-ouvertes.fr/hal-00834983

R. Munos, Error Bounds for Approximate Policy Iteration, Proceedings of the Twentieth International Conference on Machine Learning, pp.560-567, 2003.

R. Munos, Performance Bounds in $L_p$-norm for Approximate Value Iteration, SIAM Journal on Control and Optimization, vol.46, issue.2, pp.541-561, 2007.
DOI : 10.1137/040614384

R. Munos and C. Szepesvári, Finite-Time Bounds for Fitted Value Iteration, Journal of Machine Learning Research, vol.9, pp.815-857, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00120882

A. Ng, H. J. Kim, M. Jordan, and S. Sastry, Autonomous Inverted Helicopter Flight via Reinforcement Learning, International Symposium on Experimental Robotics, 2004.
DOI : 10.1007/11552246_35

J. van Nunen, A set of successive approximation methods for discounted Markovian decision problems, Zeitschrift für Operations Research, vol.20, issue.5, pp.203-208, 1976.
DOI : 10.1007/BF01920264

O. Pietquin, M. Geist, S. Chandramohan, and H. Frezza-buet, Sample-efficient batch reinforcement learning for dialogue management optimization, ACM Transactions on Speech and Language Processing, vol.7, issue.3, pp.1-7, 2011.
DOI : 10.1145/1966407.1966412

URL : https://hal.archives-ouvertes.fr/hal-00617517

B. Pires, M. Ghavamzadeh, and C. Szepesvári, Cost-sensitive Multiclass Classification Risk Bounds, Proceedings of the Thirtieth International Conference on Machine Learning, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00840485

D. Precup, R. Sutton, and S. Singh, Eligibility Traces for Off-Policy Policy Evaluation, Proceedings of the Seventeenth International Conference on Machine Learning, pp.759-766, 2000.

D. Precup, R. Sutton, and S. Dasgupta, Off-Policy Temporal Difference Learning with Function Approximation, Proceedings of the Eighteenth International Conference on Machine Learning, pp.417-424, 2001.

M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 1994.
DOI : 10.1002/9780470316887

M. Puterman and M. Shin, Modified Policy Iteration Algorithms for Discounted Markov Decision Problems, Management Science, vol.24, issue.11, pp.1127-1137, 1978.
DOI : 10.1287/mnsc.24.11.1127

I. Rexakis and M. Lagoudakis, Directed Policy Search Using Relevance Vector Machines, 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pp.25-32, 2012.
DOI : 10.1109/ICTAI.2012.13

H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, vol.58, issue.5, pp.527-535, 1952.
DOI : 10.1090/S0002-9904-1952-09620-8

R. Rubinstein and D. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer, 2004.

B. Scherrer, Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris, Journal of Machine Learning Research, vol.14, pp.1175-1221, 2013.
URL : https://hal.archives-ouvertes.fr/inria-00185271

B. Scherrer and B. Lesner, On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes, Proceedings of the Advances in Neural Information Processing Systems, pp.1835-1843, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758809

B. Scherrer, M. Ghavamzadeh, V. Gabillon, and M. Geist, Approximate Modified Policy Iteration, Proceedings of the Twenty-Ninth International Conference on Machine Learning, pp.1207-1214, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758882

P. Schweitzer and A. Seidman, Generalized polynomial approximations in Markovian decision processes, Journal of Mathematical Analysis and Applications, vol.110, issue.2, pp.568-582, 1985.
DOI : 10.1016/0022-247X(85)90317-8

D. I. Simester, P. Sun, and J. Tsitsiklis, Dynamic Catalog Mailing Policies, Management Science, vol.52, issue.5, pp.683-696, 2006.
DOI : 10.1287/mnsc.1050.0504

R. Sutton, Temporal credit assignment in reinforcement learning, PhD thesis, University of Massachusetts Amherst, 1984.

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

Cs. Szepesvári, Reinforcement Learning Algorithms for MDPs, Wiley Encyclopedia of Operations Research and Management Science, 2010.

I. Szita and A. Lőrincz, Learning Tetris Using the Noisy Cross-Entropy Method, Neural Computation, vol.18, issue.12, pp.2936-2941, 2006.
DOI : 10.1162/neco.2006.18.12.2936

G. Tesauro, TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play, Neural Computation, vol.6, issue.2, pp.215-219, 1994.
DOI : 10.1162/neco.1994.6.2.215

C. Thiéry and B. Scherrer, Building Controllers for Tetris, International Computer Games Association Journal, vol.32, issue.1, pp.3-11, 2009.
DOI : 10.3233/ICG-2009-32102

C. Thiéry and B. Scherrer, Improvements on Learning Tetris with Cross Entropy, International Computer Games Association Journal, vol.32, issue.1, 2009.

C. Thiéry and B. Scherrer, Least-Squares λ-Policy Iteration: Bias-Variance Trade-off in Control Problems, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.1071-1078, 2010.

C. Thiéry and B. Scherrer, MDPTetris features documentation, 2010.

C. Thiéry and B. Scherrer, Performance Bound for Approximate Optimistic Policy Iteration, 2010.

J. Tsitsiklis and B. Van Roy, Feature-Based Methods for Large Scale Dynamic Programming, Machine Learning, vol.22, pp.59-94, 1996.

J. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, issue.5, pp.674-690, 1997.
DOI : 10.1109/9.580874

T. Wang, N. Viswanathan, and S. Bubeck, Multiple Identifications in Multi-Armed Bandits, Proceedings of the Thirtieth International Conference on Machine Learning, pp.258-265, 2013.

Y. Wang, J. Audibert, and R. Munos, Algorithms for Infinitely Many-Armed Bandits, Proceedings of the Advances in Neural Information Processing Systems, pp.1729-1736, 2008.