Improved Algorithms for Linear Stochastic Bandits, Proceedings of the Advances in Neural Information Processing Systems 25, pp.2312-2320 ,
Selective sampling algorithms for cost-sensitive multiclass prediction, Proceedings of the Thirtieth International Conference on Machine Learning, p.135, 2013. ,
Fitted Q-iteration in continuous action-space MDPs, Proceedings of the Advances in Neural Information Processing Systems 21, pp.9-16, 2007. ,
URL : https://hal.archives-ouvertes.fr/inria-00185311
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, vol.22, issue.1, pp.89-129, 2008. ,
DOI : 10.1007/s10994-007-5038-2
URL : https://hal.archives-ouvertes.fr/hal-00830201
Active learning in heteroscedastic noise, Theoretical Computer Science, vol.411, issue.29-30, pp.2712-2728, 2010. ,
DOI : 10.1016/j.tcs.2010.04.007
Tuning Bandit Algorithms in Stochastic Environments, Proceedings of the Eighteenth International Conference on Algorithmic Learning Theory, pp.150-165, 2007. ,
DOI : 10.1093/biomet/25.3-4.285
URL : https://hal.archives-ouvertes.fr/inria-00203487
Best Arm Identification in Multi-Armed Bandits, Proceedings of the Twenty-Third Conference on Learning Theory, pp.41-53, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00654404
Finite-time analysis of the multi-armed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002. ,
DOI : 10.1023/A:1013689704352
The Nonstochastic Multiarmed Bandit Problem, SIAM Journal on Computing, vol.32, issue.1, pp.48-77, 2003. ,
DOI : 10.1137/S0097539701398375
Residual Algorithms: Reinforcement Learning with Function Approximation, Proceedings of the Twelfth International Conference on Machine Learning, pp.30-37, 1995. ,
DOI : 10.1016/B978-1-55860-377-6.50013-X
Neuron-Like Elements that can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, vol.13, issue.34, pp.835-846, 1983. ,
Dynamic Programming, p.17, 1957. ,
MultiBoost: A Multi-purpose Boosting Package, Journal of Machine Learning Research, vol.13, issue.143, pp.549-553, 2012. ,
Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, p.58, 1996. ,
Neuro-Dynamic Programming, Athena Scientific, vol.16, issue.81, pp.15-70, 1996. ,
Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning, vol.22, issue.34, pp.33-57, 1996. ,
Pure Exploration in Multi-armed Bandits Problems, Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pp.23-37, 2009. ,
How to Lose at Tetris, The Mathematical Gazette, vol.81, issue.491, pp.194-200, 1997. ,
DOI : 10.2307/3619195
Playing Tetris using bandit-based Monte-Carlo planning, AISB Symposium: AI and Games, p.80, 2011. ,
(Approximate) iterated successive approximations algorithm for sequential decision processes, Annals of Operations Research, vol.3, issue.3, pp.1-12 ,
DOI : 10.1007/s10479-012-1073-x
Kullback–Leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics, vol.41, issue.3, pp.1516-1541, 2013. ,
DOI : 10.1214/13-AOS1119SUPP
Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits, Proceedings of the Twenty-Second International Conference on Algorithmic Learning Theory, pp.189-203 ,
DOI : 10.1007/978-3-642-24412-4_17
URL : https://hal.archives-ouvertes.fr/hal-00659696
LIBSVM, ACM Transactions on Intelligent Systems and Technology, vol.2, issue.3, pp.1-27, 2011. ,
DOI : 10.1145/1961189.1961199
An empirical evaluation of Thompson sampling, Proceedings of the Advances in Neural Information Processing Systems 25, pp.2249-2257, 2011. ,
The Price of Bandit Information for Online Optimization, Proceedings of the Advances in Neural Information Processing Systems 21, p.26, 2007. ,
Tetris is Hard, Even to Approximate, Proceedings of the Ninth International Computing and Combinatorics Conference, pp.351-363, 2003. ,
DOI : 10.1007/3-540-45071-8_36
Active learning for personalizing treatment, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), p.30, 2011. ,
DOI : 10.1109/ADPRL.2011.5967348
Rollout sampling approximate policy iteration, Machine Learning, vol.4, issue.1, pp.157-171, 2008. ,
DOI : 10.1007/s10994-008-5069-3
Tree-Based Batch Mode Reinforcement Learning, Journal of Machine Learning Research, vol.6, issue.8, pp.503-556, 2005. ,
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems, Journal of Machine Learning Research, vol.7, issue.122, pp.1079-1105, 2006. ,
Error Propagation for Approximate Policy and Value Iteration, Proceedings of the Advances in Neural Information Processing Systems 24, pp.568-576, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00830154
Generalized Classification-based Approximate Policy Iteration, Proceedings of the European Workshop on Reinforcement Learning (EWRL), pp.1-11, 2012. ,
DOI : 10.1109/tac.2015.2418411
Tetris: A Study of Randomized Constraint Sampling, 2006. ,
DOI : 10.1007/1-84628-095-8_6
Approximate policy iteration with a policy language bias, Proceedings of the Advances in Neural Information Processing Systems 18, 2004. ,
Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes, Journal of Artificial Intelligence Research, vol.25, issue.2, pp.75-118, 2006. ,
A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes, Proceedings of the Advances in Neural Information Processing Systems 26, pp.2726-2734, 2012. ,
Rollout Allocation Strategies for Classification-based Policy Iteration, Workshop on Reinforcement Learning and Search in Very Large Spaces, p.134, 2010. ,
Multi-Bandit Best Arm Identification, Proceedings of the Advances in Neural Information Processing Systems 25, pp.2222-2230 ,
URL : https://hal.archives-ouvertes.fr/hal-00632523
Classification-based Policy Iteration with a Critic, Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp.1049-1056, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-00590972
Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Proceedings of the Advances in Neural Information Processing Systems 26, pp.3221-3229 ,
URL : https://hal.archives-ouvertes.fr/hal-00747005
Approximate Dynamic Programming Finally Performs Well in the Game of Tetris, Proceedings of the Advances in Neural Information Processing Systems 27, p.57, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00921250
Conservative and Greedy Approaches to Classification-Based Policy Iteration, Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp.2012-2034 ,
URL : https://hal.archives-ouvertes.fr/hal-00772610
The Cross-Entropy Method Optimizes for Quantiles, Proceedings of the Thirtieth International Conference on Machine Learning, pp.1193-1201, 2013. ,
Completely Derandomized Self-Adaptation in Evolution Strategies, Evolutionary Computation, vol.9, issue.2, pp.159-195, 2001. ,
DOI : 10.1016/0004-3702(95)00124-7
An asymptotically optimal policy for finite support models in the multiarmed bandit problem, Machine Learning, pp.361-391 ,
DOI : 10.1007/s10994-011-5257-4
Dynamic Programming and Markov Processes, p.14, 1960. ,
A natural policy gradient, Proceedings of the Advances in Neural Information Processing Systems 15, pp.1531-1538, 2001. ,
Approximately optimal approximate reinforcement learning, Proceedings of the 19th International Conference on Machine Learning, pp.267-274, 2002. ,
Learning Methods for Sequential Decision Making with Imperfect Representations, 2011. ,
Efficient Selection of Multiple Bandit Arms: Theory and Practice, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.511-518, 2010. ,
PAC Subset Selection in Stochastic Multi-armed Bandits, Proceedings of the Twenty-Ninth International Conference on Machine Learning, pp.28-117, 2012. ,
Almost Optimal Exploration in Multi-Armed Bandits, Proceedings of the Thirtieth International Conference on Machine Learning, pp.28-29, 2013. ,
Information complexity in bandit subset selection, Proceedings of the Twenty-Sixth Conference on Learning Theory, pp.228-251, 2013. ,
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, Proceedings of the Twenty-Third International Conference on Algorithmic Learning Theory, pp.199-213, 2012. ,
DOI : 10.1007/978-3-642-34106-9_18
URL : https://hal.archives-ouvertes.fr/hal-00830033
Approximate Planning in Large POMDPs via Reusable Trajectories, Proceedings of the Advances in Neural Information Processing Systems 14, pp.1001-1007, 2000. ,
A Comparison of Nefazodone, the Cognitive Behavioral-Analysis System of Psychotherapy, and Their Combination for the Treatment of Chronic Depression, New England Journal of Medicine, vol.342, issue.20, pp.1462-1470, 2000. ,
DOI : 10.1056/NEJM200005183422001
Least-Squares Policy Iteration, Journal of Machine Learning Research, vol.4, issue.75, pp.1107-1149, 2003. ,
Reinforcement Learning as Classification: Leveraging Modern Classifiers, Proceedings of the Twentieth International Conference on Machine Learning, pp.424-431, 2003. ,
Analysis of a Classification-based Policy Iteration Algorithm, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.607-614, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00482065
Analysis of a Classification-based Policy Iteration Algorithm, pp.34-41 ,
URL : https://hal.archives-ouvertes.fr/inria-00482065
Finite-Sample Analysis of LSTD, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.615-622 ,
URL : https://hal.archives-ouvertes.fr/inria-00482189
Finite-Sample Analysis of Least-Squares Policy Iteration, Journal of Machine Learning Research, vol.13, issue.56, pp.3041-3074, 2012. ,
URL : https://hal.archives-ouvertes.fr/inria-00528596
Finite-Sample Analysis of Bellman Residual Minimization, Proceedings of the Second Asian Conference on Machine Learning, p.54, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00830212
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, Journal of Machine Learning Research, vol.5, pp.623-648, 2004. ,
The Cross Entropy Method for Fast Policy Search, Proceedings of the Twentieth International Conference on Machine Learning, pp.512-519, 2003. ,
A Neuro-Dynamic Programming Approach to Call Admission Control in Integrated Service Networks: The Single Link Case, 1997. ,
Hoeffding races: Accelerating model selection search for classification and function approximation, Proceedings of the Advances in Neural Information Processing Systems 7, p.28, 1993. ,
Cost-Sensitive Support Vector Machines, p.90, 2012. ,
Empirical Bernstein Bounds and Sample-Variance Penalization, Proceedings of the Twenty-Second Conference on Learning Theory, p.123, 2009. ,
Empirical Bernstein Stopping, Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp.672-679, 2008. ,
DOI : 10.1145/1390156.1390241
URL : https://hal.archives-ouvertes.fr/hal-00834983
Error Bounds for Approximate Policy Iteration, Proceedings of the Twentieth International Conference on Machine Learning, pp.560-567, 2003. ,
Performance Bounds in $L_p$-norm for Approximate Value Iteration, SIAM Journal on Control and Optimization, vol.46, issue.2, pp.541-561, 2007. ,
DOI : 10.1137/040614384
Finite-Time Bounds for Fitted Value Iteration, Journal of Machine Learning Research, vol.9, issue.70, pp.815-857, 2008. ,
URL : https://hal.archives-ouvertes.fr/inria-00120882
Autonomous Inverted Helicopter Flight via Reinforcement Learning, International Symposium on Experimental Robotics, 2004. ,
DOI : 10.1007/11552246_35
A set of successive approximation methods for discounted Markovian decision problems, Zeitschrift für Operations Research, vol.29, issue.5, pp.203-208, 1976. ,
DOI : 10.1007/BF01920264
Sample-efficient batch reinforcement learning for dialogue management optimization, ACM Transactions on Speech and Language Processing, vol.7, issue.3, pp.1-7, 2011. ,
DOI : 10.1145/1966407.1966412
URL : https://hal.archives-ouvertes.fr/hal-00617517
Cost-sensitive Multiclass Classification Risk Bounds, Proceedings of the Thirtieth International Conference on Machine Learning, p.90, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00840485
Eligibility Traces for Off-Policy Policy Evaluation, Proceedings of the Seventeenth International Conference on Machine Learning, pp.759-766, 2000. ,
Off-Policy Temporal Difference Learning with Function Approximation, Proceedings of the Eighteenth International Conference on Machine Learning, pp.417-424, 2001. ,
Markov Decision Processes, p.14, 1994. ,
DOI : 10.1002/9780470316887
Modified Policy Iteration Algorithms for Discounted Markov Decision Problems, Management Science, vol.24, issue.11, pp.17-57, 1978. ,
DOI : 10.1287/mnsc.24.11.1127
Directed Policy Search Using Relevance Vector Machines, 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pp.25-32, 2012. ,
DOI : 10.1109/ICTAI.2012.13
Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, vol.58, issue.5, pp.527-535, 1952. ,
DOI : 10.1090/S0002-9904-1952-09620-8
The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, p.58, 2004. ,
Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris, Journal of Machine Learning Research, vol.14, issue.80, pp.1175-1221, 2013. ,
URL : https://hal.archives-ouvertes.fr/inria-00185271
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes, NIPS, pp.1835-1843, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00758809
Approximate Modified Policy Iteration, Proceedings of the Twenty-Ninth International Conference on Machine Learning, pp.1207-1214 ,
URL : https://hal.archives-ouvertes.fr/hal-00758882
Generalized polynomial approximations in Markovian decision processes, Journal of Mathematical Analysis and Applications, vol.110, issue.2, pp.568-582, 1985. ,
DOI : 10.1016/0022-247X(85)90317-8
Dynamic Catalog Mailing Policies, Management Science, vol.52, issue.5, pp.683-696, 2006. ,
DOI : 10.1287/mnsc.1050.0504
Temporal credit assignment in reinforcement learning, 1984. ,
Reinforcement Learning: An Introduction, IEEE Transactions on Neural Networks, vol.9, issue.5, 1998. ,
DOI : 10.1109/TNN.1998.712192
Reinforcement Learning Algorithms for MDPs, Wiley Encyclopedia of Operations Research, p.19, 2010. ,
Learning Tetris Using the Noisy Cross-Entropy Method, Neural Computation, vol.18, issue.12, pp.2936-2941, 2006. ,
DOI : 10.1007/s10479-005-5732-z
TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play, Neural Computation, vol.23, issue.2, pp.215-219, 1994. ,
DOI : 10.1162/neco.1989.1.3.321
Building Controllers for Tetris, International Computer Games Association Journal, pp.3-11, 2009. ,
DOI : 10.3233/ICG-2009-32102
Improvements on Learning Tetris with Cross Entropy, International Computer Games Association Journal, pp.81-82, 2009. ,
Least-Squares λ-Policy Iteration: Bias-Variance Trade-off in Control Problems, Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp.1071-1078, 2010. ,
MDPTetris features documentation, p.81, 2010. ,
Performance Bound for Approximate Optimistic Policy Iteration, p.67, 2010. ,
Feature-Based Methods for Large Scale Dynamic Programming, Machine Learning, pp.59-94, 1996. ,
An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, issue.5, pp.674-690, 1997. ,
DOI : 10.1109/9.580874
Multiple Identifications in Multi-Armed Bandits, Proceedings of the Thirtieth International Conference on Machine Learning, pp.258-265, 2013. ,
Algorithms for Infinitely Many-Armed Bandits, NIPS, pp.1729-1736, 2008. ,