Bibliography

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, et al., Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

N. Akchurina, Multiagent Reinforcement Learning: Algorithm Converging to Nash Equilibrium in General-Sum Discounted Stochastic Games, Proc. of AAMAS, p.38, 2009.

A. Antos, C. Szepesvári, and R. Munos, Fitted-Q Iteration in Continuous Action-Space MDPs, Proc. of NIPS, p.18, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00185311

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, p.21, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00117130

T. Archibald, K. Mckinnon, and L. Thomas, On the Generation of Markov Decision Processes, Journal of the Operational Research Society, vol.46, issue.3, pp.354-361, 1995.
DOI : 10.1057/jors.1995.50

J. A. Bagnell, S. M. Kakade, J. G. Schneider, and A. Y. Ng, Policy Search by Dynamic Programming, Proc. of NIPS, p.68, 2003.

L. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, Proc. of ICML, pp.17-117, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

B. Banerjee and J. Peng, Adaptive policy gradient in multiagent learning, Proc. of AAMAS, p.39, 2003.
DOI : 10.1145/860575.860686

R. Bellman, Dynamic Programming, p.11, 1957.

R. Bellman, R. Kalaba, and B. Kotkin, Polynomial approximation - a new computational technique in dynamic programming: Allocation processes, Mathematics of Computation, vol.17, p.18, 1963.

D. P. Bertsekas, Dynamic Programming and Optimal Control, vol.1, 1995.

V. Borkar, Stochastic approximation with two time scales, Systems & Control Letters, vol.29, issue.5, pp.291-294, 1997.
DOI : 10.1016/S0167-6911(97)90015-3

V. S. Borkar, Stochastic approximation with 'controlled Markov' noise, Systems & Control Letters, p.148, 2006.

V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, p.154, 2009.

R. N. Borkovsky, U. Doraszelski, and Y. Kryukov, A user's guide to solving dynamic stochastic games using the homotopy method, Operations Research, vol.58, 2010.

B. Bošanský, V. Lisý, M. Lanctot, J. Čermák, and M. Winands, Algorithms for computing strategies in two-player simultaneous move games, Artificial Intelligence, vol.237, pp.1-40, 2016.
DOI : 10.1016/j.artint.2016.03.005

M. Bowling and M. Veloso, Rational and Convergent Learning in Stochastic Games, Proc. of IJCAI, pp.39-146, 2001.

L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, p.52, 1984.

S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, vol.5, issue.1, pp.1-122, 2012.

M. Buro, Solving the Oshi-Zumo Game, pp.361-366, 2004.
DOI : 10.1007/978-0-387-35706-5_23

L. Busoniu, R. Babuska, and B. Schutter, A Comprehensive Survey of Multiagent Reinforcement Learning, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol.38, issue.2, p.155, 2008.
DOI : 10.1109/TSMCC.2007.913919

L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, Online least-squares policy iteration for reinforcement learning control, American Control Conference, 2010.

N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, p.39, 2006.
DOI : 10.1017/CBO9780511546921

R. Correa and A. Seeger, Directional Derivative of a Minimax Function, Nonlinear Analysis: Theory, Methods & Applications, vol.9, pp.13-22, 1985.

C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou, The Complexity of Computing a Nash Equilibrium, SIAM Journal on Computing, vol.39, issue.1, pp.195-259, 2009.
DOI : 10.1137/070699652

L. Mac Dermed and C. L. Isbell, Solving stochastic games, Proc. of NIPS, p.38, 2009.

D. Ernst, P. Geurts, and L. Wehenkel, Tree-Based Batch Mode Reinforcement Learning, Journal of Machine Learning Research, pp.503-556, 2005.

A. Farahmand, C. Szepesvári, and R. Munos, Error Propagation for Approximate Policy and Value Iteration, Proc. of NIPS, 2010.
DOI : 10.1109/tac.2015.2418411

URL : https://hal.archives-ouvertes.fr/hal-00830154

J. Filar and K. Vrieze, Competitive Markov Decision Processes, p.29, 2012.
DOI : 10.1007/978-1-4612-4054-9

J. A. Filar and B. Tolwinski, On the Algorithm of Pollatschek and Avi-Itzhak, pp.32-112, 1991.
DOI : 10.1007/978-94-011-3760-7_6

V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer, Classification-Based Policy Iteration with a Critic, Proc. of ICML, pp.1049-1056, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00590972

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Book in preparation for MIT Press, 2016.

A. Greenwald, K. Hall, and R. Serrano, Correlated Q-learning, Proc. of ICML, p.39, 2003.

S. Grunewalder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, Modelling Transition Dynamics in MDPs With RKHS Embeddings, Proc. of ICML, p.24, 2012.

T. D. Hansen, P. B. Miltersen, and U. Zwick, Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor, Journal of the ACM, vol.60, issue.1, p.34, 2013.
DOI : 10.1145/2432622.2432623
URL : http://arxiv.org/pdf/1008.0530

S. Hart and A. Mas-Colell, Uncoupled dynamics do not lead to Nash equilibrium, The American Economic Review, p.147, 2003.
DOI : 10.1142/9789814390705_0007

URL : http://www.ma.huji.ac.il/hart/papers/uncoupl.pdf

J. Heinrich, M. Lanctot, and D. Silver, Fictitious self-play in extensive-form games, Proc. of ICML, 2015.

J. Herings and R. J. Peeters, Stationary Equilibria in Stochastic Games: Structure, Selection and Computation, SSRN Electronic Journal, vol.118, p.37, 2004.
DOI : 10.2139/ssrn.357201

P. J. Herings and R. Peeters, Homotopy methods to compute equilibria in game theory, Economic Theory, vol.42, 2010.

J. Hofbauer and W. Sandholm, On the Global Convergence of Stochastic Fictitious Play, Econometrica, vol.70, issue.6, pp.2265-2294, 2002.
DOI : 10.1111/1468-0262.00376

A. J. Hoffman and R. M. Karp, On Nonterminating Stochastic Games, Management Science, vol.12, issue.5, pp.359-370, 1966.
DOI : 10.1287/mnsc.12.5.359

J. Hu and M. P. Wellman, Nash Q-Learning for General-Sum Stochastic Games, Journal of Machine Learning Research, vol.4, pp.1039-1069, 2003.

E. Kalai and E. Lehrer, Rational Learning Leads to Nash Equilibrium, Econometrica, vol.61, issue.5, p.38, 1993.
DOI : 10.2307/2951492

URL : http://www.kellogg.northwestern.edu/research/math/papers/895.pdf

N. Karmarkar, A New Polynomial-time Algorithm for Linear Programming, Proc. of ACM Symposium on Theory of Computing, p.27, 1984.
DOI : 10.1007/bf02579150

M. Kearns, Y. Mansour, and S. Singh, Fast Planning in Stochastic Games, Proc. of UAI, p.93, 2000.

D. Koller and R. Parr, Policy Iteration for Factored MDPs, Proc. of UAI, pp.326-334, 2000.

D. Koller, N. Megiddo, and B. von Stengel, Fast algorithms for finding randomized strategies in game trees, Proc. of STOC, p.27, 1994.
DOI : 10.1145/195058.195451

M. G. Lagoudakis and R. Parr, Value Function Approximation in Zero-Sum Markov Games, Proc. of UAI, p.36, 2002.

M. G. Lagoudakis and R. Parr, Least-Squares Policy Iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

M. Lanctot, K. Waugh, M. Zinkevich, and M. Bowling, Monte Carlo sampling for regret minimization in extensive games, Proc. of NIPS, p.38, 2009.

G. J. Laurent, L. Matignon, and N. Le Fort-Piat, The world of independent learners is not Markovian, International Journal of Knowledge-based and Intelligent Engineering Systems, vol.15, issue.1, 2011.
DOI : 10.3233/KES-2010-0206

URL : https://hal.archives-ouvertes.fr/hal-00601941

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of least-squares policy iteration, Journal of Machine Learning Research, vol.13, 2012.
URL : https://hal.archives-ouvertes.fr/inria-00528596

Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol.521, issue.7553, pp.436-444, 2015.
DOI : 10.1038/nature14539

D. Leslie and E. Collins, Generalised weakened fictitious play, Games and Economic Behavior, vol.56, issue.2, pp.285-298, 2006.
DOI : 10.1016/j.geb.2005.08.005

B. Lesner and B. Scherrer, Non-stationary approximate modified policy iteration, Proc. of ICML, pp.16-23, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01186664

A. S. Lewis and M. L. Overton, Nonsmooth Optimization via BFGS, p.106, 2009.
DOI : 10.1007/s10107-012-0514-2

URL : http://www.cs.nyu.edu/faculty/overton/papers/pdffiles/nsoquasi.pdf

A. S. Lewis and M. L. Overton, Nonsmooth optimization via quasi-Newton methods, Mathematical Programming, vol.141, issue.1-2, pp.135-163, 2013.
DOI : 10.1007/s10107-012-0514-2

URL : http://www.cs.nyu.edu/faculty/overton/papers/pdffiles/nsoquasi.pdf

L. Li, M. L. Littman, and C. R. Mansley, Online exploration in least-squares policy iteration, Proc. of AAMAS, p.21, 2009.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez et al., Continuous Control with Deep Reinforcement Learning, Proc. of ICLR, 2016.

M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, Proc. of ICML, p.39, 1994.
DOI : 10.1016/B978-1-55860-335-6.50027-1

URL : http://www.ee.duke.edu/~lcarin/emag/seminar_presentations/Markov_Games_Littman.pdf

H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, Toward Off-Policy Learning Control with Function Approximation, Proc. of ICML, pp.719-726, 2010.

O. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh, Finite-Sample Analysis of Bellman Residual Minimization, Proc. of ACML, pp.124-127, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00830212

C. Meyer, J. Ganascia, and J. Zucker, Learning Strategies in Games by Anticipation, Proc. of IJCAI, pp.698-707, 1997.
URL : https://hal.archives-ouvertes.fr/hal-01649000

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness et al., Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, pp.529-533, 2015.
DOI : 10.1038/nature14236

R. Munos, Performance Bounds in $L_p$-norm for Approximate Value Iteration, SIAM Journal on Control and Optimization, vol.46, issue.2, pp.541-561, 2007.
DOI : 10.1137/040614384

URL : http://hal.archives-ouvertes.fr/docs/00/12/46/85/PDF/avi_siam_final.pdf

R. Munos and C. Szepesvári, Finite-Time Bounds for Fitted Value Iteration, The Journal of Machine Learning Research, vol.9, pp.815-857, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00120882

N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Algorithmic Game Theory, p.26, 2007.
DOI : 10.1017/CBO9780511800481

J. Nocedal and S. Wright, Numerical Optimization, 2006.
DOI : 10.1007/b98874

S. D. Patek, Stochastic Shortest Path Games, SIAM Journal on Control and Optimization, vol.37, issue.3, pp.31-33, 1997.
DOI : 10.1137/S0363012996299557

URL : http://www-mit.mit.edu/dimitrib/www/sspg.pdf

J. Perolat, B. Scherrer, B. Piot, and O. Pietquin, Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games, Proc. of ICML, pp.43-91, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01153270

J. Perolat, B. Piot, B. Scherrer, and O. Pietquin, On the use of non-stationary strategies for solving two-player zero-sum Markov games, Proc. of AISTATS, p.63, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01291495

B. Piot, M. Geist, and O. Pietquin, Difference of convex functions programming for reinforcement learning, Proc. of NIPS, 2014a.
URL : https://hal.archives-ouvertes.fr/hal-01104419

B. Piot, M. Geist, and O. Pietquin, Boosted Bellman Residual Minimization Handling Expert Demonstrations, Proc. of ECML, p.24, 2014.
DOI : 10.1007/978-3-662-44851-9_35

URL : https://hal.archives-ouvertes.fr/hal-01060953

M. Pollatschek and B. Avi-Itzhak, Algorithms for Stochastic Games with Geometrical Interpretation, Management Science, vol.15, issue.7, pp.399-415, 1969.
DOI : 10.1287/mnsc.15.7.399

H. Prasad, P. La, and S. Bhatnagar, Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games, Proc. of AAMAS, p.38, 2015.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, p.53, 1994.
DOI : 10.1002/9780470316887

M. Riedmiller, Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, Proc. of ECML, p.18, 2005.
DOI : 10.1007/11564096_32

J. Robinson, An Iterative Method of Solving a Game, The Annals of Mathematics, vol.54, issue.2, pp.296-301, 1951.
DOI : 10.2307/1969530

B. Scherrer, Approximate Policy Iteration Schemes: A Comparison, Proc. of ICML, p.67, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00989982

B. Scherrer, Improved and Generalized Upper Bounds on the Complexity of Policy Iteration, Mathematics of Operations Research, vol.41, issue.3, 2016.
DOI : 10.1287/moor.2015.0753

URL : https://hal.archives-ouvertes.fr/hal-00921261

B. Scherrer and B. Lesner, On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes, Proc. of NIPS, pp.12-71, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758809

B. Scherrer, M. Ghavamzadeh, V. Gabillon, and M. Geist, Approximate Modified Policy Iteration, Proc. of ICML, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758882

L. S. Shapley, Stochastic Games, Proc. of the National Academy of Sciences of the United States of America, pp.11-171, 1953.

L. S. Shapley, Some topics in two-person games, Advances in Game Theory, p.147, 1964.

Y. Shoham and K. Leyton-Brown, Multiagent systems: Algorithmic, game-theoretic, and logical foundations, p.26, 2008.
DOI : 10.1017/CBO9780511811654

Y. Shoham, R. Powers, and T. Grenager, If multi-agent learning is the answer, what is the question?, Artificial Intelligence, vol.171, issue.7, pp.365-377, 2007.
DOI : 10.1016/j.artint.2006.02.006

G. Taylor and R. Parr, Value Function Approximation in Noisy Environments Using Locally Smoothed Regularized Approximate Linear Programs, Proc. of UAI, p.24, 2012.

J. van der Wal, Discounted Markov games: Generalized policy iteration method, Journal of Optimization Theory and Applications, vol.30, issue.1, pp.125-138, 1978.
DOI : 10.1007/BF00933260

C. J. Watkins and P. Dayan, Q-learning, Machine Learning, vol.8, pp.279-292, 1992.

M. Zinkevich, A. Greenwald, and M. Littman, Cyclic Equilibria in Markov Games, Proc. of NIPS, pp.78-125, 2006.

M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione, Regret minimization in games with incomplete information, Proc. of NIPS, p.27, 2008.