S. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, vol.10, issue.2, pp.251-276, 1998.

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Conference On Learning Theory (COLT 06), pp.574-588, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00830201

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, pp.89-129, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00830201

P. Auer and R. Ortner, Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning, in Advances in Neural Information Processing Systems (NIPS 19), pp.49-56, 2007.

L. C. Baird, Advantage Updating, Technical Report WL-TR-93-1146, 1993.

L. C. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, Proceedings of the International Conference on Machine Learning (ICML 95), pp.30-37, 1995.

R. Bellman, A Markovian Decision Process, Indiana University Mathematics Journal, vol.6, issue.4, pp.679-684, 1957.
DOI : 10.1512/iumj.1957.6.56038

R. Bellman, Dynamic Programming, 1957.

R. Bellman, R. Kalaba, and B. Kotkin, Polynomial approximation - a new computational technique in dynamic programming: allocation processes, Mathematics of Computation, vol.17, pp.155-161, 1963.

D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, The Complexity of Decentralized Control of Markov Decision Processes, Mathematics of Operations Research, vol.27, issue.4, pp.819-840, 2002.
DOI : 10.1287/moor.27.4.819.297

D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3), Athena Scientific, 1996.

S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Incremental Natural Actor-Critic Algorithms, Conference on Neural Information Processing Systems (NIPS), 2007.

C. M. Bishop, Neural Networks for Pattern Recognition, 1995.

V. S. Borkar, Controlled diffusion processes, Probability Surveys, pp.213-244, 2005.

J. A. Boyan, Least-squares temporal difference learning, Proceedings of the 16th International Conference on Machine Learning (ICML 99), pp.49-56, 1999.

J. A. Boyan, Technical Update: Least-Squares Temporal Difference Learning, Machine Learning, pp.233-246, 2002.

S. J. Bradtke and A. G. Barto, Linear Least-Squares algorithms for temporal difference learning, Machine Learning, pp.33-57, 1996.

R. I. Brafman and M. Tennenholtz, R-max -A general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, vol.3, pp.213-231, 2002.

D. Choi and B. Van Roy, A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, Discrete Event Dynamic Systems, pp.207-239, 2006.

R. Dearden, N. Friedman, and D. Andre, Model-Based Bayesian Exploration, Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pp.150-165, 1999.

R. Dearden, N. Friedman, and S. J. Russell, Bayesian Q-Learning, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pp.761-768, 1998.

T. Degris, O. Sigaud, and P. Wuillemin, Chi-square Tests Driven Method for Learning the Structure of Factored MDPs, Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 06), pp.122-129, 2006.
URL : https://hal.archives-ouvertes.fr/hal-01351133

C. Dimitrakakis and S. Bengio, Estimates of Parameter Distributions for Optimal Action Selection, Technical Report IDIAP-RR 04-72, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP), 2005.

M. O. Duff, Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes, PhD thesis, 2002.

Y. Engel, Algorithms and Representations for Reinforcement Learning, PhD thesis, 2005.

Y. Engel, S. Mannor, and R. Meir, Bayes Meets Bellman : The Gaussian Process Approach to Temporal Difference Learning, Proceedings of the International Conference on Machine Learning (ICML 03), pp.154-161, 2003.

Y. Engel, S. Mannor, and R. Meir, The Kernel Recursive Least-Squares Algorithm, IEEE Transactions on Signal Processing, vol.52, issue.8, pp.2275-2285, 2004.
DOI : 10.1109/TSP.2004.830985

Y. Engel, S. Mannor, and R. Meir, Reinforcement learning with Gaussian processes, Proceedings of the 22nd international conference on Machine learning , ICML '05, 2005.
DOI : 10.1145/1102351.1102377

D. Ernst, P. Geurts, and L. Wehenkel, Tree-Based Batch Mode Reinforcement Learning, Journal of Machine Learning Research, vol.6, pp.503-556, 2005.

L. A. Feldkamp, T. M. Feldkamp, and D. V. Prokhorov, Neural network training with the nprKF, IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), pp.109-114, 2001.
DOI : 10.1109/IJCNN.2001.939001

L. A. Feldkamp and G. V. Puskorius, A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification, Proceedings of the IEEE, pp.2259-2277, 1998.

N. Ferns, P. Panangaden, and D. Precup, Metrics for Finite Markov Decision Processes, Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI 04), pp.162-178, 2004.

N. Ferns, P. Panangaden, and D. Precup, Metrics for Markov Decision Processes with Infinite State Spaces, Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI 05), p.201, 2005.

N. Ferns, P. S. Castro, D. Precup, and P. Panangaden, Methods for computing state similarity in Markov Decision Processes, Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 06), 2006.

M. Geist, O. Pietquin, and G. Fricout, A Sparse Nonlinear Bayesian Online Kernel Regression, 2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences, pp.199-204, 2008.
DOI : 10.1109/ADVCOMP.2008.7

URL : https://hal.archives-ouvertes.fr/hal-00327081

M. Geist, O. Pietquin, and G. Fricout, Bayesian Reward Filtering, in S. Girgin et al., editors, Proceedings of the European Workshop on Reinforcement Learning, pp.96-109, 2008.
DOI : 10.1007/978-3-540-89722-4_8

URL : https://hal.archives-ouvertes.fr/hal-00351282

M. Geist, O. Pietquin, and G. Fricout, Filtrage bayésien de la récompense, in Proceedings of the Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, pp.113-122, 2008.

M. Geist, O. Pietquin, and G. Fricout, Kalman Temporal Differences : Uncertainty and Value Function Approximation, NIPS Workshop on Model Uncertainty and Risk in Reinforcement Learning, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00351298

M. Geist, O. Pietquin, and G. Fricout, Online Bayesian kernel regression from nonlinear mapping of observations, 2008 IEEE Workshop on Machine Learning for Signal Processing, 2008.
DOI : 10.1109/MLSP.2008.4685498

URL : https://hal.archives-ouvertes.fr/hal-00335052

M. Geist, O. Pietquin, and G. Fricout, Différences Temporelles de Kalman, in Proceedings of the Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2009.

M. Geist, O. Pietquin, and G. Fricout, Différences Temporelles de Kalman : le cas stochastique, in Proceedings of the Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2009.

M. Geist, O. Pietquin, and G. Fricout, From Supervised to Reinforcement Learning : a Kernel-based Bayesian Filtering Framework, International Journal On Advances in Software, vol.2, issue.1, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00429891

M. Geist, O. Pietquin, and G. Fricout, Kalman Temporal Differences: The deterministic case, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
DOI : 10.1109/ADPRL.2009.4927543

URL : https://hal.archives-ouvertes.fr/hal-00380870

M. Geist, O. Pietquin, and G. Fricout, Kernelizing Vector Quantization Algorithms, European Symposium on Artificial Neural Networks, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00429892

M. Geist, O. Pietquin, and G. Fricout, Tracking in Reinforcement Learning, Proceedings of the 16th International Conference on Neural Information Processing, 2009.
DOI : 10.1007/978-3-642-10677-4_57

URL : https://hal.archives-ouvertes.fr/hal-00439316

M. Geist, O. Pietquin, and G. Fricout, Astuce du Noyau & Quantification Vectorielle, in Proceedings of the 17th colloque sur la Reconnaissance des Formes et l'Intelligence Artificielle (RFIA'10), 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553114

A. Geramifard, M. Bowling, and R. S. Sutton, Incremental Least-Squares Temporal Difference Learning, 21st Conference of the American Association for Artificial Intelligence (AAAI 06), pp.356-361, 2006.

Z. Ghahramani, Learning dynamic Bayesian networks, Adaptive Processing of Sequences and Data Structures, International Summer School on Neural Networks, "E.R. Caianiello"-Tutorial Lectures, pp.168-197, 1998.
DOI : 10.1007/BFb0053999

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.462.2249

G. Gordon, Stable Function Approximation in Dynamic Programming, Proceedings of the International Conference on Machine Learning (ICML 95), 1995.
DOI : 10.1016/B978-1-55860-377-6.50040-2

M. S. Grewal and A. P. Andrews, Kalman Filtering: Theory and Practice, 1993.

H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters, Adaptive importance sampling for value function approximation in off-policy reinforcement learning, Neural Networks, vol.22, issue.10, 2009.
DOI : 10.1016/j.neunet.2009.01.002

R. A. Howard, Dynamic Programming and Markov Processes, 1960.

S. E. Jo and S. W. Kim, Consistent Normalized Least Mean Square Filtering with Noisy Data Matrix, IEEE Transactions on Signal Processing, vol.53, issue.6, pp.2112-2123, 2005.

S. J. Julier and J. K. Uhlmann, Unscented filtering and nonlinear estimation, Proceedings of the IEEE, pp.401-422, 2004.

S. J. Julier, The scaled unscented transformation, American Control Conference, pp.4555-4559, 2002.

S. J. Julier and J. K. Uhlmann, A new extension of the Kalman filter to nonlinear systems, Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, 1997.

T. Jung and P. Stone, Feature Selection for Value Function Approximation Using Bayesian Model Selection, Proceedings of the European Conference on Machine Learning, 2009.

L. P. Kaelbling, Learning in Embedded Systems, 1993.

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence, vol.101, issue.1-2, pp.99-134, 1998.
DOI : 10.1016/S0004-3702(98)00023-X

S. Kakade, M. J. Kearns, and J. Langford, Exploration in Metric State Spaces, International Conference on Machine Learning (ICML 03), pp.306-312, 2003.

M. Kearns and S. Singh, Near-Optimal Reinforcement Learning in Polynomial Time, Machine Learning, pp.209-232, 2002.

P. W. Keller, S. Mannor, and D. Precup, Automatic basis function construction for approximate dynamic programming and reinforcement learning, Proceedings of the 23rd International Conference on Machine Learning (ICML 06), pp.449-456, 2006.

C. Kim, Time-varying parameter models with endogenous regressors, Economics Letters, vol.91, issue.1, pp.21-26, 2006.
DOI : 10.1016/j.econlet.2005.10.007

H. Kimura and S. Kobayashi, An Analysis of Actor-Critic Algorithms Using Eligibility Traces : Reinforcement Learning with Imperfect Value Function, Proceedings of the Fifteenth International Conference on Machine Learning (ICML 98), pp.278-286, 1998.

J. Z. Kolter and A. Y. Ng, Near-Bayesian Exploration in Polynomial Time, Proceedings of the 26th International Conference on Machine Learning (ICML 09), 2009.

V. R. Konda and J. N. Tsitsiklis, Actor-Critic Algorithms, Advances in Neural Information Processing Systems (NIPS 12), 2000.

V. R. Konda and J. N. Tsitsiklis, On Actor-Critic Algorithms, SIAM Journal on Control and Optimization, vol.42, issue.4, pp.1143-1166, 2003.

M. G. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

B. R. Leffler, M. L. Littman, A. L. Strehl, and T. J. Walsh, Efficient Exploration With Latent Structure, Robotics: Science and Systems I, 2005.
DOI : 10.15607/RSS.2005.I.011

L. Li, M. L. Littman, and C. R. Mansley, Online exploration in least-squares policy iteration, Proceedings of the Conference for research in autonomous agents and multi-agent systems (AAMAS-09), 2009.

M. L. Littman, The Witness Algorithm: Solving Partially Observable Markov Decision Processes, 1994.

M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, Efficient dynamic-programming updates in partially observable Markov decision processes, Technical Report CS-95-19, 1995.

S. Mahadevan and M. Maggioni, Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes, 2006.

T. Morimura, E. Uchibe, and K. Doya, Utilizing the Natural Gradient in Temporal Difference Reinforcement Learning with Eligibility Traces, 2nd International Symposium on Information Geometry and its Applications, pp.256-263, 2005.

D. Ormoneit and S. Sen, Kernel-Based Reinforcement Learning, Machine Learning, pp.161-178, 2002.

J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing, vol.71, issue.7-9, pp.1180-1190, 2008.
DOI : 10.1016/j.neucom.2007.11.026

J. Peters, S. Vijayakumar, and S. Schaal, Natural Actor-Critic, in Proceedings of the European Conference on Machine Learning (ECML 2005), Lecture Notes in Artificial Intelligence, 2005.

J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement Learning for Humanoid Robotics, Third IEEE-RAS International Conference on Humanoid Robots, 2003.

C. W. Phua and R. Fitch, Tracking value function dynamics to improve reinforcement learning with piecewise linear function approximation, Proceedings of the International Conference on Machine Learning (ICML 07), 2007.

P. Poupart, N. Vlassis, J. Hoey, and K. Regan, An analytic solution to discrete Bayesian reinforcement learning, Proceedings of the 23rd international conference on Machine learning , ICML '06, pp.697-704, 2006.
DOI : 10.1145/1143844.1143932

D. Precup, R. S. Sutton, and S. P. Singh, Eligibility Traces for Off-Policy Policy Evaluation, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 00), pp.759-766, 2000.

P. Preux, S. Girgin, and M. Loth, Feature discovery in approximate dynamic programming, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
DOI : 10.1109/ADPRL.2009.4927533

URL : https://hal.archives-ouvertes.fr/hal-00351144

G. V. Puskorius and L. A. Feldkamp, Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.279-297, 1994.
DOI : 10.1109/72.279191

G. V. Puskorius and L. A. Feldkamp, Roles of learning rates, artificial process noise and square-root filtering for extended Kalman filter training, Proceedings of the International Joint Conference on Neural Networks (IJCNN 99), pp.1809-1814, 1999.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, 2006.

I. Rivals and L. Personnaz, A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models, Neurocomputing, vol.20, issue.1-3, pp.279-294, 1998.
DOI : 10.1016/S0925-2312(98)00021-6

URL : https://hal.archives-ouvertes.fr/hal-00797391

Y. Sakaguchi and M. Takano, Reliability of internal prediction/estimation and its application. I. Adaptive action selection reflecting reliability of value function, Neural Networks, vol.17, issue.7, pp.935-952, 2004.
DOI : 10.1016/j.neunet.2004.05.004

D. Schneegass, On the bias of batch Bellman residual minimisation, Neurocomputing, vol.72, issue.7-9, 2009.
DOI : 10.1016/j.neucom.2008.11.024

R. Schoknecht, Optimality of Reinforcement Learning Algorithms with Linear Function Approximation, Neural Information Processing Systems (NIPS 15), 2002.

B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 2001.

W. Schultz, P. Dayan, and P. R. Montague, A Neural Substrate of Prediction and Reward, Science, vol.275, issue.5306, pp.1593-1599, 1997.
DOI : 10.1126/science.275.5306.1593

O. Sigaud and O. Buffet, Processus décisionnels de Markov en intelligence artificielle, 2008.

D. Simon, Optimal State Estimation : Kalman, H Infinity, and Nonlinear Approaches, 2006.
DOI : 10.1002/0470045345

H. W. Sorenson, Approximate solutions of the nonlinear filtering problem, 1977 IEEE Conference on Decision and Control including the 16th Symposium on Adaptive Processes and A Special Symposium on Fuzzy Set Theory and Applications, pp.620-625, 1977.
DOI : 10.1109/CDC.1977.271646

A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, PAC Model-Free Reinforcement Learning, 23rd International Conference on Machine Learning, pp.881-888, 2006.
DOI : 10.1145/1143844.1143955

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.326

A. L. Strehl and M. L. Littman, An Empirical Evaluation of Interval Estimation for Markov Decision Processes, 16th IEEE International Conference on Tools with Artificial Intelligence, pp.128-135, 2004.

A. L. Strehl and M. L. Littman, An Analysis of Model-Based Interval Estimation for Markov Decision Processes, Journal of Computer and System Sciences, 2006.

M. Strens, A Bayesian Framework for Reinforcement Learning, Proceedings of the 17th International Conference on Machine Learning (ICML 00), pp.943-950, 2000.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), 1998.

R. S. Sutton, A. Koop, and D. Silver, On the role of tracking in stationary environments, Proceedings of the 24th international conference on Machine learning, ICML '07, pp.871-878, 2007.
DOI : 10.1145/1273496.1273606

R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, Policy Gradient Methods for Reinforcement Learning with Function Approximation, Neural Information Processing Systems (NIPS), pp.1057-1063, 1999.

R. S. Sutton, D. Precup, and S. P. Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artificial Intelligence, vol.112, issue.1-2, pp.181-211, 1999.
DOI : 10.1016/S0004-3702(99)00052-1

P. Sykacek and S. Roberts, Adaptive Classification by Variational Kalman Filtering, Neural Information Processing Systems (NIPS 15), 2002.

T. Söderström and P. Stoica, Instrumental variable methods for system identification, Circuits, Systems, and Signal Processing, pp.1-9, 2002.

G. Tesauro, Temporal difference learning and TD-Gammon, Communications of the ACM, vol.38, issue.3, March 1995.
DOI : 10.1145/203330.203343

W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of two samples, Biometrika, vol.25, pp.285-294, 1933.

E. Thorndike, Educational psychology : the psychology of learning, 1913.

J. N. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, pp.674-690, 1997.

T. Ueno, M. Kawanabe, T. Mori, S. Maeda, and S. Ishii, A semiparametric statistical approach to model-free policy evaluation, Proceedings of the 25th international conference on Machine learning, ICML '08, 2008.
DOI : 10.1145/1390156.1390291

R. van der Merwe, Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models, PhD thesis, 2004.

R. van der Merwe, A. Doucet, N. de Freitas, and E. Wan, The Unscented Particle Filter, Technical Report CUED, 2000.

V. Vapnik, Statistical Learning Theory, 1998.

E. A. Wan and R. van der Merwe, The unscented Kalman filter for nonlinear estimation, IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium (AS-SPCC), pp.153-158, 2000.

T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans, Bayesian sparse sampling for on-line reward optimization, Proceedings of the 22nd international conference on Machine learning , ICML '05, pp.956-963, 2005.
DOI : 10.1145/1102351.1102472

C. Watkins, Learning from Delayed Rewards, PhD thesis, 1989.

M. Wiering and J. Schmidhuber, Efficient model-based exploration, Proceedings of the fifth international conference on simulation of adaptive behavior on From animals to animats 5, pp.223-228, 1998.

M. Wiering and H. van Hasselt, The QV family compared to other reinforcement learning algorithms, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
DOI : 10.1109/ADPRL.2009.4927532

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, pp.229-256, 1992.

H. Yu and D. P. Bertsekas, Basis function adaptation methods for cost approximation in MDP, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
DOI : 10.1109/ADPRL.2009.4927528

K. J. Åström, Optimal control of Markov processes with incomplete state information, Journal of Mathematical Analysis and Applications, vol.10, issue.1, pp.174-205, 1965.
DOI : 10.1016/0022-247X(65)90154-X