, its performance relies on the consistency of the parameter estimation which may or may not occur. We exhibit this behavior through numerical experiments and discuss it in Sec

Y. Abbasi-Yadkori and C. Szepesvári, Regret bounds for the adaptive control of linear quadratic systems, COLT, vol.21, issue.116, pp.1-26, 2011.

Y. Abbasi-Yadkori and C. Szepesvári, Bayesian optimal control of smoothly parameterized systems, Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp.24-60, 2015.

Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Improved algorithms for linear stochastic bandits, Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), p.16, 2011.

Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Online least squares estimation with self-normalized processes: An application to bandit problems. arXiv preprint arXiv:1102, p.16, 2011.

M. Abeille and A. Lazaric, Linear Thompson sampling revisited, AISTATS 2017-20th International Conference on Artificial Intelligence and Statistics, pp.2017-2046
DOI : 10.1214/17-EJS1341SI

URL : https://hal.archives-ouvertes.fr/hal-01493561

M. Abeille and A. Lazaric, Thompson sampling for linear-quadratic control problems, AISTATS, p.59, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01493564

M. Abeille, E. Sérié, A. Lazaric, and X. Brokmann, LQG for portfolio optimization. arXiv preprint, p.105, 2016.

J. D. Abernethy, C. Lee, and A. Tewari, Fighting bandits with a new kind of smoothness, Advances in Neural Information Processing Systems 28, pp.2197-2205

G. Acosta and R. G. Durán, An optimal Poincaré inequality in L^1 for convex domains, Proceedings of the American Mathematical Society, pp.195-202, 2004.

R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability, vol.27, issue.4, pp.1054-1078, 1995.

S. Agrawal and N. Goyal, Analysis of Thompson sampling for the multi-armed bandit problem, Proceedings of the 25th Annual Conference on Learning Theory (COLT), p.14, 2012.

S. Agrawal and N. Goyal, Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209, pp.16-31, 2012.

Bibliography

S. Agrawal and N. Goyal, Further optimal regret bounds for Thompson sampling, Proceedings of AI&Stats, p.15, 2013.

S. Agrawal and R. Jia, Posterior sampling for reinforcement learning: worst-case regret bounds, pp.2017-2041

R. Almgren and N. Chriss, Optimal execution of portfolio transactions, The Journal of Risk, vol.3, issue.2, pp.5-40, 2001.
DOI : 10.21314/JOR.2001.041

J. Audibert, R. Munos, and C. Szepesvári, Tuning Bandit Algorithms in Stochastic Environments, ALT, pp.150-165, 2007.

URL : https://hal.archives-ouvertes.fr/inria-00203487

P. Auer and R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, Advances in Neural Information Processing Systems, pp.49-56, 2007.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multi-armed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM Journal on Computing, vol.32, issue.1, pp.48-77, 2002.
DOI : 10.1137/S0097539701398375

URL : http://homepages.math.uic.edu/%7Elreyzin/f14_mcs548/auer02.pdf

P. L. Bartlett and A. Tewari, REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs, Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, p.22, 2009.

D. P. Bertsekas, Dynamic programming and optimal control, pp.20-127, 1995.

D. Bertsimas and A. W. Lo, Optimal control of execution costs, Journal of Financial Markets, vol.1, issue.1, pp.1-50, 1998.
DOI : 10.1016/S1386-4181(97)00012-8

URL : http://web.mit.edu/dbertsim/www/papers/Finance/Optimal%20control%20of%20execution%20costs.pdf

S. Bittanti and M. Campi, Adaptive control of linear time invariant systems: the "bet on the best" principle, Communications in Information and Systems, vol.6, issue.4, pp.299-320, 2006.
DOI : 10.4310/CIS.2006.v6.n4.a3

J. Bouchaud, J. Farmer, and F. Lillo, How markets slowly digest changes in supply and demand. arXiv.org, p.106, 2008.
DOI : 10.1016/b978-012374258-2.50006-3

URL : http://arxiv.org/pdf/0809.0822

X. Brokmann, J. Kockelkoren, J. Bouchaud, and E. Sérié, Slow decay of impact in equity markets. Available at SSRN 2471528, p.106, 2014.
DOI : 10.2139/ssrn.2471528

URL : http://arxiv.org/pdf/1407.3390

S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multiarmed bandit problems, Machine Learning, pp.1-122, 2012.
DOI : 10.1561/2200000024

URL : http://arxiv.org/pdf/1204.5721.pdf

S. Bubeck and C. Liu, Prior-free and prior-dependent regret bounds for Thompson Sampling, 2014 48th Annual Conference on Information Sciences and Systems (CISS), pp.638-646
DOI : 10.1109/CISS.2014.6814158

URL : http://www.princeton.edu/~sbubeck/NIPS13_BL.pdf

M. C. Campi and P. Kumar, Adaptive Linear Quadratic Gaussian Control: The Cost-Biased Approach Revisited, SIAM Journal on Control and Optimization, vol.36, issue.6, pp.1890-1907, 1998.
DOI : 10.1137/S0363012997317499

URL : http://black.csl.uiuc.edu/~prkumar/ps_files/adaptive_lqg_5.ps

O. Cappé, A. Garivier, O. Maillard, R. Munos, and G. Stoltz, Kullback-Leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics, pp.1516-1541

N. Cesa-Bianchi and G. Lugosi, Prediction, learning, and games, 2006.
DOI : 10.1017/CBO9780511546921

S. Chang, P. C. Cosman, and L. B. Milstein, Chernoff-Type Bounds for the Gaussian Error Function, IEEE Transactions on Communications, vol.59, issue.11, pp.2939-2944
DOI : 10.1109/TCOMM.2011.072011.100049

O. Chapelle and L. Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems 24, pp.2249-2257, 2011.

C. Chen and F. Qi, Completely monotonic function associated with the gamma functions and proof of Wallis' inequality, Tamkang Journal of Mathematics, vol.36, issue.4, pp.303-307, 2005.

C.-H. Guo, Newton's method for discrete algebraic Riccati equations when the closed-loop matrix has eigenvalues on the unit circle, SIAM J. Matrix Anal. Appl., pp.279-294, 1998.

G. M. Constantinides, Multiperiod Consumption and Investment Behavior with Convex Transactions Costs, Management Science, vol.25, issue.11, pp.1127-1137, 1979.
DOI : 10.1287/mnsc.25.11.1127

V. Dani, T. P. Hayes, and S. M. Kakade, Stochastic linear optimization under bandit feedback, COLT, pp.355-366, 2008.

V. H. de la Peña, T. L. Lai, and Q.-M. Shao, Self-normalized processes: Limit theory and statistical applications, p.17, 2009.
DOI : 10.1007/978-3-540-85636-8

J. Donier, J. Bonart, I. Mastromatteo, and J. Bouchaud, A fully consistent, minimal model for non-linear market impact, p.106, 2014.
DOI : 10.2139/ssrn.2531917

URL : http://arxiv.org/pdf/1412.0141

E. F. Fama and K. R. French, Common risk factors in the returns on stocks and bonds, Journal of Financial Economics, vol.33, issue.1, pp.3-56, 1993.
DOI : 10.1016/0304-405X(93)90023-5

URL : http://www.nes.ru/~agoriaev/Papers/Fama-French%205%20factors%20for%20stocks%20and%20vonds%20JFE93.pdf

S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári, Parametric bandits: The generalized linear case, Advances in Neural Information Processing Systems, pp.586-594, 2010.

N. Gârleanu, Portfolio choice and pricing in illiquid markets, Journal of Economic Theory, vol.144, issue.2, pp.532-564, 2009.
DOI : 10.1016/j.jet.2008.07.006

N. Gârleanu and L. Pedersen, Dynamic Trading with Predictable Returns and Transaction Costs, The Journal of Finance, vol.13, issue.6, pp.2309-2340
DOI : 10.1007/s001990050268

J. Gatheral, No-dynamic-arbitrage and market impact, Quantitative Finance, vol.8, issue.7, pp.749-759, 2010.
DOI : 10.1080/14697680500244411

A. Gopalan and S. Mannor, Thompson sampling for learning parameterized Markov decision processes, Proceedings of The 28th Conference on Learning Theory, pp.2015-2039

R. Grinold, Signal weighting. The Journal of Portfolio Management, pp.24-34

O. Guéant, Optimal execution and block trade pricing: a general framework. arXiv preprint arXiv:1210, p.106, 2012.

G. Huberman and W. Stanzl, Price Manipulation and Quasi-Arbitrage, Econometrica, vol.72, issue.4, pp.1247-1275, 2004.
DOI : 10.1111/j.1468-0262.2004.00531.x

V. Ionescu, C. Oara, and M. Weiss, General matrix pencil techniques for the solution of algebraic Riccati equations: a unified approach, IEEE Transactions on Automatic Control, vol.42, issue.8, pp.1085-1097, 1997.
DOI : 10.1109/9.618238

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, J. Mach. Learn. Res, vol.11, issue.66, pp.1563-1600, 2010.

M. C. Jensen, F. Black, and M. S. Scholes, The capital asset pricing model: Some empirical tests, p.106, 1972.

K. Jun, A. Bhargava, R. Nowak, and R. Willett, Scalable generalized linear bandits: Online computation and hashing. arXiv preprint, pp.2017-2037

J. Kallsen and J. Muhle-Karbe, The general structure of optimal investment and consumption with small transaction costs. Swiss Finance Institute Research Paper, pp.13-15

E. Kaufmann, N. Korda, and R. Munos, Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, Proceedings of the 23rd International Conference on Algorithmic Learning Theory, pp.199-213, 2012.
DOI : 10.1007/978-3-642-34106-9_18

URL : https://hal.archives-ouvertes.fr/hal-02286442

J. Klamka, Controllability of dynamical systems, Mathematica Applicanda, vol.36, issue.50/09, pp.57-75
DOI : 10.14708/ma.v36i50/09.1502

N. Korda, E. Kaufmann, and R. Munos, Thompson sampling for 1-dimensional exponential family bandits, Advances in Neural Information Processing Systems 26, pp.1448-1456

S. G. Krantz and H. R. Parks, The implicit function theorem: history, theory, and applications, p.92, 2012.

P. R. Kumar and P. Varaiya, Stochastic systems: Estimation, identification, and adaptive control, SIAM, p.114, 2015.
DOI : 10.1137/1.9781611974263

A. Kyle, Continuous Auctions and Insider Trading, Econometrica, vol.53, issue.6, pp.1315-1335, 1985.
DOI : 10.2307/1913210

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, issue.1, pp.4-22, 1985.
DOI : 10.1016/0196-8858(85)90002-8

URL : https://doi.org/10.1016/0196-8858(85)90002-8

P. Lancaster and L. Rodman, Algebraic Riccati equations, pp.25-110, 1995.

J. de Lataillade, C. Deremble, M. Potters, and J. Bouchaud, Optimal trading with linear costs. arXiv preprint, p.106, 2012.

T. Lattimore and C. Szepesvári, The end of optimism? An asymptotic analysis of finite-armed linear bandits, Artificial Intelligence and Statistics, pp.728-737

A. J. Laub, A Schur method for solving algebraic Riccati equations, IEEE Transactions on Automatic Control, vol.24, issue.6, pp.913-921, 1979.
DOI : 10.1109/tac.1979.1102178

URL : http://dspace.mit.edu/bitstream/1721.1/1301/1/R-0859-05666488.pdf

A. J. Laub, Invariant Subspace Methods for the Numerical Solution of Riccati Equations, The Riccati Equation, pp.163-196, 1991.
DOI : 10.1007/978-3-642-58223-3_7

L. Li, W. Chu, J. Langford, and R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, Proceedings of the 19th international conference on World wide web, WWW '10, pp.661-670
DOI : 10.1145/1772690.1772758

URL : http://www.cs.rutgers.edu/~lihong/pub/Li10Contextual.pdf

L. Li, Y. Lu, and D. Zhou, Provable optimal algorithms for generalized linear contextual bandits

S. Li, Concise Formulas for the Area and Volume of a Hyperspherical Cap, Asian Journal of Mathematics & Statistics, vol.4, issue.1, pp.66-70
DOI : 10.3923/ajms.2011.66.70

URL : https://scialert.net/qredirect.php?doi=ajms.2011.66.70&linkid=pdf

O. Maillard, R. Munos, and G. Stoltz, A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. arXiv preprint, p.12, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00574987

H. Markowitz, Portfolio selection, The Journal of Finance, pp.77-91, 1952.

I. Mastromatteo, B. Tóth, and J. Bouchaud, Agent-based models for latent liquidity and concave price impact, Physical Review E, vol.90, issue.4, p.042805, 2014.
DOI : 10.1080/14697688.2012.756146

B. C. May, N. Korda, A. Lee, and D. S. Leslie, Optimistic Bayesian sampling in contextual-bandit problems, The Journal of Machine Learning Research, vol.13, issue.1, pp.2069-2106

B. P. Molinari, The stabilizing solution of the discrete algebraic Riccati equation, IEEE Transactions on Automatic Control, vol.20, issue.126, pp.396-399, 1975.

L. Moreau, J. Muhle-Karbe, and H. M. Soner, Trading with small price impact. Swiss Finance Institute Research Paper, pp.14-17

A. J. Morton and S. R. Pliska, Optimal Portfolio Management with Fixed Transaction Costs, Mathematical Finance, vol.15, issue.4, pp.337-356, 1995.
DOI : 10.1016/0022-0531(71)90038-X

C. Niculescu and L. Persson, Convex functions and their applications: a contemporary approach, p.56, 2006.

A. Obizhaeva and J. Wang, Optimal trading strategy and supply/demand dynamics, Journal of Financial Markets, vol.16, issue.1, pp.1-32
DOI : 10.1016/j.finmar.2012.09.001

I. Osband and B. Van Roy, Near-optimal reinforcement learning in factored MDPs, Advances in Neural Information Processing Systems 27, pp.604-612

I. Osband and B. Van Roy, Model-based reinforcement learning and the eluder dimension, Advances in Neural Information Processing Systems 27, pp.1466-1474

I. Osband and B. Van Roy, Posterior sampling for reinforcement learning without episodes. arXiv preprint, pp.59-61

I. Osband and B. Van Roy, On optimistic versus randomized exploration in reinforcement learning, pp.2017-2041

I. Osband, B. Van Roy, and D. Russo, (More) efficient reinforcement learning via posterior sampling, Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pp.3003-3011

B. Park and B. Van Roy, Adaptive Execution: Exploration and Learning of Price Impact, Operations Research, vol.63, issue.5, pp.1058-1076, 2015.
DOI : 10.1287/opre.2015.1415

L. E. Payne and H. F. Weinberger, An optimal Poincaré inequality for convex domains, Archive for Rational Mechanics and Analysis, pp.286-292, 1960.
DOI : 10.1007/bf00252910

J. W. Polderman, On the necessity of identifying the true parameter in adaptive LQ control, Systems & Control Letters, vol.8, issue.2, pp.87-91, 1986.
DOI : 10.1016/0167-6911(86)90065-4

M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming, pp.2014-2036
DOI : 10.1002/9780470316887

D. Russo, D. Tse, and B. Van Roy, Time-sensitive bandit learning and satisficing Thompson sampling. arXiv preprint, pp.2017-2032

W. F. Sharpe, Capital asset prices: A theory of market equilibrium under conditions of risk, The Journal of Finance, pp.425-442, 1964.

M. J. Strens, A Bayesian framework for reinforcement learning, Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pp.943-950, 2000.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, IEEE Transactions on Neural Networks, vol.9, issue.5, p.20, 1998.
DOI : 10.1109/TNN.1998.712192

M. Taksar, M. J. Klass, and D. Assaf, A Diffusion Model for Optimal Portfolio Selection in the Presence of Brokerage Fees, Mathematics of Operations Research, vol.13, issue.2, pp.277-294, 1988.
DOI : 10.1287/moor.13.2.277

W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, issue.3-4, pp.285-294, 1933.

P. Van Dooren, A Generalized Eigenvalue Approach for Solving Riccati Equations, SIAM Journal on Scientific and Statistical Computing, vol.2, issue.2, pp.121-135, 1981.
DOI : 10.1137/0902010

Y. Wang, J. Audibert, and R. Munos, Algorithms for infinitely many-armed bandits, Advances in Neural Information Processing Systems, pp.1729-1736, 2009.

H. K. Wimmer, On the algebraic Riccati equation, Bulletin of the Australian Mathematical Society, vol.72, issue.03, pp.441-452, 1984.
DOI : 10.1137/0125020