We prove an extended result that holds when $\|\cdot\|_U$ and $\|\cdot\|_V$ are more general mixed $(\ell_p \times \ell_q)$-norms.

Bibliography

H. Akaike, Information theory and an extension of the maximum likelihood principle, pp.199-213, 1998.

G. Arfken, Divergence, Mathematical Methods for Physicists, pp.37-42, 1985.

A. Argyriou, T. Evgeniou, and M. Pontil, Convex multi-task feature learning, Machine Learning, vol.73, pp.243-272, 2008.

D. Babichev and F. Bach, Constant step size stochastic gradient descent for probabilistic modeling, Proceedings in Uncertainty in Artificial Intelligence, pp.219-228, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01929810

D. Babichev and F. Bach, Slice inverse regression with score functions, Electronic Journal of Statistics, vol.12, issue.1, pp.1507-1543, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01388498

F. Bach, Sharp analysis of low-rank kernel matrix approximations, Conference on Learning Theory, pp.185-209, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00723365

F. Bach and E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems (NIPS), 2011.
URL : https://hal.archives-ouvertes.fr/hal-00608041

F. Bach and E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), Advances in Neural Information Processing Systems (NIPS), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00831977

H. H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Mathematics of Operations Research, vol.42, issue.2, pp.330-348, 2016.

A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters, vol.31, issue.3, pp.167-175, 2003.

A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, vol.2, issue.1, pp.183-202, 2009.

S. Ben-David, N. Eiron, and P. M. Long, On the difficulty of approximately maximizing agreements, Journal of Computer and System Sciences, vol.66, issue.3, pp.496-514, 2003.

D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.

C. M. Bishop, Pattern Recognition and Machine Learning, 2006.

M. Blondel, A. F. Martins, and V. Niculae, Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms, 2018.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou, Fast kernel classifiers with online and active learning, Journal of Machine Learning Research, vol.6, pp.1579-1619, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00752361

J. Borwein and A. S. Lewis, Convex analysis and nonlinear optimization: theory and examples, 2010.

L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, 2016.

S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00794821

D. R. Brillinger, A Generalized Linear Model with 'Gaussian' Regressor Variables, 1982.

S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends® in Machine Learning, vol.8, pp.231-357, 2015.

S. Cambanis, S. Huang, and G. Simons, On the Theory of Elliptically Contoured Distributions, Journal of Multivariate Analysis, vol.11, issue.3, pp.368-385, 1981.

A. Caponnetto and E. De Vito, Optimal rates for the regularized least-squares algorithm, Foundations of Computational Mathematics, vol.7, issue.3, pp.331-368, 2007.

G. Casella and R. L. Berger, Statistical inference, vol.2, 2002.

C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, One billion word benchmark for measuring progress in statistical language modeling, 2013.

K. L. Clarkson, E. Hazan, and D. P. Woodruff, Sublinear optimization for machine learning, Journal of the ACM (JACM), vol.59, issue.5, p.23, 2012.

R. D. Cook, SAVE: a method for dimension reduction and graphics in regression, Communications in Statistics - Theory and Methods, vol.29, pp.2109-2121, 2000.

R. D. Cook and H. Lee, Dimension Reduction in Binary Response Regression, Journal of the American Statistical Association, vol.94, pp.1187-1200, 1999.

R. D. Cook and S. Weisberg, Discussion of 'Sliced Inverse Regression', Journal of the American Statistical Association, vol.86, pp.328-332, 1991.

C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol.20, issue.3, pp.273-297, 1995.

A. S. Dalalyan, A. Juditsky, and V. Spokoiny, A new algorithm for estimating the effective dimension-reduction subspace, Journal of Machine Learning Research, vol.9, pp.1647-1678, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00128129

A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems, pp.1646-1654, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01016843

A. Dieuleveut and F. Bach, Nonparametric stochastic approximation with large stepsizes, The Annals of Statistics, vol.44, issue.4, pp.1363-1399, 2016.

A. Dieuleveut, A. Durmus, and F. Bach, Bridging the gap between constant step size stochastic gradient descent and Markov chains, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01565514

W. F. Donoghue Jr., Monotone Matrix Functions and Analytic Continuation, 1974.

N. Duan and K. Li, Slicing regression: a link-free regression method, The Annals of Statistics, vol.19, pp.505-530, 1991.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, Composite objective mirror descent, COLT, pp.14-26, 2010.

V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu, Agnostic learning of monomials by halfspaces is hard, SIAM Journal on Computing, vol.41, issue.6, pp.1558-1590, 2012.

K. Fukumizu, F. R. Bach, and M. I. Jordan, Kernel dimension reduction in regression, The Annals of Statistics, vol.37, issue.4, pp.1871-1905, 2009.

D. Garber and E. Hazan, Approximating semidefinite programs in sublinear time, Advances in Neural Information Processing Systems, pp.1080-1088, 2011.

D. Garber and E. Hazan, Sublinear time algorithms for approximate semidefinite programming, Mathematical Programming, vol.158, issue.1-2, pp.329-361, 2016.

W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov chain Monte Carlo in practice, 1995.

A. S. Goldberger, Econometric Theory, 1964.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.

M. D. Grigoriadis and L. G. Khachiyan, A sublinear-time randomized approximation algorithm for matrix games, Operations Research Letters, vol.18, issue.2, pp.53-58, 1995.

L. Györfi, M. Kohler, A. Krzyzak, and H. Walk, A distribution-free theory of nonparametric regression, Springer Series in Statistics, 2002.

L. P. Hansen, Large sample properties of generalized method of moments estimators, Econometrica: Journal of the Econometric Society, pp.1029-1054, 1982.

E. Hazan, T. Koren, and N. Srebro, Beating SGD: Learning SVMs in sublinear time, Advances in Neural Information Processing Systems, pp.1233-1241, 2011.

N. He, A. Juditsky, and A. Nemirovski, Mirror prox algorithm for multiterm composite minimization and semi-separable problems, Computational Optimization and Applications, vol.61, issue.2, pp.275-319, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01335905

J. M. Hilbe, Negative binomial regression, 2011.

J. Hooper, Simultaneous Equations and Canonical Correlation Theory, Econometrica, vol.27, pp.245-256, 1959.

J. L. Horowitz, Semiparametric methods in econometrics, vol.131, 2012.

M. Hristache, A. Juditsky, and V. Spokoiny, Direct estimation of the index coefficient in a single index model, The Annals of Statistics, vol.29, issue.3, pp.595-623, 2001.

T. Hsing and R. J. Carroll, An asymptotic theory for sliced inverse regression, The Annals of Statistics, vol.20, issue.2, pp.1040-1061, 1992.

A. Hyvärinen, Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research, vol.6, pp.695-709, 2005.

A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, vol.46, 2004.

M. Janzamin, H. Sedghi, and A. Anandkumar, Score function features for discriminative learning: Matrix and tensor framework, 2014.

M. Janzamin, H. Sedghi, and A. Anandkumar, Generalization Bounds for Neural Networks through Tensor Factorization, 2015.

R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems, pp.315-323, 2013.

A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. Optimization for Machine Learning, pp.121-148, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00981863

A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problem's structure. Optimization for Machine Learning, pp.149-183, 2011.

A. Juditsky, A. Nemirovski, and C. Tauvel, Solving variational inequalities with stochastic mirror-prox algorithm, Stochastic Systems, vol.1, issue.1, pp.17-58, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00318043

J. H. B. Kemperman, On the optimum rate of transmitting information, Probability and Information Theory, pp.126-169, 1969.

D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning, 2009.

G. M. Korpelevich, Extragradient method for finding saddle points and other problems, Matekon, vol.13, issue.4, pp.35-49, 1977.

J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proc. ICML, 2001.

G. Lan, An optimal method for stochastic composite optimization, Mathematical Programming, vol.133, issue.1-2, pp.365-397, 2012.

B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, The Annals of Statistics, vol.28, issue.5, pp.1302-1338, 2000.

L. Le Cam, On some asymptotic properties of maximum likelihood estimates and related Bayes estimates, Univ. California Pub. Statist, vol.1, pp.277-330, 1953.

E. L. Lehmann and G. Casella, Theory of point estimation, 2006.

K. Li, Sliced Inverse Regression for Dimensional Reduction, Journal of the American Statistical Association, vol.86, pp.316-327, 1991.

K. Li, On Principal Hessian Directions for Data Visualization and Dimension Reduction: Another Application of Stein's Lemma, Journal of the American Statistical Association, vol.87, pp.1025-1039, 1992.

K. Li and N. Duan, Regression analysis under link violation, The Annals of Statistics, vol.17, pp.1009-1052, 1989.

M. Lichman, UCI machine learning repository, 2013.

Q. Lin, Z. Zhao, and J. S. Liu, On consistency and sparsity for sliced inverse regression in high dimensions, The Annals of Statistics, vol.46, issue.2, pp.580-610, 2018.

P. McCullagh, Generalized linear models, European Journal of Operational Research, vol.16, issue.3, pp.285-292, 1984.

P. McCullagh and J. A. Nelder, Generalized linear models, vol.37, 1989.

A. M. McDonald, M. Pontil, and S. Stamos, Spectral k-support norm regularization, Advances in Neural Information Processing Systems, 2014.

S. P. Meyn and R. L. Tweedie, Markov chains and stochastic stability, 1993.

J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France, vol.93, issue.2, pp.273-299, 1965.

K. P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.

A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization, vol.15, issue.1, pp.229-251, 2004.

A. Nemirovski, S. Onn, and U. G. Rothblum, Accuracy certificates for computational problems with convex structure, Mathematics of Operations Research, vol.35, issue.1, pp.52-78, 2010.

A. Nemirovsky and D. Yudin, Problem complexity and method efficiency in optimization, 1983.

Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, vol.103, pp.127-152, 2005.

Y. Nesterov, Gradient methods for minimizing composite objective function, 2007.

Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol.87, 2013.

Y. Nesterov and A. Nemirovski, On first-order algorithms for ℓ1/nuclear norm minimization, Acta Numerica, vol.22, pp.509-575, 2013.

Y. E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR, vol.269, pp.543-547, 1983.

D. Ostrovskii and Z. Harchaoui, Efficient first-order algorithms for adaptive signal denoising, Proceedings of the 35th ICML conference, vol.80, pp.3946-3955, 2018.

B. Palaniappan and F. Bach, Stochastic variance reduction methods for saddle-point problems, Advances in Neural Information Processing Systems, pp.1416-1424, 2016.

I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras et al., LSHTC: A benchmark for large-scale text classification, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01691460

B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization, vol.30, issue.4, pp.838-855, 1992.

C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, 2006.

H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, vol.22, pp.400-407, 1951.

R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, vol.14, issue.5, pp.877-898, 1976.

R. T. Rockafellar, Convex analysis, 2015.

A. Rudi, L. Carratino, and L. Rosasco, Falkon: An optimal large scale kernel method, Advances in Neural Information Processing Systems, pp.3891-3901, 2017.

M. Schmidt, N. Le Roux, and F. Bach, Minimizing finite sums with the stochastic average gradient, Mathematical Programming, vol.162, issue.1-2, pp.83-112, 2017.
URL : https://hal.archives-ouvertes.fr/hal-00860051

B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 2001.

S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms, 2014.

S. Shalev-Shwartz and T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization, Journal of Machine Learning Research, vol.14, pp.567-599, 2013.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal estimated sub-gradient solver for SVM, Mathematical Programming, vol.127, pp.3-30, 2011.

A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on stochastic programming: modeling and theory, 2009.

J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004.

Z. Shi, X. Zhang, and Y. Yu, Bregman divergence for stochastic variance reduction: saddle-point and adversarial prediction, Advances in Neural Information Processing Systems, pp.6031-6041, 2017.

S. Sra, Fast projections onto mixed-norm balls with applications, Data Mining and Knowledge Discovery, vol.25, issue.2, pp.358-377, 2012.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf, Injective Hilbert space embeddings of probability measures, Proc. COLT, 2008.

C. M. Stein, Estimation of the Mean of a Multivariate Normal Distribution, The Annals of Statistics, vol.9, pp.1135-1151, 1981.

G. W. Stewart and J. Sun, Matrix perturbation theory (computer science and scientific computing), 1990.

T. M. Stoker, Consistent estimation of scaled coefficients, Econometrica, vol.54, pp.1461-1481, 1986.

A. B. Tsybakov, Introduction to Nonparametric Estimation, 2009.

V. Q. Vu and J. Lei, Minimax sparse principal subspace estimation in high dimensions, The Annals of Statistics, vol.41, issue.6, pp.2905-2947, 2013.

H. Wang and Y. Xia, On directional regression for dimension reduction, Journal of the American Statistical Association, 2007.

H. Wang and Y. Xia, Sliced regression for dimension reduction, Journal of the American Statistical Association, vol.103, issue.482, pp.811-821, 2008.

C. K. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems, pp.682-688, 2001.

D. P. Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends® in Theoretical Computer Science, vol.10, issue.1-2, pp.1-157, 2014.

Y. Xia, H. Tong, W. K. Li, and L. Zhu, An adaptive estimation of dimension reduction space, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.64, issue.3, pp.363-410, 2002.

L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, vol.11, pp.2543-2596, 2010.

L. Xiao, A. W. Yu, Q. Lin, and W. Chen, DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization, 2017.

S. S. Yang, General distribution theory of the concomitants of order statistics, The Annals of Statistics, vol.5, pp.996-1002, 1977.

Y. Yu, T. Wang, and R. J. Samworth, A useful variant of the Davis-Kahan theorem for statisticians, Biometrika, vol.102, issue.2, pp.315-323, 2015.

Y. Yu, The strong convexity of von Neumann's entropy. Unpublished note, 2013.

M. Yuan, On the identifiability of additive index models, Statistica Sinica, vol.21, issue.4, pp.1901-1911, 2011.

L. Zhu and K. W. Ng, Asymptotics of sliced inverse regression, Statistica Sinica, vol.5, pp.727-736, 1995.
