A. Agarwal, A. Rakhlin, and P. Bartlett, Matrix regularization techniques for online multitask learning

T. Alumae and M. Kurimo, Efficient estimation of maximum entropy language models with N-gram features: an SRILM extension, Proceedings of the Annual Conference of the International Speech Communication Association

Y. Amit, M. Fink, N. Srebro, and S. Ullman, Uncovering shared structures in multiclass classification, Proceedings of the 24th international conference on Machine learning, ICML '07
DOI : 10.1145/1273496.1273499

G. Andrew and J. Gao, Scalable training of ℓ1-regularized log-linear models, Proceedings of the International Conference on Machine Learning

M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations
DOI : 10.1017/CBO9780511624216

A. Argyriou, T. Evgeniou, and M. Pontil, Multi-task feature learning, Advances in Neural Information Processing Systems

A. Argyriou, T. Evgeniou, and M. Pontil, Convex multi-task feature learning, Machine Learning, vol.73, issue.3

R. Babbar, I. Partalas, É. Gaussier, and M. Amini, On flat versus hierarchical classification in large-scale taxonomies, Advances in Neural Information Processing Systems, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01118815

F. Bach, Bolasso: model consistent Lasso estimation through the bootstrap, Proceedings of the 25th International Conference on Machine Learning, ICML '08
DOI : 10.1145/1390156.1390161

URL : https://hal.archives-ouvertes.fr/hal-00271289

F. Bach, Consistency of the group lasso and multiple kernel learning, Journal of Machine Learning Research, vol.9
URL : https://hal.archives-ouvertes.fr/hal-00164735

F. Bach and Z. Harchaoui, Diffrac: a discriminative and flexible framework for clustering, Advances in Neural Information Processing Systems

F. Bach and M. Jordan, Predictive low-rank decomposition for kernel methods, Proceedings of the 22nd international conference on Machine learning , ICML '05
DOI : 10.1145/1102351.1102356

F. Bach, J. Mairal, and J. Ponce, Convex sparse matrix factorizations, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00345747

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with Sparsity-Inducing Penalties, Foundations and Trends in Machine Learning
DOI : 10.1561/2200000015

URL : https://hal.archives-ouvertes.fr/hal-00613125

L. R. Bahl, F. Jelinek, and R. L. Mercer, A maximum likelihood approach to continuous speech recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.5, issue.2, pp.179-190, 1983.

S. Banerjee and A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, Model-Based Compressive Sensing, IEEE Transactions on Information Theory, vol.56, issue.4
DOI : 10.1109/TIT.2010.2040894

E. Bart, M. Welling, and P. Perona, Unsupervised Organization of Image Collections: Taxonomies and Beyond, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, issue.11, pp.2302-2315, 2011.
DOI : 10.1109/TPAMI.2011.79

P. L. Bartlett, M. I. Jordan, and J. D. Mcauliffe, Large margin classifiers: convex loss, low noise, and convergence rates, Advances in Neural Information Processing Systems, 2003.

A. Beck and M. Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, SIAM Journal on Imaging Sciences, vol.2, issue.1
DOI : 10.1137/080716542

T. C. Bell, J. G. Cleary, and I. H. Witten, Text compression

M. Ben-Akiva and S. R. Lerman, Discrete Choice Analysis: Theory and Application to Travel Demand, 1985.

L. Benaroya, F. Bimbot, and R. Gribonval, Audio source separation with a single sensor, IEEE Transactions on Audio, Speech and Language Processing, vol.14, issue.1
DOI : 10.1109/TSA.2005.854110

URL : https://hal.archives-ouvertes.fr/inria-00544949

S. Bengio, J. Weston, and D. Grangier, Label embedding trees for large multi-class tasks, Advances in Neural Information Processing Systems

Y. Bengio and J. Senecal, Quick training of probabilistic neural nets by importance sampling, Proceedings of the Conference on Artificial Intelligence and Statistics

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, Neural Probabilistic Language Models, Journal of Machine Learning Research, vol.3
DOI : 10.1007/3-540-33486-6_6

URL : https://hal.archives-ouvertes.fr/hal-01434258

A. Berger, S. Della Pietra, and V. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, vol.22, issue.1, 1996.

D. P. Bertsekas, Nonlinear programming, Athena Scientific, 2nd edition

C. M. Bishop, Pattern recognition and machine learning

Y. Boureau, F. Bach, Y. Lecun, and J. Ponce, Learning mid-level features for recognition, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5539963

S. Boyd, C. Cortes, M. Mohri, and A. Radovanovic, Accuracy at the top, Advances in Neural Information Processing Systems

T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, Large language models in machine translation, Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

L. Breiman, J. Friedman, C. Stone, and R. Olshen, Classification and regression trees

P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. Della-pietra, and J. C. Lai, Class-based n-gram models of natural language, Computational Linguistics, vol.18, issue.4, pp.467-479, 1992.

P. Brucker, An O(n) algorithm for quadratic knapsack problems, Operations Research Letters

E. J. Candes and M. Wakin, An Introduction To Compressive Sampling, IEEE Signal Processing Magazine, vol.25, issue.2
DOI : 10.1109/MSP.2007.914731

A. Chambolle, R. A. Devore, N. Y. Lee, and B. J. Lucier, Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage, IEEE Transactions on Image Processing, vol.7, issue.3
DOI : 10.1109/83.661182

S. F. Chen, Performance prediction for exponential language models, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics on, NAACL '09
DOI : 10.3115/1620754.1620820

S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Proceedings of the Association for Computational Linguistics

S. F. Chen and R. Rosenfeld, A Gaussian prior for smoothing maximum entropy models

S. F. Chen and R. Rosenfeld, A survey of smoothing techniques for maximum entropy models, IEEE Transactions on Speech and Audio Processing

S. F. Chen, D. Beeferman, and R. Rosenfeld, Evaluation metrics for language models, DARPA Broadcast News Transcription and Understanding Workshop, vol.1

S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing

W. Chen, T. Liu, Y. Lan, Z. Ma, and H. Li, Ranking measures and loss functions in learning to rank, Advances in Neural Information Processing Systems

C. Chesneau and M. Hebiri, Some theoretical results on the Grouped Variables Lasso, Mathematical Methods of Statistics, vol.17, issue.4
DOI : 10.3103/S1066530708040030

URL : https://hal.archives-ouvertes.fr/hal-00145160

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and S. Clifford, Introduction to algorithms

F. Couzinie-devy, J. Mairal, F. Bach, and J. Ponce, Dictionary learning for deblurring and digital zoom
URL : https://hal.archives-ouvertes.fr/inria-00627402

K. Crammer and Y. Singer, On the learnability and design of output codes for multiclass problems, Machine Learning, vol.47, issue.2-3

J. N. Darroch and D. Ratcliff, Generalized Iterative Scaling for Log-Linear Models, The Annals of Mathematical Statistics, vol.43, issue.5
DOI : 10.1214/aoms/1177692379

A. d'Aspremont, F. Bach, and L. El Ghaoui, Optimal solutions for sparse principal component analysis, Journal of Machine Learning Research, vol.9

I. Daubechies, M. Defrise, and C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol.41, issue.6, pp.391-407, 1990.
DOI : 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

S. Della Pietra and V. Della Pietra, Statistical modeling by maximum entropy, unpublished report

S. Della Pietra, V. Della Pietra, and J. Lafferty, Inducing features of random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.19, issue.4, pp.380-393, 1997.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., Imagenet: A large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

J. Deng, S. Satheesh, A. C. Berg, and L. Fei-fei, Hedging your bets: Optimizing accuracyspecificity trade-offs in large scale visual recognition, Advances in Neural Information Processing Systems

L. Devroye, L. Gyorfi, and G. Lugosi, A probabilistic theory of pattern recognition
DOI : 10.1007/978-1-4612-0711-5

P. S. Dhillon, D. Foster, and L. Ungar, Multi-view learning of word embeddings via cca, Advances in Neural Information Processing Systems

C. B. Do, C. Foo, and A. Y. Ng, Efficient multiple hyperparameter learning for log-linear models, Advances in Neural Information Processing System

J. Duchi and Y. Singer, Efficient online and batch learning using forward-backward splitting, Journal of Machine Learning Research

G. Dunn and B. Everitt, An introduction to mathematical taxonomy

R. C. Edgar and K. Sjolander, Simultaneous sequence alignment and tree construction using hidden Markov models, Biocomputing 2003
DOI : 10.1142/9789812776303_0018

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics, vol.32, issue.2

J. Eisenstein, N. A. Smith, and E. P. Xing, Discovering sociolinguistic associations with structured sparsity, Proceedings of the Association for Computational Linguistics

M. Elad and M. Aharon, Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries, IEEE Transactions on Image Processing, vol.15, issue.12
DOI : 10.1109/TIP.2006.881969

A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian et al., Every Picture Tells a Story: Generating Sentences from Images, Proceedings of the European Conference on Computer Vision
DOI : 10.1007/978-3-642-15561-1_2

M. Fazel, H. Hindi, and S. P. Boyd, A rank minimization heuristic with application to minimum order system approximation, Proceedings of the 2001 American Control Conference. (Cat. No.01CH37148), 2001.
DOI : 10.1109/ACC.2001.945730

S. Fine and K. Scheinberg, Efficient svm training using low-rank kernel representations, Journal of Machine Learning Research, vol.2, issue.2

J. R. Finkel, A. Kleeman, and C. D. Manning, Efficient, feature-based, conditional random field parsing, Proceedings of the Association for Computational Linguistics

A. Frank, On Kuhn's Hungarian method: a tribute from Hungary, Naval Research Logistics, vol.52, issue.1
DOI : 10.1002/nav.20056

J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, Pathwise coordinate optimization, The Annals of Applied Statistics, vol.1, issue.2
DOI : 10.1214/07-AOAS131

W. J. Fu, Penalized regressions: the bridge vs. the lasso, Journal of Computational and Graphical Statistics, vol.7, issue.3

W. A. Gale and K. W. Church, Estimation procedures for language context: poor estimates are worse than none, Proceedings of the Symposium on Computational Statistics

W. A. Gale and K. W. Church, What's wrong with adding one?, Proceedings of the Conference on Corpus-Based Research into Language

T. Gao and D. Koller, Discriminative learning of relaxed hierarchy for large-scale visual recognition, Proceedings of the International Conference on Computer Vision

S. Ghosal and N. L. Hjort, The Dirichlet process, related priors and posterior asymptotics, Bayesian Nonparametrics
DOI : 10.1017/CBO9780511802478.003

R. Giegerich and S. Kurtz, From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction, Algorithmica, vol.19, issue.3
DOI : 10.1007/PL00009177

M. X. Goemans and D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM, vol.42, issue.6, pp.1115-1145, 1995.
DOI : 10.1145/227683.227684

S. Goldwater, T. L. Griffiths, and M. Johnson, Interpolating between types and tokens by estimating power-law generators, Advances in Neural Information Processing Systems, 2006.

S. Goldwater, T. L. Griffiths, and M. Johnson, Producing power-law distributions and damping word frequencies with two-stage language models, Journal of Machine Learning Research

G. H. Golub and C. F. Van-loan, Matrix computations, 1996.

I. J. Good, The population frequencies of species and the estimation of population parameters, Biometrika

J. Goodman, Classes for fast maximum entropy training, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221)
DOI : 10.1109/ICASSP.2001.940893

J. Goodman, Sequential conditional Generalized Iterative Scaling, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics , ACL '02
DOI : 10.3115/1073083.1073086

J. Goodman, Exponential priors for maximum entropy models, Proceedings of the Association for Computational Linguistics

A. Gramfort, M. Kowalski, and M. Hamalainen, Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods, Physics in Medicine and Biology, vol.57, issue.7
DOI : 10.1088/0031-9155/57/7/1937

URL : https://hal.archives-ouvertes.fr/hal-00690774

G. Griffin and P. Perona, Learning and using taxonomies for fast visual categorization, 2008 IEEE Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/CVPR.2008.4587410

A. Gupta, P. Srinivasan, J. Shi, and L. Davis, Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos, 2009 IEEE Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/CVPR.2009.5206492

Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick, Large-scale image classification with trace-norm regularization, 2012 IEEE Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/CVPR.2012.6248078

URL : https://hal.archives-ouvertes.fr/hal-00728388

T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data mining, inference, and prediction, second edition

J. Haupt and R. Nowak, Compressive Sampling Vs. Conventional Imaging, 2006 International Conference on Image Processing
DOI : 10.1109/ICIP.2006.312576

L. He and L. Carin, Exploiting structure in wavelet-based bayesian compressive sensing, IEEE Transaction on Signal Processing

C. Hu, J. T. Kwok, and W. Pan, Accelerated gradient methods for stochastic optimization and online learning, Advances in Neural Information Processing Systems

F. Huang, C. Hsieh, K. Chang, and C. Lin, Iterative scaling and coordinate descent methods for maximum entropy models, Journal of Machine Learning Research, vol.11, pp.815-848, 2010.

J. Huang and T. Zhang, The benefit of group sparsity, The Annals of Statistics, vol.38, issue.4, pp.1978-2004, 2010.
DOI : 10.1214/09-AOS778

J. Huang, T. Zhang, and D. Metaxas, Learning with structured sparsity, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09
DOI : 10.1145/1553374.1553429

M. J. Hunt, Figures of merit for assessing connected-word recognizers, Speech Communication, vol.9, issue.4

H. Jeffreys, Theory of probability

F. Jelinek and R. L. Mercer, Interpolated estimation of Markov source parameters from sparse data, Proceedings of the Workshop on Pattern Recognition in Practice

R. Jenatton, J. Audibert, and F. Bach, Structured variable selection with sparsity-inducing norms, Journal of Machine Learning Research
URL : https://hal.archives-ouvertes.fr/inria-00377732

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach, Proximal methods for hierarchical sparse coding, Journal of Machine Learning Research
URL : https://hal.archives-ouvertes.fr/inria-00516723

R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, E. Eger et al., Multi-scale Mining of fMRI Data with Hierarchical Structured Sparsity, 2011 International Workshop on Pattern Recognition in NeuroImaging
DOI : 10.1109/PRNI.2011.15

URL : https://hal.archives-ouvertes.fr/inria-00589785

R. Jenatton, N. Le-roux, A. Bordes, and G. Obozinski, A latent factor model for highly multi-relational data, Advances in Neural Information Processing Systems
URL : https://hal.archives-ouvertes.fr/hal-00776335

S. Ji, D. Dunson, and L. Carin, Multitask Compressive Sensing, IEEE Transactions on Signal Processing, vol.57, issue.1
DOI : 10.1109/TSP.2008.2005866

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.183.908

W. Jiang, Process consistency for AdaBoost, The Annals of Statistics, vol.32, issue.1
DOI : 10.1214/aos/1079120128

A. Joulin, F. Bach, and J. Ponce, Efficient optimization for discriminative latent class models, Advances in Neural Information Processing Systems

M. Journee, F. Bach, P. Absil, and R. Sepulchre, Low-Rank Optimization on the Cone of Positive Semidefinite Matrices, SIAM Journal on Optimization, vol.20, issue.5
DOI : 10.1137/080731359

D. Jurafsky and J. H. Martin, Speech and Language Processing, 2008.

S. M. Katz, Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.35, issue.3, pp.400-401, 1987.
DOI : 10.1109/TASSP.1987.1165125

J. Kazama and J. Tsujii, Evaluation and extension of maximum entropy models with inequality constraints, Proceedings of the 2003 conference on Empirical methods in natural language processing -
DOI : 10.3115/1119355.1119373

C. R. Kennington, M. Kay, and A. Friedrich, Suffix trees as language models, Language Resources and Evaluation Conference

S. Khudanpur, A method of maximum entropy estimation with relaxed constraints, Proceedings of the Johns Hopkins University Language Modeling Workshop

S. Kim and E. P. Xing, Tree-guided group lasso for multi-task regression with structured sparsity, Proceedings of the International Conference on Machine Learning

S. Kim and E. P. Xing, Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping, The Annals of Applied Statistics, vol.6, issue.3, pp.1095-1117, 2012.
DOI : 10.1214/12-AOAS549SUPP

R. Kneser and H. Ney, Improved backing-off for M-gram language modeling, 1995 International Conference on Acoustics, Speech, and Signal Processing
DOI : 10.1109/ICASSP.1995.479394

B. Kolaczkowski and J. W. Thornton, Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous, Nature, vol.431, issue.7011
DOI : 10.1080/106351501750435086

V. Koltchinskii and M. Yuan, Sparse recovery in large ensembles of kernel machines, Proceedings of the Conference on Learning Theory

E. V. Koonin and M. Y. Galperin, Sequence - Evolution - Function: Computational Approaches in Comparative Genomics, Kluwer Academic

G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi et al., BabyTalk: Understanding and Generating Simple Image Descriptions, Proceedings of the Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/TPAMI.2012.162

K. Lange, D. R. Hunter, and I. Yang, Optimization Transfer Using Surrogate Objective Functions, Journal of Computational and Graphical Statistics, vol.9, issue.1, pp.1-20, 2000.
DOI : 10.1080/10618600.2000.10474858

R. Lau, Adaptive statistical language modeling, Master's thesis, Massachusetts Institute of Technology

R. Lau, R. Rosenfeld, and S. Roukos, Adaptive language modeling using the maximum entropy principle, Proceedings of the workshop on Human Language Technology , HLT '93, 1993.
DOI : 10.3115/1075671.1075695

H. Lee, A. Battle, R. Raina, and A. Ng, Efficient sparse coding algorithms, Advances in Neural Information Processing Systems

M. Leordeanu, M. Hebert, and R. Sukthankar, Beyond Local Appearance: Category Recognition from Pairwise Interactions of Simple Features, 2007 IEEE Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/CVPR.2007.383091

S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, Composing simple image descriptions using web-scale n-grams, Proceedings of the Conference on Computational Natural Language Learning

G. J. Lidstone, Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities, Transactions of the Faculty of Actuaries, vol.8

C. Lin, ROUGE: A package for automatic evaluation of summaries, Proceedings of the Workshop on Text Summarization Branches Out

K. Linden, Word sense discovery and disambiguation. Department of General Linguistics, Faculty of Arts

K. Lounici, Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators, Electronic Journal of Statistics, vol.2, issue.0
DOI : 10.1214/08-EJS177

URL : https://hal.archives-ouvertes.fr/hal-00222251

K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. Van-de-geer, Taking advantage of sparsity in multi-task learning, Proceedings of the Conference on Computational Learning Theory

G. Lugosi and N. Vayatis, On the bayes risk consistency of regularized boosting methods, Annals of Statistics, vol.1, issue.3
URL : https://hal.archives-ouvertes.fr/hal-00102140

D. J. Mackay and L. C. Peto, A hierarchical Dirichlet language model, Natural Language Engineering, vol.1, issue.3

N. Maculan and J. R. Galdino-de-Paula, A linear-time median-finding algorithm for projecting a vector on the simplex of R^n, Operations Research Letters, vol.8, issue.4, 1989.

J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce, Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation, Proceedings of the European Conference on Computer Vision (ECCV)
DOI : 10.1007/978-3-540-88690-7_4

J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online dictionary learning for sparse coding, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09
DOI : 10.1145/1553374.1553463

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, Non-local sparse models for image restoration, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459452

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, Supervised dictionary learning, Advances in Neural Information Processing Systems
URL : https://hal.archives-ouvertes.fr/inria-00322431

J. Mairal, F. Bach, and J. Ponce, Task-Driven Dictionary Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.4
DOI : 10.1109/TPAMI.2011.156

URL : https://hal.archives-ouvertes.fr/inria-00521534

R. Malouf, A comparison of algorithms for maximum entropy parameter estimation, Proceedings of the 6th Conference on Natural Language Learning, COLING-02
DOI : 10.3115/1118853.1118871

C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, 1999.

S. Mannor, R. Meir, and T. Zhang, The Consistency of Greedy Algorithms for Classification, Proceedings of the Annual Conference on Computational Learning Theory
DOI : 10.1007/3-540-45435-7_22

H. Markowitz, Portfolio selection, The Journal of Finance, vol.7, issue.1
DOI : 10.1111/j.1540-6261.1952.tb01525.x

A. F. Martins, N. A. Smith, M. Q. Pedro, and M. A. Figueiredo, Structured sparsity in structured prediction, Proceedings of the Conference on Empirical Methods in Natural Language Processing

P. McCullagh and J. A. Nelder, Generalized Linear Models, 1989.

G. Mclachlan, Discriminant analysis and statistical pattern recognition, 1992.
DOI : 10.1002/0471725293

N. Meinshausen, Relaxed Lasso, Computational Statistics & Data Analysis, vol.52, issue.1, pp.374-393, 2008.
DOI : 10.1016/j.csda.2006.12.019

N. Meinshausen and P. Buhlmann, Stability selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.72, issue.4, 2010.
DOI : 10.1111/j.1467-9868.2010.00740.x

T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, Recurrent neural network based language model, Proceedings of the Annual Conference of the International Speech Communication Association

T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. H. Cernocky, Strategies for training large scale neural network language models, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding
DOI : 10.1109/ASRU.2011.6163930

T. Mikolov, W. Yih, and G. Zweig, Linguistic regularities in continuous space word representations, Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

T. Minka, A comparison of numerical optimizers for logistic regression, 2003.

M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos et al., Midge: Generating image descriptions from computer vision detections, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics

A. Mnih and G. Hinton, Three new graphical models for statistical language modelling, Proceedings of the 24th international conference on Machine learning, ICML '07
DOI : 10.1145/1273496.1273577

A. Mnih and G. Hinton, A scalable hierarchical distributed language model, Advances in Neural Information Processing Systems

F. Morin and Y. Bengio, Hierarchical probabilistic neural network language model, Proceedings of the International Conference on Artificial Intelligence and Statistics

Y. Nardi and A. Rinaldo, On the asymptotic properties of the group lasso estimator for linear models, Electronic Journal of Statistics, vol.2, issue.0
DOI : 10.1214/08-EJS200

Y. Nardi and A. Rinaldo, Autoregressive process modeling via the Lasso procedure, Journal of Multivariate Analysis, vol.102, issue.3
DOI : 10.1016/j.jmva.2010.10.012

Y. Nardi and A. Rinaldo, The log-linear group-lasso estimator and its asymptotic properties, Bernoulli, vol.18, issue.3
DOI : 10.3150/11-BEJ364

B. K. Natarajan, Machine learning: A theoretical approach

R. Navigli, Word sense disambiguation, ACM Computing Surveys, vol.41, issue.2, pp.1-69, 2009.
DOI : 10.1145/1459352.1459355

Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady

Y. Nesterov, Gradient methods for minimizing composite objective function

W. Newman, Extension to the maximum entropy model, IEEE Transactions on Information Theory

H. Ney, U. Essen, and R. Kneser, On structuring probabilistic dependences in stochastic language modelling, Computer Speech & Language, vol.8, issue.1, pp.1-38
DOI : 10.1006/csla.1994.1001

J. Nocedal and S. J. Wright, Numerical optimization
DOI : 10.1007/b98874

G. Obozinski, M. J. Wainwright, and M. I. Jordan, High-dimensional union support recovery in multivariate regression, Advances in Neural Information Processing Systems, 2008.

G. Obozinski, L. Jacob, and J. P. Vert, Group lasso with overlaps: the latent group lasso approach, Proceedings of the International Conference on Machine Learning
URL : https://hal.archives-ouvertes.fr/inria-00628498

G. Obozinski, B. Taskar, and M. I. Jordan, Joint covariate selection and joint subspace selection for multiple classification problems, Statistics and Computing, vol.20, issue.2, pp.231-252, 2010.
DOI : 10.1007/s11222-008-9111-x

B. A. Olshausen and D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research, vol.37, issue.23
DOI : 10.1016/S0042-6989(97)00169-7

V. Ordonez, G. Kulkarni, and T. L. Berg, Im2text: Describing images using 1 million captioned photographs, Advances in Neural Information Processing Systems

M. R. Osborne, B. Presnell, and B. A. Turlach, On the lasso and its dual, Journal of Computational and Graphical Statistics, vol.9, issue.2, pp.319-337, 2000.

A. Owen, A robust hybrid of lasso and ridge regression
DOI : 10.1090/conm/443/08555

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02
DOI : 10.3115/1073083.1073135

P. M. Pardalos and N. Kovoor, An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds, Mathematical Programming, vol.34, issue.3, pp.321-328, 1990.
DOI : 10.1007/BF01585748

N. Parikh and S. Boyd, Proximal algorithms. Foundations and Trends in Optimization, pp.123-231, 2013.

M. Perman, J. Pitman, and M. Yor, Size-biased sampling of Poisson point processes and excursions, Probability Theory and Related Fields, vol.92, pp.21-39, 1992.

Y. Rabani, L. J. Schulman, and C. Swamy, Approximation algorithms for labeling hierarchical taxonomies, ACM-SIAM Symposium on Discrete Algorithms

R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, Self-taught learning, Proceedings of the 24th international conference on Machine learning, ICML '07
DOI : 10.1145/1273496.1273592

F. Rapaport, E. Barillot, and J. Vert, Classification of arrayCGH data using fused SVM, Bioinformatics, vol.24, issue.13
DOI : 10.1093/bioinformatics/btn188

URL : https://hal.archives-ouvertes.fr/inserm-00293893

A. Ratnaparkhi, Maximum entropy models for natural language ambiguity resolution, PhD thesis, University of Pennsylvania, 1998.

P. Ravikumar, M. J. Wainwright, and J. Lafferty, High-dimensional Ising model selection using ℓ1-regularized logistic regression, Annals of Statistics, vol.38, issue.3

P. Ravikumar, A. Tewari, and E. Yang, On the ndcg consistency of listwise ranking methods, Proceedings of the Conference on Artificial Intelligence and Statistics

E. Reiter and R. Dale, Building applied natural language generation systems, Natural Language Engineering, vol.3, issue.1, pp.57-87
DOI : 10.1017/S1351324997001502

B. Roark, M. Saraclar, M. Collins, and M. Johnson, Discriminative language modeling with conditional random fields and the perceptron algorithm, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics , ACL '04
DOI : 10.3115/1218955.1218962

S. Roch, Markov models on trees: Reconstruction and applications

R. Rosenfeld, A maximum entropy approach to adaptive statistical language modeling, Computer Speech and Language

R. Rosenfeld, Adaptive statistical language modeling: A maximum entropy approach, PhD thesis, Carnegie Mellon University, 1994.

A. Schrijver, Combinatorial optimization, 2003.

H. Schwenk, Efficient training of large neural networks for language modeling, 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541)
DOI : 10.1109/IJCNN.2004.1381158

URL : https://hal.archives-ouvertes.fr/hal-01434489

H. Schwenk and J. Gauvain, Training neural network language models on very large corpora, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing , HLT '05
DOI : 10.3115/1220575.1220601

URL : https://hal.archives-ouvertes.fr/hal-01434250

N. Z. Shor, Minimization methods for non-differentiable functions
DOI : 10.1007/978-3-642-82118-9

C. N. Silla-jr and A. A. Freitas, A survey of hierarchical classification across different application domains, Data Mining and Knowledge Discovery, vol.1, issue.487, pp.3-4, 2011.
DOI : 10.1007/s10618-010-0175-9

A. J. Smola and B. Schölkopf, Sparse greedy matrix approximation for machine learning, Proceedings of the International Conference on Machine Learning

S. Sra, S. Nowozin, and S. J. Wright, Optimization for machine learning

N. Srebro, J. D. Rennie, and T. S. Jaakkola, Maximum-margin matrix factorization, Advances in Neural Information Processing Systems

M. Steel, Some statistical aspects of the maximum parsimony method, Experientia Supplementum
DOI : 10.1007/978-3-0348-8114-2_9

S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, Recurrent neural network based language modeling in meeting recognition, Proceedings of the Annual Conference of the International Speech Communication Association

I. Steinwart, Consistency of Support Vector Machines and Other Regularized Kernel Classifiers, IEEE Transactions on Information Theory, vol.51, issue.1
DOI : 10.1109/TIT.2004.839514

A. Stolcke, SRILM - an extensible language modeling toolkit, Proceedings of International Conference on Spoken Language Processing, 2002.

Y. W. Teh, A hierarchical Bayesian language model based on Pitman-Yor processes, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL , ACL '06
DOI : 10.3115/1220175.1220299

R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society Series B, vol.8, issue.1

R. Tibshirani and T. Hastie, Margin trees for high-dimensional classification, Journal of Machine Learning Research, vol.8

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.99, issue.1, pp.91-108, 2005.
DOI : 10.1016/S0140-6736(02)07746-2

K. E. Train, Discrete Choice Methods with Simulation, 2003.

J. A. Tropp and A. C. Gilbert, Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit, IEEE Transactions on Information Theory, vol.53, issue.12
DOI : 10.1109/TIT.2007.909108

URL : http://authors.library.caltech.edu/9490/1/TROieeetit07.pdf

J. A. Tropp, A. C. Gilbert, and M. J. Strauss, Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit, Signal Processing, vol.86, issue.3
DOI : 10.1016/j.sigpro.2005.05.030

J. A. Tropp, A. C. Gilbert, and M. J. Strauss, Algorithms for simultaneous sparse approximation. Part II: Convex relaxation, Signal Processing, vol.86, issue.3
DOI : 10.1016/j.sigpro.2005.05.031

P. Tseng, Convergence of a block coordinate descent method for nondifferentiable minimization, Journal of Optimization Theory and Applications

P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, SIAM Journal on Optimization

E. Ukkonen, On-line construction of suffix trees

V. N. Vapnik, Estimation of dependencies based on empirical data, 1982.

V. N. Vapnik, The nature of statistical learning theory

V. N. Vapnik, Statistical learning theory

G. Varoquaux, R. Jenatton, A. Gramfort, G. Obozinski, B. Thirion et al., Sparse structured dictionary learning for brain resting-state activity modeling. NIPS Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions

V. Vural and J. G. Dy, A hierarchical method for multi-class support vector machines, Twenty-first international conference on Machine learning , ICML '04
DOI : 10.1145/1015330.1015427

M. J. Wainwright, Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming

Y. Wang, A. Acero, and C. Chelba, Is word error rate a good indicator for spoken language understanding accuracy, IEEE Workshop on Automatic Speech Recognition and Understanding

D. J. Ward, Adaptive computer interfaces, 2001.

L. Wasserman and K. Roeder, High-dimensional variable selection, The Annals of Statistics, vol.37, issue.5A, pp.2178-2201, 2009.
DOI : 10.1214/08-AOS646

K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature hashing for large scale multitask learning, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09
DOI : 10.1145/1553374.1553516

J. Weston, S. Bengio, and D. Grangier, Label embedding trees for multiclass classification, Advances in Neural Information Processing Systems

C. K. Williams and M. Seeger, Effect of the input density distribution on kernel-based classifiers, Proceedings of the International Conference on Machine Learning

D. Wipf and B. Rao, ℓ0-norm minimization for basis selection, Advances in Neural Information Processing Systems

D. Wipf and B. Rao, An Empirical Bayesian Strategy for Solving the Simultaneous Sparse Approximation Problem, IEEE Transactions on Signal Processing, vol.55, issue.7, pp.3704-3716, 2007.
DOI : 10.1109/TSP.2007.894265

D. M. Witten, R. Tibshirani, and T. Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics, vol.10, issue.3
DOI : 10.1093/biostatistics/kxp008

I. H. Witten and T. C. Bell, The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression, IEEE Transactions on Information Theory, vol.37, issue.4, pp.1085-1094, 1991.
DOI : 10.1109/18.87000

F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh, A stochastic memoizer for sequence data, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2010.
DOI : 10.1145/1553374.1553518

J. Wu and S. Khudanpur, Efficient training methods for maximum entropy language modeling, Proceedings of International Conference on Spoken Language Processing, 2000.

L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research

J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid matching using sparse coding for image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

J. Yang, K. Yu, and T. Huang, Supervised translation-invariant sparse coding, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/CVPR.2010.5539958

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.206.339

Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, Corpus-guided sentence generation of natural images, Proceedings of the Conference on Empirical Methods in Natural Language Processing

M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.58, issue.1
DOI : 10.1198/016214502753479356

M. Yuan and Y. Lin, On the non-negative garrotte estimator, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.101, issue.2
DOI : 10.1111/j.1467-9868.2005.00503.x

T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, The Annals of Statistics, vol.32, issue.1
DOI : 10.1214/aos/1079120130

Y. Zhang, A. d'Aspremont, and L. El Ghaoui, Sparse PCA: Convex relaxations, algorithms and applications. Handbook on Semidefinite, Cone and Polynomial Optimization, pp.915-940, 2008.

P. Zhao and B. Yu, On model selection consistency of Lasso, Journal of Machine Learning Research, vol.7, issue.22

P. Zhao, G. Rocha, and B. Yu, The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics

G. Zipf, Selective studies and the principle of relative frequency in language

H. Zou, The Adaptive Lasso and Its Oracle Properties, Journal of the American Statistical Association, vol.101, issue.476
DOI : 10.1198/016214506000000735

H. Zou and H. H. Zhang, On the adaptive elastic-net with a diverging number of parameters, The Annals of Statistics, vol.37, issue.4
DOI : 10.1214/08-AOS625