... A. and N. H. would like to acknowledge the Dagstuhl Seminar No 10361 on the Theory of Evolutionary Computation for inspiring their work on natural gradients and beyond. This work was partially supported by the ANR-2010-COSI-002 grant (SIMINOLE) of the French National Research Agency.

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A Learning Algorithm for Boltzmann Machines*, Cognitive Science, vol.85, issue.1, pp.147-169, 1985.
DOI : 10.1207/s15516709cog0901_7

Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi, Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies, Proceedings of Parallel Problem Solving from Nature -PPSN XI, pp.154-163, 2010.
DOI : 10.1007/978-3-642-15844-5_16

Y. Akimoto, A. Auger, and N. Hansen, Convergence of the Continuous Time Trajectories of Isotropic Evolution Strategies on Monotonic $\mathcal C^2$ -composite Functions, Lecture Notes in Computer Science, vol.7491, issue.1, pp.42-51, 2012.
DOI : 10.1007/978-3-642-32937-1_5

S. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, vol.37, issue.2, pp.251-276, 1998.
DOI : 10.1103/PhysRevLett.76.2188

H. Shun-ichi-amari and . Nagaoka, Methods of information geometry, volume 191 of Translations of Mathematical Monographs, 2000.

D. V. Arnold, Weighted multirecombination evolution strategies. Theoretical computer science, pp.18-37, 2006.
DOI : 10.1016/j.tcs.2006.04.003
URL : http://doi.org/10.1016/j.tcs.2006.04.003

S. Baluja, Population based incremental learning: A method for integrating genetic search based function optimization and competitve learning, 1994.

S. Baluja and R. Caruana, Removing the Genetics from the Standard Genetic Algorithm, Proceedings of ICML'95, pp.38-46, 1995.
DOI : 10.1016/B978-1-55860-377-6.50014-1

Y. Bengio, P. Lamblin, V. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, Advances in Neural Information Processing Systems 19, pp.153-160, 2007.

Y. Bengio, A. C. Courville, and P. Vincent, Unsupervised feature learning and deep learning: A review and new perspectives, 1206.

A. Berny, Selection and Reinforcement Learning for Combinatorial Optimization, Parallel Problem Solving from Nature PPSN VI, pp.601-610, 1917.
DOI : 10.1007/3-540-45356-3_59

A. Berny, An adaptive scheme for real function optimization acting as a selection operator, 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks. Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No.00EX448), pp.140-149, 2000.
DOI : 10.1109/ECNN.2000.886229

A. Berny, Boltzmann machine for population-based incremental learning, ECAI, pp.198-202, 2002.

H. Beyer, The Theory of Evolution Strategies. Natural Computing Series, 2001.

J. Branke, C. Lode, and J. L. Shapiro, Addressing sampling errors and diversity loss in UMDA, Proceedings of the 9th annual conference on Genetic and evolutionary computation , GECCO '07, pp.508-515, 2007.
DOI : 10.1145/1276958.1277068

J. Burbea, Informative geometry of probability spaces, Exposition. Math, vol.4, issue.4, pp.347-378, 1986.

M. Thomas, J. A. Cover, and . Thomas, Elements of information theory.W i l e y - Interscience, 2006.

D. P. Pieter-tjerk-de-boer, S. Kroese, R. Y. Mannor, and . Rubinstein, A Tutorial on the Cross-Entropy Method, Annals of Operations Research, vol.16, issue.3, pp.19-67, 2005.
DOI : 10.1007/s10479-005-5724-z

G. Desjardins, A. Courville, Y. Bengio, P. Vincent, and O. Dellaleau, Parallel tempering for training of restricted Boltzmann machines, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

M. Gallagher and M. Frean, Population-Based Continuous Optimization, Probabilistic Modelling and Mean Shift, Evolutionary Computation, vol.12, issue.4, pp.29-42, 2005.
DOI : 10.1023/A:1013500812258
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.139.8825

Z. Ghahramani, Unsupervised Learning, Advanced Lectures on Machine Learning, pp.72-112, 2004.
DOI : 10.1080/01621459.1995.10476550

T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and J. Schmidhuber, Exponential natural evolution strategies, Proceedings of the 12th annual conference on Genetic and evolutionary computation, GECCO '10, pp.393-400, 2010.
DOI : 10.1145/1830483.1830557

N. Hansen, The CMA evolution strategy: a comparing review Advances on estimation of distribution algorithms, pp.75-102, 2006.

N. Hansen and S. Kern, Evaluating the CMA Evolution Strategy on Multimodal Test Functions, Parallel Problem Solving from Nature PPSN VIII, pp.282-291, 2004.
DOI : 10.1007/978-3-540-30217-9_29

N. Hansen and A. Ostermeier, Completely Derandomized Self-Adaptation in Evolution Strategies, Evolutionary Computation, vol.9, issue.2, pp.159-195, 2001.
DOI : 10.1016/0004-3702(95)00124-7

G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, vol.22, issue.8, pp.1771-1800, 2002.
DOI : 10.1162/089976600300015385

G. E. Hinton, S. Osindero, and Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, vol.18, issue.7, pp.1527-1554, 2006.
DOI : 10.1162/jmlr.2003.4.7-8.1235

R. Hooke and T. A. Jeeves, `` Direct Search'' Solution of Numerical and Statistical Problems, Journal of the ACM, vol.8, issue.2, pp.212-229, 1961.
DOI : 10.1145/321062.321069

G. A. Jastrebski and D. V. Arnold, Improving Evolution Strategies through Active Covariance Matrix Adaptation, 2006 IEEE International Conference on Evolutionary Computation, pp.2814-2821, 2006.
DOI : 10.1109/CEC.2006.1688662

H. Jeffreys, An Invariant Form for the Prior Probability in Estimation Problems, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol.186, issue.1007, pp.453-461, 1946.
DOI : 10.1098/rspa.1946.0056

E. Peter, E. Kloeden, and . Platen, Numerical solution of stochastic differential equations, Applications of Mathematics, vol.23

S. Kullback, Information theory and statistics, 1968.

P. Larranaga and J. A. Lozano, Estimation of distribution algorithms: A new tool for evolutionary computation, 2002.
DOI : 10.1007/978-1-4615-1539-5

N. Le-roux, P. Manzagol, and Y. Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2007.

L. Malagò, M. Matteucci, and B. D. Seno, An information geometry perspective on estimation of distribution algorithms, Proceedings of the 2008 GECCO conference companion on Genetic and evolutionary computation, GECCO '08, pp.2081-2088, 2008.
DOI : 10.1145/1388969.1389026

L. Malagò, M. Matteucci, and G. Pistone, Towards the geometry of estimation of distribution algorithms based on the exponential family, Proceedings of the 11th workshop proceedings on Foundations of genetic algorithms, FOGA '11, pp.230-242, 2011.
DOI : 10.1145/1967654.1967675

J. Ashworth, N. , and R. Mead, A simplex method for function minimization, The Computer Journal, pp.308-313, 1965.

M. Pelikan, D. E. Goldberg, and F. G. Lobo, A survey of optimization by building and using probabilistic models, Proceedings of the 2000 American Control Conference. ACC (IEEE Cat. No.00CH36334), pp.5-20, 2002.
DOI : 10.1109/ACC.2000.879173

C. Rao, Information and the Accuracy Attainable in the Estimation of Statistical Parameters, Bull. Calcutta Math. Soc, vol.37, pp.81-91, 1945.
DOI : 10.1007/978-1-4612-0919-5_16

I. Rechenberg, Evolutionsstrategie '94. Frommann-Holzboog Verlag, 1994.

L. Schwartz, Analyse. II, volume 43 of Collection Enseignement des Sciences [Collection: The Teaching of Science], Calcul différentiel et équations différentielles, 1992.
URL : https://hal.archives-ouvertes.fr/tel-00308504

H. Schwefel, Evolution and Optimum Seeking. Sixth-generation computer technology series, 1995.

F. Silva and L. Almeida, Acceleration techniques for the backpropagation algorithm, Neural Networks, pp.110-119, 1990.
DOI : 10.1007/3-540-52255-7_32

P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, Parallel Distributed Processing, pp.194-281, 1986.

Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber, Efficient natural evolution strategies, Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO '09, pp.539-546, 2009.
DOI : 10.1145/1569901.1569976

V. Torczon, On the Convergence of Pattern Search Algorithms, SIAM Journal on Optimization, vol.7, issue.1, pp.1-25, 1997.
DOI : 10.1137/S1052623493250780

M. Toussaint, Notes on information geometry and evolutionary processes. eprint arXiv:nlin/0408040, 2004.

M. Wagner, A. Auger, and M. Schoenauer, EEDA : A New Robust Estimation of Distribution Algorithms, 2004.
URL : https://hal.archives-ouvertes.fr/inria-00070802

D. Whitley, The genitor algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, Proceedings of the third international conference on Genetic algorithms, pp.116-121, 1989.

L. Arnold, H. Paugam-moisy, and M. Sebag, Optimisation de la topologie pour les réseaux de neurones profonds, 17e congrès francophone AFRIF?AFIA Reconnaissance des Formes et Intelli-gence Artificielle, 2010.

L. Arnold, H. Paugam-moisy, and M. Sebag, Unsupervised layer-wise model selection in deep neural networks, 19th European Conference on Artificial Intelligence Lisbon Portugal, pp.915-920, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00488338

L. Arnold and S. Rebecchi, Sylvain Chevallier, and Hélène Paugam- Moisy. An introduction to deep learning, European Symposium on Artificial Neural Networks, 2011.

L. Arnold, A. Auger, N. Hansen, and Y. Ollivier, Informationgeometric optimization algorithms: A unifying picture via invariance principles ArXiv e-prints, 2011.

L. Arnold and Y. Ollivier, Layer-wise learning of deep generative models ArXiv e-prints, 2012.

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A Learning Algorithm for Boltzmann Machines*, Cognitive Science, vol.85, issue.1, pp.147-169, 1985.
DOI : 10.1207/s15516709cog0901_7

R. Prescott-adams, H. M. Wallach, and Z. Ghahramani, Learning the structure of deep sparse graphical models, Journal of Machine Learning Research -Proceedings Track, vol.9, pp.1-8, 2010.

G. Alain and Y. Bengio, What regularized auto-encoders learn from the data generating distribution. ArXiv e-prints, 2012.

. Shun-ichi-amari, Natural Gradient Works Efficiently in Learning, Neural Computation, vol.37, issue.2, pp.251-276, 1998.
DOI : 10.1103/PhysRevLett.76.2188

H. Shun-ichi-amari, K. Park, and . Fukumizu, Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons, Neural Computation, vol.12, issue.6, pp.1399-1409, 2000.
DOI : 10.1162/089976698300017007

S. Baluja, Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning, 1994.

R. M. Bell, Y. Koren, and C. Volinsky, The bellkor solution to the netflix prize, 2007.

Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends?? in Machine Learning, vol.2, issue.1, p.80, 2007.
DOI : 10.1561/2200000006

Y. Bengio, Deep Learning of Representations, 2013.
DOI : 10.1007/978-3-642-36657-4_1

Y. Bengio and O. Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation, vol.17, issue.6, pp.1601-1621, 2009.
DOI : 10.1145/1390156.1390290
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.334.5982

Y. Bengio and X. Glorot, Understanding the difficulty of training deep feedforward neural networks, Proceedings of AISTATS 2010, pp.249-256, 2010.

Y. Bengio and Y. Lecun, Scaling learning algorithms towards ai In Large-Scale Kernel Machines, p.79, 2007.

Y. Bengio and É. Thibodeau-laufer, Deep generative stochastic networks trainable by backprop. ArXiv e-prints, 2013.

Y. Bengio, O. Delalleau, and N. L. Roux, The curse of highly variable functions for local kernel machines, Advances in Neural Information Processing Systems 18, p.79, 2006.

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, Advances in Neural Information Processing Systems 19, pp.153-160, 2007.

Y. Bengio, A. Courville, and P. Vincent, Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206, p.2012

Y. Bengio, L. Yao, G. Alain, and P. Vincent, Generalized denoising auto-encoders as generative models. ArXiv e-prints, 2013.

J. Bergstra and Y. Bengio, Random search for hyper-parameter optimization, Journal of Machine Learning Research, vol.13, pp.281-305, 2012.

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyper-parameter optimization, Advances in Neural Information Processing Systems, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00642998

C. M. Bishop, Neural Networks for Pattern Recognition, 1995.

S. Borman, The expectation maximization algorithm: A short tutorial, 2004.

H. Bourlard and Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biological Cybernetics, vol.13, issue.4-5, pp.291-294, 1988.
DOI : 10.1121/1.395916

O. Breuleux, Y. Bengio, and P. Vincent, Quickly Generating Representative Samples from an RBM-Derived Process, Neural Computation, vol.23, issue.8, pp.2053-2073, 2011.
DOI : 10.1080/17442509908834179

A. Miguel, G. E. Carreira-perpiñán, and . Hinton, On contrastive divergence learning, In Artificial Intelligence and Statistics, 2005.

K. Cho, T. Raiko, A. Ilin, and J. Karhunen, A Two-Stage Pretraining Algorithm for Deep Boltzmann Machines, Proceedings of the NIPS 2012 Workshop on Deep Learning and Unsupervised Feature Learning, p.2012
DOI : 10.1007/978-3-642-40728-4_14

D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, Deep neural networks segment neuronal membranes in electron microscopy images, NIPS, pp.2852-2860

D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, Multi-column deep neural network for traffic sign classification, Neural Networks, vol.32, pp.333-338
DOI : 10.1016/j.neunet.2012.02.023

D. Claudiu-ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, Deep big simple neural nets excel on handwritten digit recognition, p.80, 2010.

A. Coates, A. Ng, and H. Lee, An analysis of single-layer networks in unsupervised feature learning, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

R. Collobert and J. Weston, A unified architecture for natural language processing, Proceedings of the 25th international conference on Machine learning, ICML '08, 2008.
DOI : 10.1145/1390156.1390177

A. Courville, J. Bergstra, and Y. Bengio, Unsupervised models of images by spike-and-slab rbms, Proceedings of the 28th International Conference on Machine Learning (ICML- 11), pp.1145-1152, 2011.

A. Courville, J. Bergstra, and Y. Bengio, The spike and slab restricted boltzmann machine, Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.233-241

R. Threlkeld and C. , Probability, frequency and reasonable expectation, American Journal of Physics, vol.14, pp.1-13, 1946.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems (MCSS), pp.303-314, 1989.

G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.1, pp.30-42
DOI : 10.1109/TASL.2011.2134090

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological), vol.39, pp.1-38, 1977.

L. Deng, M. L. Seltzer, D. Yu, A. Acero, A. Rahman-mohamed et al., Binary coding of speech spectrograms using a deep auto-encoder, INTERSPEECH, pp.1692-1695, 2010.

G. Desjardins, A. Courville, Y. Bengio, P. Vincent, and O. Dellaleau, Parallel tempering for training of restricted boltzmann machines, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

G. Desjardins, A. Courville, and Y. Bengio, On tracking the partition function, Advances in Neural Information Processing Systems 24, pp.2501-2509, 2011.

G. Desjardins, R. Pascanu, A. Courville, and Y. Bengio, Metric-free natural gradient for joint-training of boltzmann machines. CoRR, abs/1301, 2013.

D. Erhan, P. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, The difficulty of training deep architectures and the effect of unsupervised pre-training, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), p.83, 2009.

B. J. Frey, Continuous sigmoidal belief networks trained using slice sampling, NIPS, pp.452-458, 1996.

J. Brendan, G. E. Frey, and . Hinton, Variational learning in nonlinear gaussian belief networks, Neural Computation, vol.11, issue.1, pp.193-213, 1999.

K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, vol.40, issue.4, pp.193-202, 1980.
DOI : 10.1007/BF00344251

Z. Ghahramani, Unsupervised Learning, Advanced Lectures on Machine Learning, pp.72-112, 2004.
DOI : 10.1080/01621459.1995.10476550

X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS)
URL : https://hal.archives-ouvertes.fr/hal-00752497

I. J. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Ng, Measuring invariances in deep networks, Advances in Neural Information Processing Systems 22, pp.646-654, 2009.

I. J. Goodfellow, A. Courville, and Y. Bengio, Spike-and-slab sparse coding for unsupervised feature discovery. CoRR, abs/1201, p.2012

I. J. Goodfellow, A. Courville, and Y. Bengio, Joint training of deep boltzmann machines for classification. ArXiv e-prints, 2013.

A. Graves, Offline Arabic Handwriting Recognition with Multidimensional Recurrent Neural Networks, Guide to OCR for Arabic Scripts, pp.297-313
DOI : 10.1007/978-1-4471-4072-6_12

A. Graves and J. Schmidhuber, Offline Arabic Handwriting Recognition with Multidimensional Recurrent Neural Networks, NIPS, pp.545-552, 2008.
DOI : 10.1007/978-1-4471-4072-6_12

R. Maya, Y. Gupta, and . Chen, Theory and use of the em algorithm. Found. Trends Signal Process, pp.223-296

N. Hansen, The CMA evolution strategy: A tutorial, 2008.
URL : https://hal.archives-ouvertes.fr/hal-01297037

S. Herculano, The human brain in numbers: a linearly scaled-up primate brain, Frontiers in Human Neuroscience, vol.3, issue.00031, p.31, 2009.
DOI : 10.3389/neuro.09.031.2009

G. E. Hinton, Connectionist learning procedures, Artificial Intelligence, vol.40, issue.1-3, pp.185-234, 1989.
DOI : 10.1016/0004-3702(89)90049-0
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.216.5594

G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, vol.22, issue.8, pp.1771-1800, 2002.
DOI : 10.1162/089976600300015385

G. E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, 2010.
DOI : 10.1073/pnas.79.8.2554

E. Geoffrey, R. Hinton, and . Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, issue.5786, pp.313504-507, 2006.

G. E. Hinton, S. Osindero, and Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, vol.18, issue.7, pp.1527-1554, 2006.
DOI : 10.1162/jmlr.2003.4.7-8.1235

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, pp.81-91, 2012.

A. L. Hodgkin and A. F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve, The Journal of Physiology, vol.117, issue.4, pp.500-544, 1952.
DOI : 10.1113/jphysiol.1952.sp004764

K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, vol.2, issue.5, pp.359-366, 1989.
DOI : 10.1016/0893-6080(89)90020-8

A. Hyvärinen, Estimation of non-normalized statistical models by score matching, J. Mach. Learn. Res, vol.6, pp.695-709, 2005.

K. Jarrett, K. Kavukcuoglu, Y. Marc-'aurelio-ranzato, and . Lecun, What is the best multi-stage architecture for object recognition?, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459469

K. Kavukcuoglu, Y. Marc-'aurelio-ranzato, and . Lecun, Fast inference in sparse coding algorithms with applications to object recognition. CoRR, abs/1010, p.2010

K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu et al., Learning convolutional feature hierarchies for visual recognition, Advances in Neural Information Processing Systems 23, pp.1090-1098

S. Kirkpatrick, C. Daniel-gelatt-jr, and M. P. Vecchi, Optimization by Simulated Annealing, Science, vol.220, issue.4598, pp.671-680, 1983.
DOI : 10.1126/science.220.4598.671

A. Krizhevsky, Convolutional deep belief networks on cifar-10, p.84, 2010.

A. Krizhevsky and G. E. Hinton, Learning multiple layers of features from tiny images, 2009.

H. Larochelle and Y. Bengio, Classification using discriminative restricted Boltzmann machines, Proceedings of the 25th international conference on Machine learning, ICML '08, pp.536-543, 2008.
DOI : 10.1145/1390156.1390224
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.149.8286

H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, Exploring strategies for training deep neural networks, The Journal of Machine Learning Research, vol.10, issue.86, pp.1-40, 2009.

H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, Learning algorithms for the classification restricted boltzmann machine, J. Mach. Learn. Res, vol.13, pp.643-669

Q. Le, R. Marc-'aurelio-ranzato, M. Monga, K. Devin, G. Chen et al., Building high-level features using large scale unsupervised learning, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, p.2012
DOI : 10.1109/ICASSP.2013.6639343

Y. Le-le-cun, B. Boser, J. S. Denker, R. E. Howard, W. E. Habbard et al., Handwritten digit recognition with a back-propagation network Advances in neural information processing systems 2, pp.396-404, 1990.

N. L. , R. , and Y. Bengio, Representational power of restricted boltzmann machines and deep belief networks, Neural Computation, vol.20, issue.122, pp.1631-1649, 2008.

N. L. , R. , and A. W. Fitzgibbon, A fast natural newton method, ICML, pp.623-630, 2010.

N. Le-roux, P. Manzagol, and Y. Bengio, Top-moumoute online natural gradient algorithm, Advances in Neural Information Processing Systems, 2007.

Y. Lecun, Generalization and network design strategies, Connectionism in Perspective, 1989.

Y. Lecun and Y. Bengio, Convolutional networks for images, speech, and time-series The Handbook of Brain Theory and Neural Networks, p.82, 1995.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, pp.2278-2324, 1998.
DOI : 10.1109/5.726791

Y. Lecun, L. Bottou, G. B. Orr, and K. Müller, Efficient backprop, Neural Networks: Tricks of the Trade, 1998.

H. Lee, C. Ekanadham, and A. Y. Ng, Sparse deep belief net model for visual area v2, Advances in Neural Information Processing Systems, 2007.

H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.77-80, 2009.
DOI : 10.1145/1553374.1553453

H. Lee, Y. Largman, P. Pham, and A. Y. Ng, Unsupervised feature learning for audio classification using convolutional deep belief networks, Advances in Neural Information Processing Systems 22, pp.1096-1104, 2009.

M. Benjamin, K. Marlin, B. Swersky, N. Chen, and . De-freitas, Inductive principles for restricted boltzmann machine learning, Journal of Machine Learning Research -Proceedings Track, vol.9, pp.509-516, 2010.

J. Martens, Deep learning via hessian-free optimization, Proceedings of the 27th Annual International Conference on Machine Learning, pp.735-742, 2010.
DOI : 10.1007/978-3-642-35289-8_27

J. Martens and I. Sutskever, Learning recurrent neural networks with hessian-free optimization, Lise Getoor and Tobias Scheffer Proceedings of the 28th Annual International Conference on Machine Learning, pp.1033-1040, 2011.
DOI : 10.1007/978-3-642-35289-8_27
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.296.4704

U. Meier, D. Claudiu-ciresan, L. M. Gambardella, and J. Schmidhuber, Better Digit Recognition with a Committee of Simple Neural Nets, 2011 International Conference on Document Analysis and Recognition, pp.1250-1254, 2011.
DOI : 10.1109/ICDAR.2011.252

R. Memisevic, Non-linear latent factor models for revealing structure in high-dimensional data, 2008.

R. Memisevic and G. Hinton, Unsupervised Learning of Image Transformations, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383036

R. Memisevic and G. E. Hinton, Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines, Neural Computation, vol.17, issue.6, pp.1473-1492, 2010.
DOI : 10.1007/3-540-47969-4_30

G. Mesnil, Y. Dauphin, X. Glorot, S. Rifai, Y. Bengio et al., Unsupervised and transfer learning challenge: a deep learning approach, JMLR W& CP: Proceedings of the Unsupervised and Transfer Learning challenge and workshop, pp.97-110

L. Marvin, S. Minsky, and . Papert, Perceptrons: An introduction to computational geometry, 1969.

G. Montavon and K. Müller, Deep Boltzmann Machines and the Centering Trick, LNCS, vol.10, issue.5, p.2012
DOI : 10.1007/3-540-49430-8_11

I. Murray and R. Salakhutdinov, Evaluating probabilities under highdimensional latent variable models, Advances in Neural Information Processing Systems, p.88, 2009.

V. Nair and G. E. Hinton, 3d object recognition with deep belief nets, Advances in Neural Information Processing Systems 22, pp.1339-1347, 2009.

V. Nair and G. E. Hinton, Rectified linear units improve restricted boltzmann machines, ICML '10: Proceedings of the 27th international conference on Machine learning, pp.807-814, 2010.

M. Radford and . Neal, Probabilistic inference using markov chain monte carlo methods, 1993.

M. Radford and . Neal, Annealed importance sampling, 1998.

A. Ng, Sparse autoencoder. CS294A Lecture notes, 2011.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, ICML, pp.689-696, 2011.

J. Nocedal and S. J. Wright, Numerical optimization, 2006.
DOI : 10.1007/b98874

B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature, vol.381, pp.607-609, 1996.

B. A. Olshausen and D. J. Field, Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vision Research, vol.37, issue.23, pp.3311-3325, 1997.

A. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton et al., Deep belief networks using discriminative features for phone recognition, ICASSP, pp.5060-5063, 2011.

M. Ranzato, A. Krizhevsky, and G. E. Hinton, Factored 3-way restricted Boltzmann machines for modeling natural images, Journal of Machine Learning Research - Proceedings Track, vol.9, pp.621-628, 2010.

S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, Contractive auto-encoders: Explicit invariance during feature extraction, ICML, pp.833-840, 2011.

S. Rifai, Y. Bengio, Y. Dauphin, and P. Vincent, A generative process for sampling contractive auto-encoders, International Conference on Machine Learning, 2012.

C. P. Robert and G. Casella, Monte Carlo Statistical Methods (Springer Texts in Statistics), 2005.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation, Parallel distributed processing: explorations in the microstructure of cognition, pp.318-362, 1986.

R. Salakhutdinov, Learning and evaluating Boltzmann machines, 2008.

R. Salakhutdinov, Learning in Markov random fields using tempered transitions, Advances in Neural Information Processing Systems 22, pp.1598-1606, 2009.

R. Salakhutdinov and G. Hinton, Deep Boltzmann machines, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pp.448-455, 2009.

R. Salakhutdinov and G. Hinton, Semantic hashing, International Journal of Approximate Reasoning, vol.50, issue.7, pp.969-978, 2009.
DOI : 10.1016/j.ijar.2008.11.006

R. Salakhutdinov and G. E. Hinton, A better way to pretrain deep Boltzmann machines, NIPS, pp.2456-2464, 2012.

T. Schmah, G. E. Hinton, R. S. Zemel, S. L. Small, and S. C. Strother, Generative versus discriminative training of RBMs for classification of fMRI images, NIPS, pp.1409-1416, 2008.

T. J. Sejnowski, Higher-order Boltzmann machines, AIP Conference Proceedings, pp.398-403, 1986.
DOI : 10.1063/1.36246
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.1626

P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, Parallel Distributed Processing, pp.194-281, 1986.

N. Srivastava and R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, NIPS, pp.2231-2239, 2012.

I. Sutskever and G. E. Hinton, Deep, Narrow Sigmoid Belief Networks Are Universal Approximators, Neural Computation, vol.20, issue.11, pp.2629-2636, 2008.
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.131.5204

I. Sutskever, J. Martens, and G. Hinton, Generating text with recurrent neural networks, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pp.1017-1024, 2011.

K. Swersky, M. Ranzato, D. Buchman, B. M. Marlin, and N. de Freitas, On autoencoders and score matching for energy based models, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pp.1201-1208, 2011.

G. W. Taylor and G. E. Hinton, Factored conditional restricted Boltzmann machines for modeling motion style, ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pp.1025-1032, 2009.

G. W. Taylor, G. E. Hinton, and S. T. Roweis, Modeling human motion using binary latent variables, Advances in Neural Information Processing Systems 19, pp.1345-1352, 2007.

L. Theis, S. Gerwinn, F. Sinz, and M. Bethge, In all likelihood, deep belief is not enough, Journal of Machine Learning Research, vol.12, pp.3071-3096, 2011.

T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, Proceedings of the 25th international conference on Machine learning, ICML '08, pp.1064-1071, 2008.
DOI : 10.1145/1390156.1390290

T. Tieleman and G. E. Hinton, Using fast weights to improve persistent contrastive divergence, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.1033-1040, 2009.
DOI : 10.1145/1553374.1553506

P. Vincent, A Connection Between Score Matching and Denoising Autoencoders, Neural Computation, vol.23, issue.7, pp.1661-1674, 2011.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research, vol.11, pp.3371-3408, 2010.

D. H. Wolpert and W. G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, vol.1, issue.1, pp.67-82, 1997.

C. F. J. Wu, On the convergence properties of the EM algorithm, The Annals of Statistics, vol.11, issue.1, pp.95-103, 1983.