In the encoder-decoder architecture, the decoder RNN does not receive x directly, but rather φ(x), the features extracted from the input by the encoder RNN. In this case, our SEARNN classifier includes both the encoder and the decoder RNNs. 2. One could also add φ(x)
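To make the conditioning concrete, here is a minimal sketch of the encoder-decoder setup described above, assuming PyTorch; the module, its names, and all dimensions are illustrative choices, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder-decoder RNN sketch: the decoder only ever sees phi(x)."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, x, y_prev):
        # phi(x): the encoder's final hidden state summarizes the input x.
        _, phi_x = self.encoder(self.src_emb(x))
        # The decoder never receives x directly; it is initialized with
        # phi(x) and unrolled on the previous target tokens y_prev.
        dec_out, _ = self.decoder(self.tgt_emb(y_prev), phi_x)
        return self.out(dec_out)  # per-step scores over the target vocabulary

# Usage: a batch of 2 source sequences of length 5, targets of length 7.
model = EncoderDecoder(src_vocab=1000, tgt_vocab=1000)
x = torch.randint(0, 1000, (2, 5))
y_prev = torch.randint(0, 1000, (2, 7))
scores = model(x, y_prev)  # shape (2, 7, 1000)
```

In this view the whole module, encoder included, produces the per-step scores, which is why the SEARNN classifier is taken to comprise both RNNs.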

D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe et al., An Actor-Critic Algorithm for Sequence Prediction, Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017. (Cited on pages 117 and 141.)

M. Ballesteros, Y. Goldberg, C. Dyer, and N. Smith, Training with Exploration Improves a Greedy Stack-LSTM Parser, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

H. Bauschke and P. L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, 2011. (Cited on page 190.)
URL : https://hal.archives-ouvertes.fr/hal-01517477

A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, 2009.

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Advances in Neural Information Processing Systems 28 (NIPS), 2015.

D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 1989. (Cited on page 37.)

A. Beygelzimer, H. Daumé III, J. Langford, and P. Mineiro, Learning reductions that really work, Proceedings of the IEEE, 2016. (Cited on page 212.)

C. G. Broyden, The convergence of a class of double-rank minimization algorithms, IMA Journal of Applied Mathematics, 1970. (Cited on page 15.)

R. Bunel, M. Hausknecht, J. Devlin, R. Singh, and P. Kohli, Leveraging grammar and reinforcement learning for neural program synthesis, Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018. (Cited on page 141.)

A. Cauchy, Méthode générale pour la résolution des systèmes d'équations simultanées, Comptes-rendus hebdomadaires des séances de l'Académie des Sciences, 1847. (Cited on page 14.)

M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico, Report on the 11th IWSLT evaluation campaign, Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2014. (Cited on page 131.)

K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford, Learning to Search Better than Your Teacher, Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015. (Cited on pages 115, 117, and 147.)

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. (Cited on page 128.)
URL : https://hal.archives-ouvertes.fr/hal-01433235

R. Collobert, S. Bengio, and Y. Bengio, A parallel mixture of SVMs for very large scale problems, Neural Computation, 2002. (Cited on page 185.)

Z. Dai, Q. Xie, and E. Hovy, From credit assignment to entropy regularization: two new algorithms for neural sequence prediction, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018. (Cited on page 142.)

H. Daumé III and D. Marcu, Learning as search optimization: approximate large margin methods for structured prediction, Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005. (Cited on page 140.)

H. Daumé III, J. Langford, and D. Marcu, Search-based structured prediction, Machine Learning, 2009.

D. Davis, B. Edmunds, and M. Udell, The sound of APALM clapping: faster nonsmooth nonconvex optimization with stochastic asynchronous PALM, Advances in Neural Information Processing Systems 29 (NIPS), 2016.

C. De Sa, C. Zhang, K. Olukotun, and C. Ré, Taming the wild: A unified analysis of Hogwild!-style algorithms, Advances in Neural Information Processing Systems 28 (NIPS), 2015. (Cited on pages 21, 23, and 41.)

A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems 27 (NIPS), 2014a. (Cited on pages 3, 17, and 92.)
URL : https://hal.archives-ouvertes.fr/hal-01016843

A. Defazio, T. Caetano, and J. Domke, Finito: A faster, permutable incremental gradient method for big data problems, Proceedings of the 31st International Conference on Machine Learning (ICML), 2014b. (Cited on page 18.)

N. Ding and R. Soricut, Cold-Start Reinforcement Learning with Softmax Policy Gradient, Advances in Neural Information Processing Systems 30 (NIPS), 2017.

J. C. Duchi, S. Chaturapruek, and C. Ré, Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care, Advances in Neural Information Processing Systems 28 (NIPS), 2015. (Cited on pages 21, 23, and 33.)

S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato, Classical structured prediction losses for sequence to sequence learning, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018. (Cited on page 143.)

M. Elbayad, L. Besacier, and J. Verbeek, Token-level and sequence-level loss smoothing for RNN language models, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018. (Cited on page 142.)
URL : https://hal.archives-ouvertes.fr/hal-01790879

R. Fletcher, A new approach to variable-metric algorithms, The Computer Journal, 1970. (Cited on page 15.)

K. Gimpel and N. A. Smith, Softmax-margin CRFs: Training log-linear models with cost functions, Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2010. (Cited on page 126.)

Y. Goldberg and J. Nivre, A Dynamic Oracle for Arc-Eager Dependency Parsing, Proceedings of the 24th International Conference on Computational Linguistics (COLING), 2012. (Cited on page 125.)

D. Goldfarb, A family of variable-metric methods derived by variational means, Mathematics of Computation, 1970. (Cited on page 15.)

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative Adversarial Nets, Advances in Neural Information Processing Systems 27 (NIPS), 2014. (Cited on page 143.)

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016. (Cited on pages 106 and 109.)

J. Goodman, A. Vlachos, and J. Naradowsky, Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. (Cited on page 215.)

J. Goodman, Classes for fast maximum entropy training, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001. (Cited on page 147.)

A. Graves, Supervised sequence labelling with recurrent neural networks, 2012. (Cited on page 109.)

B. Gu, Z. Huo, and H. Huang, Asynchronous stochastic block coordinate descent with variance reduction, 2016.

R. Hannah and W. Yin, More iterations per second, same quality - why asynchronous algorithms may drastically outperform traditional ones, 2017. (Cited on page 19.)

T. Hazan and R. Urtasun, A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction, Advances in Neural Information Processing Systems 23 (NIPS), 2010. (Cited on page 126.)

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, 1997. (Cited on page 127.)

T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams, Variance Reduced Stochastic Gradient Descent with Neighbors, Advances in Neural Information Processing Systems 28 (NIPS), 2015. (Cited on pages 17 and 177.)
URL : https://hal.archives-ouvertes.fr/hal-01248672

C.-J. Hsieh, H.-F. Yu, and I. Dhillon, PASSCoDe: parallel asynchronous stochastic dual coordinate descent, Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

S. Jean, K. Cho, R. Memisevic, and Y. Bengio, On Using Very Large Target Vocabulary for Neural Machine Translation, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015. (Cited on page 130.)

R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems 26 (NIPS), 2013. (Cited on pages 3, 17, and 80.)

Y. Juan, Y. Zhuang, W. Chin, and C. Lin, Field-aware factorization machines for CTR prediction, Proceedings of the 10th ACM Conference on Recommender Systems, 2016. (Cited on page 93.)

Y. Keneshloo, T. Shi, N. Ramakrishnan, and C. K. Reddy, Deep reinforcement learning for sequence to sequence models, 2018. (Cited on page 113.)

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015. (Cited on page 128.)

J. Konečný and P. Richtárik, Semi-stochastic gradient descent methods, 2013. (Cited on page 185.)

M. Kääriäinen, Lower bounds for reductions, Talk at the Atomic Learning Workshop (TTI-C), 2006. (Cited on page 111.)

N. Le Roux, M. Schmidt, and F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, Advances in Neural Information Processing Systems 25 (NIPS), 2012. (Cited on pages 3, 17, and 69.)
URL : https://hal.archives-ouvertes.fr/hal-00674995

R. Leblond, F. Pedregosa, and S. Lacoste-Julien, ASAGA: asynchronous parallel SAGA, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01407833

R. Leblond, J.-B. Alayrac, A. Osokin, and S. Lacoste-Julien, SEARNN: Training RNNs with Global-Local Losses, Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018a. (Cited on pages 9, 10, and 115.)
URL : https://hal.archives-ouvertes.fr/hal-01950555

R. Leblond, F. Pedregosa, and S. Lacoste-Julien, Improved asynchronous parallel optimization analysis for stochastic incremental methods, Journal of Machine Learning Research, 2018b.
URL : https://hal.archives-ouvertes.fr/hal-01950558

Y. Lee, Y. Lin, and G. Wahba, Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data, Journal of the American Statistical Association, 2004. (Cited on page 125.)

D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research, 2004. (Cited on page 184.)

X. Lian, Y. Huang, Y. Li, and J. Liu, Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization, Advances in Neural Information Processing Systems 28 (NIPS), 2015.

J. Liu and S. J. Wright, Asynchronous stochastic coordinate descent: Parallelism and convergence properties, SIAM Journal on Optimization, 2015. (Cited on pages 24 and 207.)

J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar, An Asynchronous Parallel Stochastic Coordinate Descent Algorithm, Journal of Machine Learning Research, 2015.

C. Ma, V. Smith, M. Jaggi, M. I. Jordan, P. Richtárik et al., Adding vs. averaging in distributed primal-dual optimization, Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015. (Cited on page 185.)

J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, Identifying suspicious URLs: an application of large-scale online learning, Proceedings of the 26th International Conference on Machine Learning (ICML), 2009. (Cited on page 184.)

J. Mairal, Incremental majorization-minimization optimization with application to large-scale machine learning, SIAM Journal on Optimization, 2015. (Cited on page 18.)
URL : https://hal.archives-ouvertes.fr/hal-00948338

H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran et al., Perturbed iterate analysis for asynchronous stochastic optimization, SIAM Journal on Optimization, 2017. (Cited on pages 185 and 204.)

Q. Meng, W. Chen, J. Yu, T. Wang, Z. Ma et al., Asynchronous stochastic proximal optimization algorithms with variance reduction, Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017.

E. Moulines and F. R. Bach, Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, Advances in Neural Information Processing Systems 24 (NIPS), 2011. (Cited on page 40.)
URL : https://hal.archives-ouvertes.fr/hal-00608041

D. Needell, R. Ward, and N. Srebro, Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm, Advances in Neural Information Processing Systems 27 (NIPS), 2014. (Cited on page 51.)

Y. Nesterov, Introductory lectures on convex optimization, 2004. (Cited on page 189.)

Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 2012. (Cited on page 51.)

Y. Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming, 2013. (Cited on page 219.)

L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik et al., SGD and Hogwild! Convergence without the bounded gradients assumption, Proceedings of the 35th International Conference on Machine Learning (ICML), 2018. (Cited on page 23.)

F. Niu, B. Recht, C. Ré, and S. Wright, Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, Advances in Neural Information Processing Systems 24 (NIPS), 2011. (Cited on pages 4, 85, and 91.)

M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster et al., Reward Augmented Maximum Likelihood for Neural Structured Prediction, Advances in Neural Information Processing Systems 29 (NIPS), 2016. (Cited on page 142.)

X. Pan, M. Lam, S. Tu, D. Papailiopoulos, C. Zhang et al., Cyclades: Conflict-free Asynchronous Machine Learning, Advances in Neural Information Processing Systems 29 (NIPS), 2016.

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002. (Cited on page 131.)

F. Pedregosa, R. Leblond, and S. Lacoste-Julien, Breaking the nonsmooth barrier: A scalable parallel method for composite optimization, Advances in Neural Information Processing Systems 30 (NIPS), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01638058

Z. Peng, Y. Xu, M. Yan, and W. Yin, ARock: an algorithmic framework for asynchronous parallel coordinate updates, SIAM Journal on Scientific Computing, 2016. (Cited on pages 25 and 220.)

G. Pereyra, G. Tucker, J. Chorowski, L. Kaiser, and G. Hinton, Regularizing neural networks by penalizing confident output distributions, ICLR 2017 Workshop track, 2017. (Cited on page 135.)

P. Pletscher, C. S. Ong, and J. M. Buhmann, Entropy and Margin Maximization for Structured Output Learning, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2010. (Cited on page 126.)

S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez et al., Imagination-Augmented Agents for Deep Reinforcement Learning, Advances in Neural Information Processing Systems 30 (NIPS), 2017. (Cited on page 147.)

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, Sequence Level Training with Recurrent Neural Networks, Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.

S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola, On variance reduction in stochastic gradient descent and its asynchronous variants, Advances in Neural Information Processing Systems 28 (NIPS), 2015.

S. J. Reddi, A. Hefny et al., Stochastic Variance Reduction for Nonconvex Optimization, Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016. (Cited on page 60.)

S. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, Self-critical sequence training for image captioning, Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. (Cited on page 221.)

H. Robbins and S. Monro, A Stochastic Approximation Method, The Annals of Mathematical Statistics, 1951. (Cited on page 16.)

S. Ross and J. A. Bagnell, Reinforcement and Imitation Learning via Interactive No-Regret Learning, 2014. (Cited on pages 112 and 141.)

M. Schmidt, Convergence rate of stochastic gradient with constant step size, UBC Technical Report, 2014. (Cited on page 39.)

M. Schmidt et al., Non-uniform stochastic average gradient method for training conditional random fields, Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. (Cited on page 52.)

M. Schmidt, N. Le Roux, and F. Bach, Minimizing finite sums with the stochastic average gradient, 2016.
URL : https://hal.archives-ouvertes.fr/hal-00860051

S. Shalev-Shwartz and T. Zhang, Proximal stochastic dual coordinate ascent, 2012. (Cited on page 80.)

S. Shalev-Shwartz and T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization, Journal of Machine Learning Research, 2013. (Cited on pages 18 and 46.)

D. Shanno, Conditioning of quasi-Newton methods for function minimization, Mathematics of Computation, 1970. (Cited on page 15.)

S. Shen, Y. Cheng, Z. He, W. He, H. Wu et al., Minimum Risk Training for Neural Machine Translation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. (Cited on page 222.)

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature, 2017. (Cited on page 143.)

W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell, Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction, Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

I. Sutskever, O. Vinyals, and Q. Le, Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems 27 (NIPS), 2014. (Cited on pages 107 and 116.)

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. (Cited on page 142.)

B. Taskar, C. Guestrin, and D. Koller, Max-Margin Markov Networks, Advances in Neural Information Processing Systems 16 (NIPS), 2003. (Cited on page 128.)

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research, 2005. (Cited on page 124.)

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

R. J. Williams and J. Peng, Function optimization using connectionist reinforcement learning algorithms, Connection Science, 1991. (Cited on page 135.)

S. Wiseman and A. Rush, Sequence-to-Sequence Learning as Beam-Search Optimization, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

L. Xiao and T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM Journal on Optimization, 2014. (Cited on page 80.)

Y. You, X. Lian, J. Liu, H.-F. Yu, I. Dhillon et al., Asynchronous parallel greedy coordinate descent, Advances in Neural Information Processing Systems 29 (NIPS), 2016. (Cited on page 80.)

H.-F. Yu, H.-Y. Lo, H.-P. Hsieh, J.-K. Lou, T. G. McKenzie et al., Feature engineering and classifier ensemble for KDD Cup 2010, KDD Cup, 2010. (Cited on page 93.)

L. Yu, W. Zhang, J. Wang, and Y. Yu, SeqGAN: Sequence generative adversarial nets with policy gradient, Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017. (Cited on page 143.)

S.-Y. Zhao and W.-J. Li, Fast asynchronous parallel stochastic gradient descent, Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016. (Cited on pages 22, 23, and 47.)

T. Zhao, M. Yu, Y. Wang, R. Arora, and H. Liu, Accelerated mini-batch randomized block coordinate descent method, Advances in Neural Information Processing Systems 27 (NIPS), 2014. (Cited on page 96.)
