1. In the encoder-decoder architecture, the decoder RNN does not receive x directly, but rather φ(x), the features extracted from the input by the encoder RNN. In this case, our SEARNN classifier includes both the encoder and the decoder RNNs.
2. One could also add φ(x)
An Actor-Critic Algorithm for Sequence Prediction, Proceedings of the 5th International Conference on Learning Representations (ICLR), vol.117, p.141, 2017.
Training with Exploration Improves a Greedy Stack-LSTM Parser, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
Convex analysis and monotone operator theory in Hilbert spaces, vol.190, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01517477
Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, 2009.
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Advances in Neural Information Processing Systems 28 (NIPS), 2015.
Parallel and Distributed Computation: Numerical Methods, vol.37, 1989.
Learning reductions that really work, Proceedings of the IEEE, p.212, 2016.
The convergence of a class of double-rank minimization algorithms, IMA Journal of Applied Mathematics, vol.15, 1970.
Leveraging grammar and reinforcement learning for neural program synthesis, Proceedings of the 6th International Conference on Learning Representations (ICLR), vol.141, 2018.
Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes-rendus hebdomadaires des séances de l'Académie des Sciences, p.14, 1847.
Report on the 11th IWSLT evaluation campaign, Proceedings of the International Workshop on Spoken Language Translation (IWSLT), p.131, 2014.
Learning to Search Better than Your Teacher, Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015. (Cited on pages 115, 117, and 147.)
Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), vol.128, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235
A parallel mixture of SVMs for very large scale problems, Neural Computation, vol.185, 2002.
From credit assignment to entropy regularization: two new algorithms for neural sequence prediction, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), vol.142, 2018.
Learning as search optimization: approximate large margin methods for structured prediction, Proceedings of the 22nd International Conference on Machine Learning (ICML), vol.140, 2005.
Search-based structured prediction, Machine Learning, 2009.
The sound of APALM clapping: faster nonsmooth nonconvex optimization with stochastic asynchronous PALM, Advances in Neural Information Processing Systems, vol.29, 2016.
Taming the wild: A unified analysis of Hogwild!-style algorithms, Advances in Neural Information Processing Systems 28 (NIPS), 2015. (Cited on pages 21, 23, and 41.)
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems 27 (NIPS), 2014a. (Cited on pages 3, 17, and 92.)
URL : https://hal.archives-ouvertes.fr/hal-01016843
Finito: A faster, permutable incremental gradient method for big data problems, Proceedings of the 31st International Conference on Machine Learning (ICML), p.18, 2014.
Cold-Start Reinforcement Learning with Softmax Policy Gradient, Advances in Neural Information Processing Systems, vol.30, 2017.
Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care, Advances in Neural Information Processing Systems 28 (NIPS), 2015. (Cited on pages 21, 23, and 33.)
Classical structured prediction losses for sequence to sequence learning, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), vol.143, 2018.
Token-level and sequence-level loss smoothing for RNN language models, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), vol.142, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01790879
A new approach to variable-metric algorithms. The Computer Journal, vol.15, 1970.
Softmax-margin CRFs: Training loglinear models with cost functions, Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), p.126, 2010.
A Dynamic Oracle for Arc-Eager Dependency Parsing, Proceedings of the 24th International Conference on Computational Linguistics (COLING), vol.125, 2012.
A family of variable-metric methods derived by variational means. Mathematics of Computation, vol.15, 1970.
Generative Adversarial Nets, Advances in Neural Information Processing Systems 27 (NIPS), p.143, 2014.
Deep Learning, vol.106, p.109, 2016.
Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), p.215, 2016.
Classes for fast maximum entropy training, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), p.147, 2001.
Supervised sequence labelling with recurrent neural networks, p.109, 2012.
Asynchronous stochastic block coordinate descent with variance reduction, 2016.
More iterations per second, same quality - why asynchronous algorithms may drastically outperform traditional ones, p.19, 2017.
A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction, Advances in Neural Information Processing Systems 23 (NIPS), p.126, 2010.
Long Short-Term Memory, Neural Computation, p.127, 1997.
Variance Reduced Stochastic Gradient Descent with Neighbors, Advances in Neural Information Processing Systems 28 (NIPS), vol.17, p.177, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01248672
PASSCoDe: parallel asynchronous stochastic dual coordinate descent, Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
On Using Very Large Target Vocabulary for Neural Machine Translation, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), vol.130, 2015.
Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems 26 (NIPS), 2013. (Cited on pages 3, 17, and 80.)
Field-aware factorization machines for CTR prediction, Proceedings of the 10th ACM Conference on Recommender Systems, p.93, 2016.
Deep reinforcement learning for sequence to sequence models, p.113, 2018.
Adam: A method for stochastic optimization, Proceedings of the 3rd International Conference on Learning Representations (ICLR), vol.128, 2015.
Lower bounds for reductions, Talk at the Atomic Learning Workshop (TTI-C), p.111, 2006.
A stochastic gradient method with an exponential convergence rate for finite training sets, Advances in Neural Information Processing Systems 25 (NIPS), 2012. (Cited on pages 3, 17, and 69.)
URL : https://hal.archives-ouvertes.fr/hal-00674995
ASAGA: asynchronous parallel SAGA, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01407833
SEARNN: Training RNNs with global-local losses, Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018a. (Cited on pages 9, 115, and 10.)
URL : https://hal.archives-ouvertes.fr/hal-01950555
Improved asynchronous parallel optimization analysis for stochastic incremental methods, Journal of Machine Learning Research, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01950558
Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data, Journal of the American Statistical Association, vol.125, 2004.
RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research, vol.184, 2004.
Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization, Advances in Neural Information Processing Systems 28 (NIPS), 2015.
Asynchronous stochastic coordinate descent: Parallelism and convergence properties, SIAM Journal on Optimization, vol.24, p.207, 2015.
An Asynchronous Parallel Stochastic Coordinate Descent Algorithm, Journal of Machine Learning Research, 2015.
Adding vs. averaging in distributed primal-dual optimization, Proceedings of the 32nd International Conference on Machine Learning (ICML), p.185, 2015.
Identifying suspicious URLs: an application of large-scale online learning, Proceedings of the 26th International Conference on Machine Learning (ICML), vol.184, 2009.
Incremental majorization-minimization optimization with application to large-scale machine learning, SIAM Journal on Optimization, p.18, 2015.
URL : https://hal.archives-ouvertes.fr/hal-00948338
Perturbed iterate analysis for asynchronous stochastic optimization, SIAM Journal on Optimization, vol.204, p.185, 2017.
Asynchronous stochastic proximal optimization algorithms with variance reduction, Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017.
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, Advances in Neural Information Processing Systems 24 (NIPS), vol.40, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00608041
Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm, Advances in Neural Information Processing Systems 27 (NIPS), vol.51, 2014.
Introductory lectures on convex optimization, vol.189, 2004.
Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, p.51, 2012.
Gradient methods for minimizing composite functions, Mathematical Programming, p.219, 2013.
SGD and Hogwild! Convergence without the bounded gradients assumption, Proceedings of the 35th International Conference on Machine Learning (ICML), vol.23, 2018.
Hogwild: A lock-free approach to parallelizing stochastic gradient descent, Advances in Neural Information Processing Systems 24 (NIPS), 2011. (Cited on pages 4, 85, and 91.)
Reward Augmented Maximum Likelihood for Neural Structured Prediction, Advances in Neural Information Processing Systems 29 (NIPS), vol.142, 2016.
Cyclades: Conflict-free Asynchronous Machine Learning, Advances in Neural Information Processing Systems 29 (NIPS), 2016.
Bleu: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), p.131, 2002.
Breaking the nonsmooth barrier: A scalable parallel method for composite optimization, Advances in Neural Information Processing Systems, vol.30, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01638058
ARock: an algorithmic framework for asynchronous parallel coordinate updates, SIAM Journal on Scientific Computing, vol.25, p.220, 2016.
Regularizing neural networks by penalizing confident output distributions, ICLR 2017 Workshop track, vol.135, 2017.
Entropy and Margin Maximization for Structured Output Learning, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, p.126, 2010.
Imagination-Augmented Agents for Deep Reinforcement Learning, Advances in Neural Information Processing Systems, vol.30, p.147, 2017.
Sequence Level Training with Recurrent Neural Networks, Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.
On variance reduction in stochastic gradient descent and its asynchronous variants, Advances in Neural Information Processing Systems 28 (NIPS), 2015.
Stochastic Variance Reduction for Nonconvex Optimization, Proceedings of the 33rd International Conference on Machine Learning (ICML), vol.60, 2016.
Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning, Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.221, 2017.
A Stochastic Approximation Method. The Annals of Mathematical Statistics, vol.16, 1951.
Reinforcement and Imitation Learning via Interactive No-Regret Learning, vol.112, p.141, 2014.
Convergence rate of stochastic gradient with constant step size, UBC Technical Report, vol.39, 2014.
Non-uniform stochastic average gradient method for training conditional random fields, Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), p.52, 2015.
Minimizing finite sums with the stochastic average gradient, 2016.
URL : https://hal.archives-ouvertes.fr/hal-00860051
Proximal stochastic dual coordinate ascent, vol.80, 2012.
Stochastic dual coordinate ascent methods for regularized loss minimization, Journal of Machine Learning Research, vol.18, p.46, 2013.
Conditioning of quasi-Newton methods for function minimization, vol.15, 1970.
Minimum Risk Training for Neural Machine Translation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), p.222, 2016.
Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis, Nature, p.143, 2017.
Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction, Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems 27 (NIPS), vol.107, p.116, 2014.
Rethinking the inception architecture for computer vision, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.142, 2016.
Max-Margin Markov Networks, Advances in Neural Information Processing Systems 16 (NIPS), vol.128, 2003.
Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research, vol.124, 2005.
Show and tell: A neural image caption generator, Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Function optimization using connectionist reinforcement learning algorithms, Connection Science, vol.135, 1991.
Sequence-to-Sequence Learning as Beam-Search Optimization, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
A proximal stochastic gradient method with progressive variance reduction, SIAM Journal on Optimization, vol.80, 2014.
Asynchronous parallel greedy coordinate descent, Advances in Neural Information Processing Systems 29 (NIPS), vol.80, 2016.
Feature engineering and classifier ensemble for KDD cup, KDD Cup, p.93, 2010.
SeqGAN: Sequence generative adversarial nets with policy gradient, Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), p.143, 2017.
Fast Asynchronous parallel stochastic gradient descent, Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016. (Cited on pages 22, 23, and 47.)
Accelerated mini-batch randomized block coordinate descent method, Advances in Neural Information Processing Systems 27 (NIPS), vol.96, 2014.