

R. Abu-Zhaya, A. Seidl, R. Tincoff, and A. Cristia, Building a multimodal lexicon: Lessons from infants' learning of body part words, Proc. GLU 2017 International Workshop on Grounding Language Understanding, pp.18-21, 2017.

W. Aransa, H. Schwenk, and L. Barrault, Improving continuous space language models using auxiliary features, Proceedings of the 12th International Workshop on Spoken Language Translation. Da Nang, pp.151-158, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01454941

D. Arpit and Y. Bengio, The benefits of over-parameterization at initialization in deep ReLU networks, 2019.

H. S. Arslan, M. Fishel, and G. Anbarjafari, Doubly attentive transformer machine translation, 2018.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

T. Baltrusaitis, C. Ahuja, and L. Morency, Multimodal machine learning: A survey and taxonomy, 2017.

L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott et al., Findings of the third shared task on multimodal machine translation, Proceedings of the Third Conference on Machine Translation, vol.2, pp.308-327, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02008843

R. Bawden, Going beyond the sentence: Contextual Machine Translation of Dialogue. PhD thesis, 2018.
URL : https://hal.archives-ouvertes.fr/tel-02004683

A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, Automatic differentiation in machine learning: A survey, Journal of Machine Learning Research, vol.18, issue.1, pp.5595-5637, 2017.

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, Advances in Neural Information Processing Systems, vol.28, pp.1171-1179, 2015.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, Journal of machine learning research, vol.3, pp.1137-1155, 2003.

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.

S. Bergsma and B. Van Durme, Learning bilingual lexicons using the visual similarity of labeled web images, Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, pp.1764-1769, 2011.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics, vol.5, issue.1, pp.135-146, 2017.

O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow et al., Findings of the 2018 conference on machine translation (WMT18), Proceedings of the Third Conference on Machine Translation, vol.2, pp.272-307, 2018.

T. Bolukbasi, K. Chang, J. Zou, V. Saligrama, and A. Kalai, Man is to computer programmer as woman is to homemaker? debiasing word embeddings, Proceedings of the 30th International Conference on Neural Information Processing Systems, pp.4356-4364, 2016.

N. Boulanger-lewandowski, Y. Bengio, and P. Vincent, Audio chord recognition with recurrent neural networks, Proceedings of the 14th International Society for Music Information Retrieval Conference, pp.335-340, 2013.

D. Britz, A. Goldie, M. Luong, and Q. Le, Massive exploration of neural machine translation architectures, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.1442-1451, 2017.

F. Burlot, M. García-martínez, L. Barrault, F. Bougares, and F. Yvon, Word representations in factored neural machine translation, Proceedings of the Second Conference on Machine Translation, pp.20-31, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01618384

O. Caglayan, W. Aransa, A. Bardet, M. García-martínez, F. Bougares et al., LIUM-CVC submissions for WMT17 multimodal translation task, Proceedings of the Second Conference on Machine Translation, vol.2, pp.432-439, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01742382

O. Caglayan, W. Aransa, Y. Wang, M. Masana, M. García-martínez et al., Does multimodality help human and machine for translation and image captioning?, Proceedings of the First Conference on Machine Translation, pp.627-633, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01433183

O. Caglayan, A. Bardet, F. Bougares, L. Barrault, K. Wang et al., LIUM-CVC submissions for WMT18 multimodal translation task, Proceedings of the Third Conference on Machine Translation, pp.603-608, 2018.

O. Caglayan, L. Barrault, and F. Bougares, Multimodal attention for neural machine translation, 2016.

O. Caglayan, M. García-martínez, A. Bardet, W. Aransa, F. Bougares et al., NMTPY: A flexible toolkit for advanced neural machine translation systems, Prague Bull. Math. Linguistics, vol.109, pp.15-28, 2017.

O. Caglayan, P. Madhyastha, L. Specia, and L. Barrault, Probing the need for visual context in multimodal machine translation, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.4159-4170, 2019.

O. Caglayan, R. Sanabria, S. Palaskar, L. Barrault, and F. Metze, Multimodal grounding for sequence-to-sequence speech recognition, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.

A. Caliskan, J. J. Bryson, and A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science, vol.356, issue.6334, pp.183-186, 2017.

I. Calixto, K. Dutta Chowdhury, and Q. Liu, DCU system report on the WMT 2017 multi-modal machine translation task, Shared Task Papers. Association for Computational Linguistics, vol.2, pp.440-444, 2017.

I. Calixto, D. Elliott, and S. Frank, DCU-UvA multimodal MT system report, Proceedings of the First Conference on Machine Translation, pp.634-638, 2016.

I. Calixto and Q. Liu, Incorporating global visual features into attention-based neural machine translation, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.992-1003, 2017.

I. Calixto, Q. Liu, and N. Campbell, Doubly-attentive decoder for multimodal neural machine translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1913-1924, 2017.

R. Caruana, Multitask learning, Machine Learning, vol.28, issue.1, pp.41-75, 1997.

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4960-4964, 2016.

K. Cho and M. Esipova, Can neural machine translation do simultaneous translation?, 2016.

K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp.103-111, 2014.

K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1724-1734, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

G. Chrupała, Á. Kádár, and A. Alishahi, Learning language through pictures, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol.2, pp.112-118, 2015.

J. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, Lip reading sentences in the wild, 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp.3444-3453, 2017.

J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.

J. H. Clark, C. Dyer, A. Lavie, and N. A. Smith, Better hypothesis testing for statistical machine translation: Controlling for optimizer instability, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, vol.2, pp.176-181, 2011.

D. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), 2015.

F. B. Colavita, Human sensory dominance, Perception & Psychophysics, vol.16, issue.2, pp.409-412, 1974.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, vol.2, issue.4, pp.303-314, 1989.

F. Dalvi, N. Durrani, H. Sajjad, and S. Vogel, Incremental decoding and training methods for simultaneous translation in neural machine translation, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.2, 2018.

E. Delavenay and K. M. Delavenay, An Introduction to Machine Translation, Thames and Hudson, London, 1960.

J. Delbrouck and S. Dupont, Modulating and attending the source image during encoding improves multimodal translation, 2017.

J. Delbrouck and S. Dupont, Multimodal compact bilinear pooling for multimodal neural machine translation, 2017.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.248-255, 2009.

M. Denkowski and A. Lavie, Meteor universal: Language specific translation evaluation for any target language, Proceedings of the Ninth Workshop on Statistical Machine Translation, pp.376-380, 2014.

J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz et al., Fast and robust neural network joint models for statistical machine translation, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1370-1380, 2014.

D. Dong, H. Wu, W. He, D. Yu, and H. Wang, Multi-task learning for multiple language translation, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol.1, pp.1723-1732, 2015.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol.12, pp.2121-2159, 2011.

J. Duselis, M. Hu, J. Gwinnup, J. Davis, and J. Sandvick, AFRL-OSU WMT17 multimodal translation system: An image processing approach, Proceedings of the Second Conference on Machine Translation, vol.2, pp.445-449, 2017.

D. Elliott, Adversarial evaluation of multimodal machine translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.2974-2978, 2018.

D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia, Findings of the second shared task on multimodal machine translation and multilingual image description, Proceedings of the Second Conference on Machine Translation, vol.2, pp.215-233, 2017.

D. Elliott, S. Frank, K. Sima'an, and L. Specia, Multi30K: Multilingual English-German image descriptions, Proceedings of the 5th Workshop on Vision and Language, pp.70-74, 2016.

D. Elliott and Á. Kádár, Imagination improves multimodal translation, Proceedings of the Eighth International Joint Conference on Natural Language Processing, vol.1, pp.130-141, 2017.

J. L. Elman, Finding structure in time, Cognitive Science, vol.14, issue.2, pp.179-211, 1990.

M. O. Ernst and M. S. Banks, Humans integrate visual and haptic information in a statistically optimal fashion, Nature, vol.415, p.429, 2002.

O. Firat, K. Cho, B. Sankaran, F. T. Yarman Vural, and Y. Bengio, Multi-way, multilingual neural machine translation, Comput. Speech Lang., vol.45, pp.236-252, 2017.

J. R. Firth, A synopsis of linguistic theory 1930-1955, Studies in Linguistic Analysis, 1957.

S. Frank, D. Elliott, and L. Specia, Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices, Natural Language Engineering, vol.24, issue.3, pp.393-413, 2018.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.457-468, 2016.

K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, vol.36, issue.4, pp.193-202, 1980.

Y. Gal and Z. Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, Proceedings of the 30th International Conference on Neural Information Processing Systems, pp.1027-1035, 2016.

M. García-martínez, O. Caglayan, W. Aransa, and A. Bardet, LIUM machine translation systems for WMT17 news translation task, Proceedings of the Second Conference on Machine Translation, pp.288-295, 2017.

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, Convolutional sequence to sequence learning, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.1243-1252, 2017.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. PMLR, vol.9, pp.249-256, 2010.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.

Y. Graham and T. Baldwin, Can machine translation systems be evaluated by the crowd alone, Natural Language Engineering, vol.23, issue.1, pp.3-30, 2017.

A. Graves, Generating sequences with recurrent neural networks, 2013.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, 2015.

S. Grönroos, B. Huet, M. Kurimo, J. Laaksonen, B. Merialdo, P. Pham, M. Sjöberg, U. Sulubacak, J. Tiedemann, R. Troncy, and R. Vázquez, The MeMAD submission to the WMT18 multimodal translation task, Proceedings of the Third Conference on Machine Translation. Association for Computational Linguistics, pp.609-617, 2018.

J. Gu, G. Neubig, K. Cho, and V. O. K. Li, Learning to translate in real-time with neural machine translation, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol.1, pp.1053-1062, 2017.

C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, Noisy activation functions, Proceedings of The 33rd International Conference on Machine Learning, vol.48, pp.3059-3068, 2016.

J. Gwinnup, J. Sandvick, M. Hu, G. Erdmann, J. Duselis et al., AFRL-Ohio State WMT18 multimodal system: Combining visual with traditional, Proceedings of the Third Conference on Machine Translation, pp.618-621, 2018.

T. Ha, J. Niehues, and A. Waibel, Toward multilingual neural machine translation with universal encoder and decoder, Proceedings of the 13th International Workshop on Spoken Language Translation, 2016.

Z. S. Harris, Distributional structure, WORD, vol.10, pp.146-162, 1954.

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015 IEEE International Conference on Computer Vision (ICCV), pp.1026-1034, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.

J. Helcl and J. Libovický, CUNI system for the WMT17 multimodal translation task, Proceedings of the Second Conference on Machine Translation, vol.2, pp.450-457, 2017.

J. Helcl, J. Libovický, and D. Varis, CUNI system for the WMT18 multimodal translation task, Proceedings of the Third Conference on Machine Translation, pp.622-629, 2018.

S. Hochreiter, Recurrent neural net learning and vanishing gradient, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol.6, issue.2, pp.107-116, 1998.

S. Hochreiter and J. Schmidhuber, Long Short-term Memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft et al., Snapshot ensembles: Train 1, get M for free, International Conference on Learning Representations, 2017.

G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger, Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

P. Huang, F. Liu, S. Shiang, J. Oh, and C. Dyer, Attention-based multimodal neural machine translation, Proceedings of the First Conference on Machine Translation, pp.639-645, 2016.

D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, Journal of Physiology, vol.160, pp.106-154, 1962.

H. Inan, K. Khosravi, and R. Socher, Tying word vectors and word classifiers: A loss framework for language modeling, 2016.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of The 32nd International Conference on Machine Learning, pp.448-456, 2015.

P. Isabelle, C. Cherry, and G. Foster, A challenge set approach to evaluating machine translation, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.2486-2496, 2017.

J. M. Iverson and S. Goldin-Meadow, Gesture paves the way for language development, Psychological Science, vol.16, issue.5, pp.367-371, 2005.

M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu et al., Google's multilingual neural machine translation system: Enabling zero-shot translation, 2016.

R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol.37, pp.2342-2350, 2015.

N. Kalchbrenner and P. Blunsom, Recurrent continuous translation models, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp.1700-1709, 2013.

V. Kazemi and A. Elqursh, Show, ask, attend, and answer: A strong baseline for visual question answering, 2017.

D. Kiela, I. Vulić, and S. Clark, Visual bilingual lexicon induction with transferred convnet features, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.148-158, 2015.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

R. Kiros, R. Salakhutdinov, and R. S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, 2014.

G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, Self-normalizing neural networks, Advances in Neural Information Processing Systems, vol.30, pp.971-980, 2017.

R. Kneser and H. Ney, Improved backing-off for M-gram language modeling, 1995 International Conference on Acoustics, Speech, and Signal Processing, vol.1, pp.181-184, 1995.

P. Koehn, Statistical Machine Translation, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01433972

P. Koehn, H. Hoang, A. Birch, C. Callison-burch, M. Federico et al., Moses: Open source toolkit for statistical machine translation, Meeting of the Association for Computational Linguistics, pp.177-180, 2007.

P. Koehn and R. Knowles, Six challenges for neural machine translation, Proceedings of the First Workshop on Neural Machine Translation, pp.28-39, 2017.

P. Koehn, F. J. Och, and D. Marcu, Statistical Phrase-based Translation, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol.1, pp.48-54, 2003.

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25, pp.1097-1105, 2012.

A. Krogh and J. A. Hertz, A simple weight decay can improve generalization, Advances in Neural Information Processing Systems, vol.4, pp.950-957, 1992.

T. Kudo and J. Richardson, Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.66-71, 2018.

C. Lala, P. S. Madhyastha, C. Scarton, and L. Specia, Sheffield submissions for WMT18 multimodal translation shared task, Proceedings of the Third Conference on Machine Translation. Association for Computational Linguistics, pp.630-637, 2018.

A. Lavie and A. Agarwal, Meteor: An automatic metric for MT evaluation with high levels of correlation with human judgments, Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, pp.228-231, 2007.

Y. Lecun, A theoretical framework for back-propagation, Proceedings of the 1988 Connectionist Models Summer School, pp.21-28, 1988.

Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol.521, issue.7553, pp.436-444, 2015.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.86, issue.11, pp.2278-2324, 1998.

J. Libovický and J. Helcl, Attention strategies for multi-source sequence-to-sequence learning, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.2, pp.196-202, 2017.

J. Libovický, S. Palaskar, S. Gella, and F. Metze, Multimodal abstractive summarization of open-domain videos, NeurIPS Workshop on Visually Grounded Interaction and Language (ViGIL), 2018.

J. Libovický, J. Helcl, M. Tlustý, O. Bojar, and P. Pecina, CUNI system for WMT16 automatic post-editing and multimodal translation tasks, Proceedings of the First Conference on Machine Translation. Association for Computational Linguistics, pp.646-654, 2016.

J. Libovický, J. Helcl, and D. Mareček, Input combination strategies for multi-source transformer decoder, Proceedings of the Third Conference on Machine Translation: Research Papers, pp.253-260, 2018.

I. Loshchilov and F. Hutter, Decoupled weight decay regularization, International Conference on Learning Representations, 2019.

M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, Multi-task sequence to sequence learning, International Conference on Learning Representations, 2016.

M. Luong, H. Pham, and C. Manning, Effective approaches to attention-based neural machine translation, 2015.

M. Ma, D. Li, K. Zhao, and L. Huang, OSU multimodal machine translation system report, Proceedings of the Second Conference on Machine Translation, vol.2, pp.465-469, 2017.

P. S. Madhyastha, J. Wang, and L. Specia, Sheffield MultiMT: Using object posterior predictions for multimodal machine translation, Proceedings of the Second Conference on Machine Translation, vol.2, pp.470-476, 2017.

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang et al., Deep captioning with multimodal recurrent neural networks (m-rnn), International Conference on Learning Representations, 2015.

J. Martens, Deep learning via hessian-free optimization, Proceedings of the 27th International Conference on International Conference on Machine Learning, pp.735-742, 2010.

W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics, vol.5, issue.4, pp.115-133, 1943.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, 2013.

T. Mikolov, M. Karafiát, and L. Burget, Recurrent neural network based language model, Interspeech, 2010.

A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser et al., Adding gradient noise improves learning for very deep networks, 2015.

E. H. Nyberg and T. Mitamura, The KANT system: Fast, accurate, high-quality translation in practical domains, Proceedings of the 14th Conference on Computational Linguistics, vol.3, pp.1069-1073, 1992.

F. Och, Minimum Error Rate Training in Statistical Machine Translation, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol.1, pp.160-167, 2003.

K. Papineni, S. Roukos, T. Ward, and W. Zhu, Bleu: A method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp.311-318, 2002.

R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio, How to construct deep recurrent neural networks, International Conference on Learning Representations (ICLR), 2014.

R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, Proceedings of the 30th International Conference on Machine Learning. PMLR, vol.28, pp.1310-1318, 2013.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang et al., Automatic differentiation in PyTorch, NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.

J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp.1532-1543, 2014.

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier et al., Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2015 IEEE International Conference on Computer Vision (ICCV), pp.2641-2649, 2015.

M. O. R. Prates, P. H. Avelar, and L. C. Lamb, Assessing gender bias in machine translation: A case study with Google Translate, 2019.

O. Press and L. Wolf, Using the output embedding to improve language models, 2016.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.512-519, 2014.

S. J. Reddi, S. Kale, and S. Kumar, On the convergence of Adam and beyond, International Conference on Learning Representations, 2018.

A. Rios Gonzales, L. Mascarell, and R. Sennrich, Improving word sense disambiguation in neural machine translation with sense embeddings, Proceedings of the Second Conference on Machine Translation, vol.1, pp.11-19, 2017.

F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review, vol.65, pp.386-408, 1958.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol.323, p.533, 1986.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV), vol.115, issue.3, pp.211-252, 2015.

R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault et al., How2: A large-scale dataset for multimodal language understanding, Proceedings of the Workshop on Visually Grounded Interaction and Language, 2018.

R. Sanabria, S. Palaskar, and F. Metze, CMU Sinbad's submission for the DSTC7 AVSD challenge, DSTC7 at AAAI2019 workshop, 2019.

A. M. Saxe, J. L. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013.

J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks, vol.61, pp.85-117, 2015.

M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, vol.45, issue.11, pp.2673-2681, 1997.

H. Schwenk, Continuous space language models for statistical machine translation, Prague Bulletin of Mathematical Linguistics, issue.93, pp.137-146, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01433882

H. Schwenk, Continuous space translation models for phrase-based statistical machine translation, Proceedings of COLING 2012: Posters. e COLING, pp.1071-1080, 2012.

S. Semeniuta, A. Severyn, and E. Barth, Recurrent dropout without memory loss, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, pp.1757-1766, 2016.

R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler, M. Junczys-Dowmunt, S. Läubli, A. V. Miceli Barone, J. Mokry, and M. Nadejde, Nematus: A toolkit for neural machine translation, Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp.65-68, 2017.

R. Sennrich, B. Haddow, and A. Birch, Neural machine translation of rare words with subword units, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1715-1725, 2016.

K. Shah, J. Wang, and L. Specia, SHEF-Multimodal: Grounding machine translation on images, Proceedings of the First Conference on Machine Translation, pp.660-665, 2016.

C. Silberer and M. Lapata, Grounded models of semantic representation, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pp.1423-1433, 2012.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

X. Song, T. Cohn, and L. Specia, BLEU deconstructed: Designing a better MT evaluation metric, International Journal of Computational Linguistics and Applications, vol.4, issue.2, p.29, 2013.

L. Specia, S. Frank, K. Sima'an, and D. Elliott, A shared task on multimodal machine translation and crosslingual image description, Proceedings of the First Conference on Machine Translation. Association for Computational Linguistics, pp.543-553, 2016.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, issue.1, pp.1929-1958, 2014.

B. E. Stein, T. R. Stanford, and B. A. Rowland, The neural basis of multisensory integration in the midbrain: Its organization and maturation, Multisensory integration in auditory and auditory-related areas of cortex, vol.258, pp.4-15, 2009.

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, Proceedings of the 27th International Conference on Neural Information Processing Systems, pp.3104-3112, 2014.

The Theano Development Team, Theano: A Python framework for fast computation of mathematical expressions, 2016.

J. Tiedemann and Y. Scherrer, Neural machine translation with extended context, Proceedings of the Third Workshop on Discourse in Machine Translation, pp.82-92, 2017.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in Neural Information Processing Systems, vol.30, pp.5998-6008, 2017.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3156-3164, 2015.

E. Voita, P. Serdyukov, R. Sennrich, and I. Titov, Context-aware neural machine translation learns anaphora resolution, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1264-1274, 2018.

K. Vythelingum, Y. Estève, and O. Rosec, Acoustic-dependent phonemic transcription for text-to-speech synthesis, Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, pp.2489-2493, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01870866

P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, 1974.

P. J. Werbos, Applications of advances in nonlinear sensitivity analysis, System Modeling and Optimization, pp.762-770, 1982.

B. Xu, N. Wang, T. Chen, and M. Li, Empirical evaluation of rectified activations in convolutional network, 2015.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the 32nd International Conference on Machine Learning (ICML-15). JMLR Workshop and Conference Proceedings, pp.2048-2057, 2015.

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, vol.2, pp.67-78, 2014.

Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, The IEEE International Conference on Computer Vision (ICCV), 2017.

M. D. Zeiler, Adadelta: An adaptive learning rate method, 2012.

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, pp.818-833, 2014.

J. Zhang, M. Utiyama, E. Sumita, G. Neubig, and S. Nakamura, NICT-NAIST system for WMT17 multimodal translation task, Shared Task Papers. Association for Computational Linguistics, vol.2, pp.477-482, 2017.

R. Zheng, Y. Yang, M. Ma, and L. Huang, Ensemble sequence level training for multimodal MT: OSU-Baidu WMT18 multimodal machine translation system report, Proceedings of the Third Conference on Machine Translation, pp.638-642, 2018.

M. Zhou, R. Cheng, Y. J. Lee, and Z. Yu, A visual attention grounding neural model for multimodal machine translation, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.3643-3653, 2018.

B. Zoph and K. Knight, Multi-source neural translation, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.30-34, 2016.