
Multimodal and Interactive Models for Visually Grounded Language Learning

Florian Strub
SEQUEL - Sequential Learning, Inria Lille - Nord Europe
CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189
Abstract: While our representation of the world is shaped by our perceptions, our languages, and our interactions, these have traditionally been studied as distinct fields in machine learning. Fortunately, this partitioning has started to open up with the recent advent of deep learning methods, which standardized raw feature extraction across communities. However, multimodal neural architectures are still in their infancy, and deep reinforcement learning is often limited to constrained environments. Yet we ultimately aim to develop large-scale multimodal and interactive models that correctly apprehend the complexity of the world. As a first milestone, this thesis focuses on visually grounded language learning for three reasons: (i) vision and language are both well-studied modalities across different scientific fields; (ii) it builds upon deep learning breakthroughs in natural language processing and computer vision; (iii) the interplay between language and vision has been acknowledged in cognitive science. More precisely, we first design the GuessWhat?! game for assessing visually grounded language understanding of the models: two players collaborate to locate a hidden object in an image by asking a sequence of questions. We then introduce modulation as a novel deep multimodal mechanism, and we show that it successfully fuses visual and linguistic representations by taking advantage of the hierarchical structure of neural networks. Finally, we investigate how reinforcement learning can support visually grounded language learning and cement the underlying multimodal representation. We show that such interactive learning leads to consistent language strategies but gives rise to new research issues.
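The modulation mechanism mentioned in the abstract conditions visual processing on language by scaling and shifting visual feature maps with coefficients predicted from a language representation. The sketch below is a minimal, illustrative NumPy version of that idea; the function names, projection matrices, and dimensions are assumptions for this example, not the thesis's actual implementation.

```python
import numpy as np

def modulate(feature_map, gamma, beta):
    """Feature-wise modulation: scale and shift each visual feature
    channel with coefficients predicted from the language input."""
    # feature_map: (channels, height, width) visual features
    # gamma, beta: (channels,) per-channel conditioning coefficients
    return gamma[:, None, None] * feature_map + beta[:, None, None]

# Toy example: 4 channels of 3x3 visual features, modulated by a
# (hypothetical) language embedding projected to per-channel gamma/beta.
rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 3, 3))      # stand-in for CNN features
lang_embedding = rng.standard_normal(8)      # stand-in for a question encoding
W_gamma = rng.standard_normal((4, 8))        # learned projection (here random)
W_beta = rng.standard_normal((4, 8))

gamma = W_gamma @ lang_embedding
beta = W_beta @ lang_embedding
fused = modulate(visual, gamma, beta)
print(fused.shape)  # (4, 3, 3): same spatial layout, language-conditioned values
```

In a full network, such a modulation layer would be inserted at several depths of the visual pipeline, which is how the mechanism exploits the hierarchical structure of neural networks.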
Contributor: Florian Strub
Submitted on : Saturday, November 21, 2020 - 8:57:43 PM
Last modification on : Friday, January 8, 2021 - 3:28:19 AM
  • HAL Id : tel-03018038, version 1


Florian Strub. Multimodal and Interactive Models for Visually Grounded Language Learning. Neural and Evolutionary Computing [cs.NE]. Université de Lille; École doctorale, ED SPI 074 : Sciences pour l'Ingénieur, 2020. English. ⟨tel-03018038⟩


