Deep Multimodal Learning for Vision and Language Processing

Abstract: Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing volume of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on large datasets. Advances in both communities laid the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through real-world applications such as next-generation search engines or AI assistants. In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information. In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between regions of the image. In the last part, we address biases in the modeling. We propose a learning strategy to reduce the language biases commonly present in visual question answering systems.

https://tel.archives-ouvertes.fr/tel-03140942
Contributor: Remi Cadene
Submitted on: Tuesday, February 16, 2021 - 1:38:29 PM
Last modification on: Tuesday, March 23, 2021 - 9:28:03 AM

File

1. mémoire de thèse.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: tel-03140942, version 1

Citation

Rémi Cadène. Deep Multimodal Learning for Vision and Language Processing. Computer Vision and Pattern Recognition [cs.CV]. Sorbonne Université UPMC, 2020. English. ⟨tel-03140942⟩
