
Semantic representations of images and videos

Abstract : Recent research in Deep Learning has sharply improved the quality of results in multimedia tasks: thanks to large new datasets of annotated images and videos, Deep Neural Networks (DNN) have outperformed other models in most cases. In this thesis, we aim to develop DNN models for automatically deriving semantic representations of images and videos. In particular, we focus on two main tasks: vision-text matching and automatic image/video captioning. The matching task can be addressed by comparing visual objects and texts in a visual space, a textual space or a multimodal space. Building on recent work on capsule networks, we define two novel models to address the vision-text matching problem: Recurrent Capsule Networks and Gated Recurrent Capsules. Image and video captioning is a challenging task in which a visual object has to be analyzed and translated into a textual description in natural language. For that purpose, we propose two novel curriculum learning methods. Moreover, video captioning requires not only parsing still images but also drawing correspondences through time. We propose a novel Learned Spatio-Temporal Adaptive Pooling method for video captioning that combines spatial and temporal analysis. Extensive experiments on standard datasets assess the value of our models and methods relative to existing work.

Contributor : Abes Star
Submitted on : Tuesday, September 28, 2021 - 10:04:28 AM
Last modification on : Tuesday, November 16, 2021 - 5:13:18 AM
Long-term archiving on : Wednesday, December 29, 2021 - 6:14:04 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03356457, version 1


Danny Francis. Semantic representations of images and videos. Artificial Intelligence [cs.AI]. Sorbonne Université, 2019. English. ⟨NNT : 2019SORUS605⟩. ⟨tel-03356457⟩


