
Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data

Vedran Vukotic 1
1 LinkMedia - Creating and exploiting explicit links between multimedia fragments
Inria Rennes – Bretagne Atlantique, IRISA-D6 - MEDIA ET INTERACTIONS
Abstract : This dissertation discusses the thesis that deep neural networks are well suited to the analysis of visual, textual, and fused visual and textual content. The work evaluates the ability of deep neural networks to automatically learn multimodal representations in either an unsupervised or a supervised manner, and makes the following main contributions:
1) Recurrent neural networks for spoken language understanding (slot filling): different architectures are compared for this task with the aim of modeling both the input context and the output label dependencies.
2) Action prediction from single images: we propose an architecture that allows us to predict human actions from a single image. The architecture is evaluated on videos, using only one frame as input.
3) Bidirectional multimodal encoders: the main contribution of this thesis is a neural architecture that translates from one modality to the other and back, and offers an improved multimodal representation space in which the initially disjoint representations can be translated and fused. This enables improved fusion of multiple modalities. The architecture was extensively studied and evaluated in international benchmarks on the task of video hyperlinking, where it currently defines the state of the art.
4) Generative adversarial networks for multimodal fusion: continuing on the topic of multimodal fusion, we evaluate the possibility of using conditional generative adversarial networks to learn multimodal representations. In addition to providing multimodal representations, generative adversarial networks make it possible to visualize the learned model directly in the image domain.

Cited literature [141 references]
Contributor : Abes Star
Submitted on : Wednesday, December 13, 2017 - 11:56:07 AM
Last modification on : Wednesday, October 27, 2021 - 7:06:55 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01629669, version 2


Vedran Vukotic. Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data. Artificial Intelligence [cs.AI]. INSA de Rennes, 2017. English. ⟨NNT : 2017ISAR0015⟩. ⟨tel-01629669v2⟩


