
Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data

Vedran Vukotic 1
1 LinkMedia - Creating and exploiting explicit links between multimedia fragments
IRISA-D6 - MEDIA ET INTERACTIONS, Inria Rennes – Bretagne Atlantique
Abstract : This dissertation defends the thesis that deep neural networks are well suited to the analysis of visual, textual, and fused visual-textual content. The work evaluates the ability of deep neural networks to learn multimodal representations automatically, in both unsupervised and supervised settings, and brings the following main contributions:

1) Recurrent neural networks for spoken language understanding (slot filling): different architectures are compared on this task, with the aim of modeling both the input context and the output label dependencies.

2) Action prediction from single images: we propose an architecture that predicts human actions from a single image. The architecture is evaluated on videos, using only one frame as input.

3) Bidirectional multimodal encoders: the main contribution of this thesis is a neural architecture that translates from one modality to the other and conversely, and offers an improved multimodal representation space in which the initially disjoint representations can be translated and fused. This enables improved fusion of multiple modalities. The architecture was extensively studied and evaluated in international benchmarks on the task of video hyperlinking, where it defined the current state of the art.

4) Generative adversarial networks for multimodal fusion: continuing on the topic of multimodal fusion, we evaluate the possibility of using conditional generative adversarial networks to learn multimodal representations. In addition to providing multimodal representations, generative adversarial networks make it possible to visualize the learned model directly in the image domain.
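As a rough illustration of the bidirectional encoder idea in contribution 3, the sketch below projects a visual vector and a textual vector into a shared space through two cross-modal translation directions, then fuses them by concatenating the two hidden activations. This is only a minimal sketch: all names, dimensions, and the single-layer tanh networks are hypothetical stand-ins, and the weights here are random and untrained, whereas the architecture studied in the thesis learns its translations.

```python
import numpy as np

def tanh_layer(x, W, b):
    # One fully connected layer with a tanh non-linearity.
    return np.tanh(x @ W + b)

rng = np.random.default_rng(0)
dim_v, dim_t, dim_h = 128, 64, 32  # visual, textual, shared hidden dims (hypothetical)

# Translation direction 1: visual -> shared hidden -> textual.
W_v2h, b_v2h = rng.normal(scale=0.1, size=(dim_v, dim_h)), np.zeros(dim_h)
W_h2t, b_h2t = rng.normal(scale=0.1, size=(dim_h, dim_t)), np.zeros(dim_t)
# Translation direction 2: textual -> shared hidden -> visual.
W_t2h, b_t2h = rng.normal(scale=0.1, size=(dim_t, dim_h)), np.zeros(dim_h)
W_h2v, b_h2v = rng.normal(scale=0.1, size=(dim_h, dim_v)), np.zeros(dim_v)

def translate_v_to_t(v):
    # Cross-modal translation: a visual vector mapped into the textual space.
    return tanh_layer(tanh_layer(v, W_v2h, b_v2h), W_h2t, b_h2t)

def multimodal_embedding(v, t):
    """Fuse a visual vector v and a textual vector t by concatenating the
    hidden-layer activations of the two translation directions."""
    h_from_v = tanh_layer(v, W_v2h, b_v2h)  # visual projected into the shared space
    h_from_t = tanh_layer(t, W_t2h, b_t2h)  # textual projected into the shared space
    return np.concatenate([h_from_v, h_from_t])

v = rng.normal(size=dim_v)  # stand-in visual feature (e.g. a CNN descriptor)
t = rng.normal(size=dim_t)  # stand-in textual feature (e.g. averaged word embeddings)
emb = multimodal_embedding(v, t)
print(emb.shape)            # fused representation of size 2 * dim_h
```

The fused vector can then be compared with cosine similarity, which is how cross-modal retrieval tasks such as video hyperlinking typically rank candidate targets.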

Cited literature [141 references]

Contributor : Abes Star
Submitted on : Wednesday, December 13, 2017 - 11:56:07 AM
Last modification on : Friday, April 8, 2022 - 4:08:03 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01629669, version 2


Vedran Vukotic. Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data. Artificial Intelligence [cs.AI]. INSA de Rennes, 2017. English. ⟨NNT : 2017ISAR0015⟩. ⟨tel-01629669v2⟩


