
Robust and comprehensive joint image-text representations

Abstract: This thesis investigates the joint modeling of the visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A widely adopted solution is a common representation space, obtained for example by Kernel Canonical Correlation Analysis (KCCA), on which images and text can both be represented and directly compared. Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet highly significant for a retrieval task. The second is a separation between modalities on the common space, which leads to coarse cross-modal matching.

To deal with the first limitation, we put forward a model that first identifies poorly represented information and then combines it with data that is relatively well represented on the joint space. Evaluations on "text illustration" tasks show that appropriately identifying and exploiting such information strongly improves cross-modal retrieval results. The major work in this thesis addresses the separation between modalities on the joint space in order to enhance the performance of cross-modal tasks. We propose two representation methods, for bi-modal or uni-modal documents, that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process that relies on an auxiliary dataset to find the corresponding information in the missing modality, then uses this information to build a final bi-modal representation of the uni-modal document.

Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval and for bi-modal and cross-modal classification.

Cited literature: 127 references
Contributor: ABES STAR
Submitted on: Thursday, September 21, 2017 - 4:07:06 PM
Last modification on: Wednesday, October 14, 2020 - 3:59:20 AM


Version validated by the jury (STAR)


  • HAL Id: tel-01591614, version 1



Thi Quynh Nhi Tran. Robust and comprehensive joint image-text representations. Image Processing [eess.IV]. Conservatoire national des arts et métiers - CNAM, 2017. English. ⟨NNT : 2017CNAM1096⟩. ⟨tel-01591614⟩


