U. Ahsan and I. Essa, Clustering Social Event Images Using Kernel Canonical Correlation Analysis, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.800-805, 2014.
DOI : 10.1109/CVPRW.2014.124

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.683.6784

A. Amir, J. O. Argill, M. Berg, S. Fu-chang, M. Franz et al., IBM Research TRECVID-2004 video retrieval system, Proc. of TREC Video Retrieval Evaluation. Publications, 2004.

G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, Deep canonical correlation analysis, ICML (3), pp.1247-1255, 2013.

S. Avila, N. Thome, M. Cord, E. Valle, and A. D. Araújo, Pooling in image representation: The visual codeword point of view, Computer Vision and Image Understanding, vol.117, issue.5, pp.453-465, 2013.
DOI : 10.1016/j.cviu.2012.09.007

URL : https://hal.archives-ouvertes.fr/hal-01172709

D. M. Blei and M. I. Jordan, Modeling annotated data, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03, pp.127-134, 2003.
DOI : 10.1145/860435.860460

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.6686

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, vol.3, pp.993-1022, 2003.

A. Bosch, A. Zisserman, and X. Munoz, Representing shape with a spatial pyramid kernel, Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR '07, pp.401-408, 2007.
DOI : 10.1145/1282280.1282340

L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pp.177-187, 2010.
DOI : 10.1201/b11429-4

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.419.462

Y. Boureau, J. Ponce, and Y. LeCun, A theoretical analysis of feature pooling in visual recognition, ICML, 2010.

E. Bruni, N. K. Tran, and M. Baroni, Multimodal distributional semantics, J, 2014.

I. Chami, Représentation commune des textes et des images, 2016.

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, Proceedings of the British Machine Vision Conference 2011, 2011.
DOI : 10.5244/C.25.76

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint, 2014.

X. Chen, Y. Mu, S. Yan, and T.-S. Chua, Efficient large-scale image annotation by probabilistic collaborative multi-label propagation, Proceedings of the international conference on Multimedia, MM '10, pp.35-44, 2010.
DOI : 10.1145/1873951.1873959

X. Chen and A. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, Advances in Neural Information Processing Systems (NIPS), 2014.

X. Chen and L. C. Zitnick, Mind's eye: A recurrent visual representation for image caption generation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298856

F. Coelho and C. Ribeiro, Automatic illustration with cross-media retrieval in large-scale collections, 2011 9th International Workshop on Content-Based Multimedia Indexing (CBMI), pp.25-30, 2011.
DOI : 10.1109/CBMI.2011.5972515

J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. Lanckriet et al., On the role of correlation and abstraction in cross-modal multimedia retrieval, TPAMI, vol.36, issue.3, pp.521-535, 2014.

G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, Workshop on Statistical Learning in Computer Vision, ECCV, pp.1-22, 2004.

R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image retrieval, ACM Computing Surveys, vol.40, issue.2, p.5, 2008.
DOI : 10.1145/1348246.1348248

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol.41, issue.6, pp.391-407, 1990.
DOI : 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8490

D. Delgado, J. Magalhaes, and N. Correia, Assisted news reading with automated illustration, Proceedings of the international conference on Multimedia, MM '10, pp.1647-1650, 2010.
DOI : 10.1145/1873951.1874311

J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang et al., Subcategory-Aware Object Classification, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.827-834, 2013.
DOI : 10.1109/CVPR.2013.112

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.675.8498

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas et al., FlowNet: Learning Optical Flow with Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.316

URL : http://arxiv.org/pdf/1504.06852

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.88, issue.2, pp.303-338, 2010.
DOI : 10.1007/s11263-009-0275-4

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.167.6629

F. Feng, X. Wang, and R. Li, Cross-modal retrieval with correspondence autoencoder, Proceedings of the 22nd ACM international conference on Multimedia, pp.7-16, 2014.
DOI : 10.1145/2647868.2654902

Y. Feng and M. Lapata, Topic models for image annotation and text illustration, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.831-839, 2010.

P. Fischer, A. Dosovitskiy, and T. Brox, Descriptor matching with convolutional neural networks: a comparison to SIFT, 2014.

E. Gabrilovich and S. Markovitch, Computing semantic relatedness using Wikipedia-based explicit semantic analysis, IJCAI, pp.1606-1611, 2007.

R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

URL : http://arxiv.org/abs/1311.2524

Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics, International Journal of Computer Vision, vol.106, issue.2, pp.210-233, 2014.
DOI : 10.1007/s11263-013-0658-4

URL : http://arxiv.org/pdf/1212.4522

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Computation, vol.16, issue.12, pp.2639-2664, 2004.
DOI : 10.1162/0899766042321814

URL : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6452&rep=rep1&type=pdf

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, IEEE International Conference on Computer Vision, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, issue.9, pp.1904-1916, 2015.
DOI : 10.1109/TPAMI.2015.2389824

URL : http://arxiv.org/pdf/1406.4729

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90

URL : http://arxiv.org/pdf/1512.03385

M. Hodosh, P. Young, and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research, vol.47, pp.853-899, 2013.

T. Hofmann, Probabilistic latent semantic indexing, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp.50-57, 1999.
DOI : 10.1145/3130348.3130370

T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning, vol.42, issue.1/2, pp.177-196, 2001.
DOI : 10.1023/A:1007617005950

H. Hotelling, Relations between two sets of variates, Biometrika, vol.28, issue.3-4, pp.321-377, 1936.
DOI : 10.1093/biomet/28.3-4.321

Y. Huang, Z. Wu, L. Wang, and T. Tan, Feature Coding in Image Classification: A Comprehensive Study, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, issue.3, pp.493-506, 2014.
DOI : 10.1109/TPAMI.2013.113

S. J. Hwang and K. Grauman, Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search, International Journal of Computer Vision, vol.100, issue.2, pp.134-153, 2012.
DOI : 10.1023/A:1023052124951

S. J. Hwang and K. Grauman, Reading between the lines: Object localization using implicit cues from image tags, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.1145-1158, 2012.
DOI : 10.1109/CVPR.2010.5540043

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.167.4308

S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Proceedings of the 32nd International Conference on Machine Learning JMLR Proceedings, pp.448-456, 2015.

H. Jégou, M. Douze, C. Schmid, and P. Pérez, Aggregating local descriptors into a compact image representation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3304-3311, 2010.
DOI : 10.1109/CVPR.2010.5540039

Y. Jia, M. Salzmann, and T. Darrell, Learning cross-modality similarity for multinomial data, 2011 International Conference on Computer Vision, pp.2407-2414, 2011.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, pp.675-678, 2014.
DOI : 10.1145/2647868.2654889

A. Joly and O. Buisson, Random maximum margin hashing, CVPR 2011, pp.20-25, 2011.
DOI : 10.1109/CVPR.2011.5995709

URL : https://hal.archives-ouvertes.fr/hal-00642178

D. Joshi, J. Z. Wang, and J. Li, The story picturing engine, Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval, MIR '04, pp.119-126, 2004.
DOI : 10.1145/1026711.1026732

D. Joshi, J. Z. Wang, and J. Li, The Story Picturing Engine---a system for automatic text illustration, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), pp.68-89, 2006.
DOI : 10.1145/1126004.1126008

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3128-3137, 2015.
DOI : 10.1109/tpami.2016.2598339

URL : http://arxiv.org/pdf/1412.2306

A. Karpathy, A. Joulin, and L. Fei-Fei, Deep fragment embeddings for bidirectional image sentence mapping, Advances in neural information processing systems, pp.1889-1897, 2014.

B. Klein, G. Lev, G. Sadeh, and L. Wolf, Associating neural word embeddings with deep image representations using Fisher Vectors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299073

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting, pp.1106-1114, 2012.
DOI : 10.1162/neco.2009.10-08-881

S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR'06), pp.2169-2178, 2006.
DOI : 10.1109/CVPR.2006.68

URL : https://hal.archives-ouvertes.fr/inria-00548585

A. Li, S. Shan, X. Chen, and W. Gao, Face recognition based on non-corresponding region matching, 2011 International Conference on Computer Vision, pp.1060-1067, 2011.
DOI : 10.1109/ICCV.2011.6126352

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.391.8555

H. Li, Y. Li, and F. Porikli, Robust Online Visual Tracking with a Single Convolutional Neural Network, Asian Conference on Computer Vision (ACCV), pp.1-16, 2014.
DOI : 10.1007/978-3-319-16814-2_13

Y. Li, D. J. Crandall, and D. P. Huttenlocher, Landmark classification in large-scale image collections, IEEE 12th International Conference on Computer Vision, pp.1957-1964, 2009.

L. Liu, L. Wang, and X. Liu, In defense of soft-assignment coding, Proceedings of the 2011 International Conference on Computer Vision, pp.2486-2493, 2011.

N. Liu, E. Dellandréa, L. Chen, C. Zhu, Y. Zhang et al., Multimodal recognition of visual concepts using histograms of textual concepts and selective weighted late fusion scheme, Computer Vision and Image Understanding, vol.117, issue.5, pp.493-512, 2013.
DOI : 10.1016/j.cviu.2012.10.009

URL : https://hal.archives-ouvertes.fr/hal-01339139

Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, A survey of content-based image retrieval with high-level semantics, Pattern Recognition, vol.40, issue.1, pp.262-282, 2007.
DOI : 10.1016/j.patcog.2006.04.045

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298965

URL : http://arxiv.org/pdf/1411.4038

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60, issue.2, pp.91-110, 2004.
DOI : 10.1023/B:VISI.0000029664.99615.94

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.4931

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space. arXiv preprint, 2013.

G. A. Miller, WordNet: a lexical database for English, Communications of the ACM, vol.38, issue.11, pp.39-41, 1995.
DOI : 10.1145/219717.219748

URL : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1823&rep=rep1&type=pdf

F. Monay and D. Gatica-Perez, Modeling Semantic Aspects for Cross-Media Image Indexing, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.10, pp.1802-1817, 2007.
DOI : 10.1109/TPAMI.2007.1097

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.170.6526

H. Müller, P. Clough, T. Deselaers, and B. Caputo, ImageCLEF: Experimental Evaluation in Visual Information Retrieval, 2010.
DOI : 10.1007/978-3-642-15181-1

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, Proceedings of the 28th international conference on machine learning (ICML-11), pp.689-696, 2011.

D. Novak, M. Batko, and P. Zezula, Large-scale Image Retrieval using Neural Net Descriptors, Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pp.1039-1040, 2015.
DOI : 10.1007/978-3-319-10085-2_4

P. Over, J. Fiscus, G. Sanders, D. Joy, M. Michel et al., TRECVID 2014: an overview of the goals, tasks, data, evaluation mechanisms and metrics, Proceedings of TRECVID, p.52, 2014.

B. Pepik, R. Benenson, T. Ritschel, and B. Schiele, What is holding back convnets for detection?, 2015.

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383266

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.7388

F. Perronnin and D. Larlus, Fisher vectors meet Neural Networks: A hybrid classification architecture, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298998

F. Perronnin, J. Sánchez, and Y. Liu, Large-scale image categorization with explicit data embedding, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.2297-2304, 2010.
DOI : 10.1109/CVPR.2010.5539914

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher kernel for large-scale image classification, Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pp.143-156, 2010.
DOI : 10.1007/978-3-642-15561-1_11

URL : https://hal.archives-ouvertes.fr/inria-00548630

D. Putthividhya, H. T. Attias, and S. S. Nagarajan, Topic regression multi-modal latent Dirichlet allocation for image annotation, Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp.3408-3415, 2010.
DOI : 10.1109/cvpr.2010.5540000

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.648.8796

V. Ranjan, N. Rasiwasia, and C. Jawahar, Multi-label Cross-Modal Retrieval, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4094-4102, 2015.
DOI : 10.1109/ICCV.2015.466

C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, Collecting image annotations using amazon's mechanical turk, Proceedings of the NAACL HLT, 2010.

N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, Cluster canonical correlation analysis, AISTATS, pp.823-831, 2014.

N. Rasiwasia and N. Vasconcelos, Scene classification with low-dimensional semantic spaces and weak supervision, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-6, 2008.
DOI : 10.1109/CVPR.2008.4587372

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.2182

N. Rasiwasia and N. Vasconcelos, Holistic context modeling using semantic co-occurrences, CVPR, pp.1889-1895, 2009.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, 2014.
DOI : 10.1109/cvprw.2014.131

URL : http://arxiv.org/pdf/1403.6382

S. Robertson and H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Foundations and Trends in Information Retrieval, vol.3, issue.4, pp.333-389, 2009.
DOI : 10.1561/1500000019

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.5282

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y

URL : http://arxiv.org/abs/1409.0575

G. Salton and M. J. Mcgill, Introduction to Modern Information Retrieval, 1986.

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image Classification with the Fisher Vector: Theory and Practice, International Journal of Computer Vision, vol.105, issue.3, pp.222-245, 2013.
DOI : 10.1007/s11263-013-0636-x

F. Schroff, D. Kalenichenko, and J. Philbin, FaceNet: A unified embedding for face recognition and clustering, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298682

URL : http://arxiv.org/abs/1503.03832

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, OverFeat: Integrated recognition, localization and detection using convolutional networks, International Conference on Learning Representations, p.16, 2014.

A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, Generalized Multiview Analysis: A discriminative latent space, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.2160-2167, 2012.
DOI : 10.1109/CVPR.2012.6247923

H. T. Shen, B. C. Ooi, and K. Tan, Giving meanings to WWW images, Proceedings of the eighth ACM international conference on Multimedia, MULTIMEDIA '00, pp.39-47, 2000.
DOI : 10.1145/354384.376098

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.66.1456

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, pp.1470-1477, 2003.
DOI : 10.1109/ICCV.2003.1238663

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.323.9793

A. F. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns and TRECVid, Proceedings of the 8th ACM international workshop on Multimedia information retrieval, MIR '06, pp.321-330, 2006.
DOI : 10.1145/1178677.1178722

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.329.3415

A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell, vol.22, issue.12, pp.1349-1380, 2000.

C. G. Snoek, M. Worring, and A. W. Smeulders, Early versus late fusion in semantic video analysis, Proceedings of the 13th annual ACM international conference on Multimedia, MULTIMEDIA '05, pp.399-402, 2005.
DOI : 10.1145/1101149.1101236

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.5928

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics, pp.207-218, 2014.

R. K. Srihari, Z. Zhang, and A. Rao, Intelligent indexing and semantic retrieval of multimodal documents, Information Retrieval, vol.2, issue.2/3, pp.245-275, 2000.
DOI : 10.1023/A:1009962928226

N. Srivastava and R. R. Salakhutdinov, Multimodal learning with deep boltzmann machines, Advances in neural information processing systems, pp.2222-2230, 2012.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1-9, 2015.
DOI : 10.1109/CVPR.2015.7298594

URL : http://arxiv.org/abs/1409.4842

Y. Tamaazousti, H. Le Borgne, A. Popescu, E. Gadeski, A. Ginsca et al., Vision-language integration using constrained local semantic features, Computer Vision and Image Understanding, 2017.
DOI : 10.1016/j.cviu.2017.05.017

Y. Tamaazousti, H. Le Borgne, and A. Popescu, Constrained Local Enhancement of Semantic Features by Content-Based Sparsity, Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, 2016.
DOI : 10.1145/2733373.2806244

T. Q. Tran, H. Le Borgne, and M. Crucianu, Combining Generic and Specific Information for Cross-modal Retrieval, Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ICMR '15, pp.551-554, 2015.
DOI : 10.1145/2502081.2502087

T. Q. Tran, H. Le Borgne, and M. Crucianu, Aggregating Image and Text Quantized Correlated Components, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.225

T. Q. Tran, H. Le Borgne, and M. Crucianu, Cross-modal Classification by Completing Unimodal Representations, Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion, iV&L-MM '16, pp.17-25, 2016.
DOI : 10.1109/CVPR.2009.5206816

R. Udupa and M. Khapra, Improving the multilingual user experience of wikipedia using cross-language name search, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.492-500, 2010.

N. Vasconcelos, Minimum Probability of Error Image Retrieval, IEEE Transactions on Signal Processing, vol.52, issue.8, pp.2322-2336, 2004.
DOI : 10.1109/TSP.2004.831125

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.1173

M. Villegas, R. Paredes, and B. Thomee, Overview of the imageclef 2013 scalable concept image annotation subtask, 2013.

V. Vukotić, C. Raymond, and G. Gravier, Bidirectional Joint Representation Learning with Symmetrical Deep Neural Networks for Multimodal and Crossmodal Applications, Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, pp.343-346, 2016.
DOI : 10.1007/s10994-010-5198-3

G. Wang, D. Hoiem, and D. Forsyth, Building text features for object image classification, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.1367-1374, 2009.
DOI : 10.1109/CVPR.2009.5206816

J. Wang, J. Yang, K. Yu, F. Lv, T. Huang et al., Locality-constrained Linear Coding for image classification, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3360-3367, 2010.
DOI : 10.1109/CVPR.2010.5540018

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.175.2312

K. Wang, R. He, L. Wang, W. Wang, and T. Tan, Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.10, 2015.
DOI : 10.1109/TPAMI.2015.2505311

K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, A comprehensive survey on cross-modal retrieval, 2016.

L. Wang, Y. Li, and S. Lazebnik, Learning Deep Structure-Preserving Image-Text Embeddings, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.541

URL : http://arxiv.org/abs/1511.06078

Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval, Proceedings of the ACM International Conference on Multimedia, MM '14, pp.307-316, 2014.
DOI : 10.1109/TMM.2013.2291214

Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong et al., CNN: Single-label to multi-label. arXiv preprint, 2014.

J. Weston, S. Bengio, and N. Usunier, Wsabie: Scaling up to large vocabulary image annotation, IJCAI, pp.2764-2770, 2011.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol.323, pp.533-536, 1986.

S. Xie and Z. Tu, Holistically-nested edge detection, The IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1007/s11263-017-1004-z

URL : http://arxiv.org/abs/1504.06375

F. Yan and K. Mikolajczyk, Deep correlation for matching images and text, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298966

J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid matching using sparse coding for image classification, Computer Vision and Pattern Recognition CVPR 2009. IEEE Conference on, pp.1794-1801, 2009.

T. Yao, T. Mei, and C. Ngo, Learning Query and Image Similarities with Ranking Canonical Correlation Analysis, 2015 IEEE International Conference on Computer Vision (ICCV), pp.28-36, 2015.
DOI : 10.1109/ICCV.2015.12

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, pp.67-78, 2014.

J. Zbontar and Y. LeCun, Computing the stereo matching cost with a convolutional neural network, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298767

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, 2014.
DOI : 10.1007/978-3-319-10590-1_53

URL : http://arxiv.org/abs/1311.2901

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning deep features for scene recognition using places database, Advances in Neural Information Processing Systems 27, pp.487-495, 2014.
DOI : 10.1109/tpami.2017.2723009

X. S. Zhou and T. S. Huang, CBIR: from low-level features to high-level semantics, 2000.
DOI : 10.1117/12.382975

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.6641

A. Znaidia, Handling imperfections for multimodal image annotation, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01012009

A. Znaidia, A. Shabou, H. Le Borgne, C. Hudelot et al., Bag-of-multimedia-words for image classification, ICPR, pp.1509-1512, 2012.