U. Ahsan and I. Essa, Clustering Social Event Images Using Kernel Canonical Correlation Analysis, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.800-805, 2014.
DOI : 10.1109/CVPRW.2014.124
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.683.6784

A. Amir, J. O. Argill, M. Berg, S. Fu-chang, M. Franz et al., IBM Research TRECVID-2004 video retrieval system, Proc. of TREC Video Retrieval Evaluation. Publications, 2004.

G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, Deep canonical correlation analysis, ICML (3), pp.1247-1255, 2013.

S. Avila, N. Thome, M. Cord, E. Valle, and A. D. Araújo, Pooling in image representation: The visual codeword point of view, Computer Vision and Image Understanding, vol.117, issue.5, pp.453-465, 2013.
DOI : 10.1016/j.cviu.2012.09.007
URL : https://hal.archives-ouvertes.fr/hal-01172709

D. M. Blei and M. I. Jordan, Modeling annotated data, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03, pp.127-134, 2003.
DOI : 10.1145/860435.860460
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.6686

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, vol.3, pp.993-1022, 2003.

A. Bosch, A. Zisserman, and X. Munoz, Representing shape with a spatial pyramid kernel, Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR '07, pp.401-408, 2007.
DOI : 10.1145/1282280.1282340

L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pp.177-187, 2010.
DOI : 10.1201/b11429-4
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.419.462

Y. Boureau, J. Ponce, and Y. LeCun, A theoretical analysis of feature pooling in visual recognition, ICML, 2010.

E. Bruni, N. K. Tran, and M. Baroni, Multimodal distributional semantics, Journal of Artificial Intelligence Research, 2014.

I. Chami, Représentation commune des textes et des images, 2016.

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, Proceedings of the British Machine Vision Conference 2011, 2011.
DOI : 10.5244/C.25.76

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint, 2014.

X. Chen, Y. Mu, S. Yan, and T.-S. Chua, Efficient large-scale image annotation by probabilistic collaborative multi-label propagation, Proceedings of the international conference on Multimedia, MM '10, pp.35-44, 2010.
DOI : 10.1145/1873951.1873959

X. Chen and A. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, Advances in Neural Information Processing Systems (NIPS), 2014.

X. Chen and L. C. Zitnick, Mind's eye: A recurrent visual representation for image caption generation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298856

F. Coelho and C. Ribeiro, Automatic illustration with cross-media retrieval in large-scale collections, 2011 9th International Workshop on Content-Based Multimedia Indexing (CBMI), pp.25-30, 2011.
DOI : 10.1109/CBMI.2011.5972515

J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. Lanckriet et al., On the role of correlation and abstraction in cross-modal multimedia retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, issue.3, pp.521-535, 2014.

G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, Workshop on Statistical Learning in Computer Vision, ECCV, pp.1-22, 2004.

R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image retrieval: Ideas, influences, and trends of the new age, ACM Computing Surveys, vol.40, issue.2, p.5, 2008.
DOI : 10.1145/1348246.1348248

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol.41, issue.6, pp.391-407, 1990.
DOI : 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8490

D. Delgado, J. Magalhaes, and N. Correia, Assisted news reading with automated illustration, Proceedings of the international conference on Multimedia, MM '10, pp.1647-1650, 2010.
DOI : 10.1145/1873951.1874311

J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang et al., Subcategory-Aware Object Classification, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.827-834, 2013.
DOI : 10.1109/CVPR.2013.112
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.675.8498

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas et al., FlowNet: Learning Optical Flow with Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.316
URL : http://arxiv.org/pdf/1504.06852

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.88, issue.2, pp.303-338, 2010.
DOI : 10.1007/s11263-009-0275-4
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.167.6629

F. Feng, X. Wang, and R. Li, Cross-modal retrieval with correspondence autoencoder, Proceedings of the 22nd ACM international conference on Multimedia, pp.7-16, 2014.
DOI : 10.1145/2647868.2654902

Y. Feng and M. Lapata, Topic models for image annotation and text illustration, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.831-839, 2010.

P. Fischer, A. Dosovitskiy, and T. Brox, Descriptor matching with convolutional neural networks: a comparison to SIFT, 2014.

E. Gabrilovich and S. Markovitch, Computing semantic relatedness using Wikipedia-based explicit semantic analysis, IJCAI, pp.1606-1611, 2007.

R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81
URL : http://arxiv.org/abs/1311.2524

Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics, International Journal of Computer Vision, vol.106, issue.2, pp.210-233, 2014.
DOI : 10.1007/s11263-013-0658-4
URL : http://arxiv.org/pdf/1212.4522

D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Computation, vol.16, issue.12, pp.2639-2664, 2004.
DOI : 10.1162/0899766042321814
URL : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6452&rep=rep1&type=pdf

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, IEEE International Conference on Computer Vision, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, issue.9, pp.1904-1916, 2015.
DOI : 10.1109/TPAMI.2015.2389824
URL : http://arxiv.org/pdf/1406.4729

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90
URL : http://arxiv.org/pdf/1512.03385

M. Hodosh, P. Young, and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research, vol.47, pp.853-899, 2013.

T. Hofmann, Probabilistic latent semantic indexing, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp.50-57, 1999.
DOI : 10.1145/3130348.3130370

T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning, vol.42, issue.1/2, pp.177-196, 2001.
DOI : 10.1023/A:1007617005950

H. Hotelling, Relations between two sets of variates, Biometrika, vol.28, issue.3-4, pp.321-377, 1936.
DOI : 10.1093/biomet/28.3-4.321

Y. Huang, Z. Wu, L. Wang, and T. Tan, Feature Coding in Image Classification: A Comprehensive Study, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, issue.3, pp.493-506, 2014.
DOI : 10.1109/TPAMI.2013.113

S. J. Hwang and K. Grauman, Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search, International Journal of Computer Vision, vol.100, issue.2, pp.134-153, 2012.
DOI : 10.1023/A:1023052124951

S. J. Hwang and K. Grauman, Reading between the lines: Object localization using implicit cues from image tags, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.1145-1158, 2012.
DOI : 10.1109/CVPR.2010.5540043
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.167.4308

S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Proceedings of the 32nd International Conference on Machine Learning JMLR Proceedings, pp.448-456, 2015.

H. Jégou, M. Douze, C. Schmid, and P. Pérez, Aggregating local descriptors into a compact image representation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3304-3311, 2010.
DOI : 10.1109/CVPR.2010.5540039

Y. Jia, M. Salzmann, and T. Darrell, Learning cross-modality similarity for multinomial data, 2011 International Conference on Computer Vision, pp.2407-2414, 2011.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, pp.675-678, 2014.
DOI : 10.1145/2647868.2654889

A. Joly and O. Buisson, Random maximum margin hashing, CVPR 2011, pp.20-25, 2011.
DOI : 10.1109/CVPR.2011.5995709
URL : https://hal.archives-ouvertes.fr/hal-00642178

D. Joshi, J. Z. Wang, and J. Li, The story picturing engine, Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval, MIR '04, pp.119-126, 2004.
DOI : 10.1145/1026711.1026732

D. Joshi, J. Z. Wang, and J. Li, The Story Picturing Engine: a system for automatic text illustration, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), pp.68-89, 2006.
DOI : 10.1145/1126004.1126008

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3128-3137, 2015.
DOI : 10.1109/tpami.2016.2598339
URL : http://arxiv.org/pdf/1412.2306

A. Karpathy, A. Joulin, and L. Fei-Fei, Deep fragment embeddings for bidirectional image sentence mapping, Advances in neural information processing systems, pp.1889-1897, 2014.

B. Klein, G. Lev, G. Sadeh, and L. Wolf, Associating neural word embeddings with deep image representations using Fisher Vectors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299073

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting, pp.1106-1114, 2012.
DOI : 10.1162/neco.2009.10-08-881

S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR'06), pp.2169-2178, 2006.
DOI : 10.1109/CVPR.2006.68
URL : https://hal.archives-ouvertes.fr/inria-00548585

A. Li, S. Shan, X. Chen, and W. Gao, Face recognition based on non-corresponding region matching, 2011 International Conference on Computer Vision, pp.1060-1067, 2011.
DOI : 10.1109/ICCV.2011.6126352
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.391.8555

H. Li, Y. Li, and F. Porikli, Robust Online Visual Tracking with a Single Convolutional Neural Network, Asian Conference on Computer Vision (ACCV), pp.1-16, 2014.
DOI : 10.1007/978-3-319-16814-2_13

Y. Li, D. J. Crandall, and D. P. Huttenlocher, Landmark classification in large-scale image collections, IEEE 12th International Conference on Computer Vision, pp.1957-1964, 2009.

L. Liu, L. Wang, and X. Liu, In defense of soft-assignment coding, Proceedings of the 2011 International Conference on Computer Vision, pp.2486-2493, 2011.

N. Liu, E. Dellandréa, L. Chen, C. Zhu, Y. Zhang et al., Multimodal recognition of visual concepts using histograms of textual concepts and selective weighted late fusion scheme, Computer Vision and Image Understanding, vol.117, issue.5, pp.493-512, 2013.
DOI : 10.1016/j.cviu.2012.10.009
URL : https://hal.archives-ouvertes.fr/hal-01339139

Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, A survey of content-based image retrieval with high-level semantics, Pattern Recognition, vol.40, issue.1, pp.262-282, 2007.
DOI : 10.1016/j.patcog.2006.04.045

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298965
URL : http://arxiv.org/pdf/1411.4038

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60, issue.2, pp.91-110, 2004.
DOI : 10.1023/B:VISI.0000029664.99615.94
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.4931

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space. arXiv preprint, 2013.

G. A. Miller, WordNet: a lexical database for English, Communications of the ACM, vol.38, issue.11, pp.39-41, 1995.
DOI : 10.1145/219717.219748
URL : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1823&rep=rep1&type=pdf

F. Monay and D. Gatica-Perez, Modeling Semantic Aspects for Cross-Media Image Indexing, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.10, pp.1802-1817, 2007.
DOI : 10.1109/TPAMI.2007.1097
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.170.6526

H. Müller, P. Clough, T. Deselaers, and B. Caputo, ImageCLEF: Experimental Evaluation in Visual Information Retrieval, 2010.
DOI : 10.1007/978-3-642-15181-1

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, Proceedings of the 28th international conference on machine learning (ICML-11), pp.689-696, 2011.

D. Novak, M. Batko, and P. Zezula, Large-scale Image Retrieval using Neural Net Descriptors, Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pp.1039-1040, 2015.
DOI : 10.1007/978-3-319-10085-2_4

P. Over, J. Fiscus, G. Sanders, D. Joy, M. Michel et al., TRECVID 2014: an overview of the goals, tasks, data, evaluation mechanisms and metrics, Proceedings of TRECVID, p.52, 2014.

B. Pepik, R. Benenson, T. Ritschel, and B. Schiele, What is holding back convnets for detection?, German Conference on Pattern Recognition (GCPR), 2015.

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383266
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.7388

F. Perronnin and D. Larlus, Fisher vectors meet Neural Networks: A hybrid classification architecture, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298998

F. Perronnin, J. Sánchez, and Y. Liu, Large-scale image categorization with explicit data embedding, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.2297-2304, 2010.
DOI : 10.1109/CVPR.2010.5539914

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher kernel for large-scale image classification, Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pp.143-156, 2010.
DOI : 10.1007/978-3-642-15561-1_11
URL : https://hal.archives-ouvertes.fr/inria-00548630

D. Putthividhya, H. T. Attias, and S. S. Nagarajan, Topic regression multi-modal latent Dirichlet allocation for image annotation, Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp.3408-3415, 2010.
DOI : 10.1109/cvpr.2010.5540000
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.648.8796

V. Ranjan, N. Rasiwasia, and C. Jawahar, Multi-label Cross-Modal Retrieval, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4094-4102, 2015.
DOI : 10.1109/ICCV.2015.466

C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, Collecting image annotations using amazon's mechanical turk, Proceedings of the NAACL HLT, 2010.

N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, Cluster canonical correlation analysis, AISTATS, pp.823-831, 2014.

N. Rasiwasia and N. Vasconcelos, Scene classification with low-dimensional semantic spaces and weak supervision, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-6, 2008.
DOI : 10.1109/CVPR.2008.4587372
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.2182

N. Rasiwasia and N. Vasconcelos, Holistic context modeling using semantic co-occurrences, CVPR, pp.1889-1895, 2009.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, 2014.
DOI : 10.1109/cvprw.2014.131
URL : http://arxiv.org/pdf/1403.6382

S. Robertson and H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Foundations and Trends in Information Retrieval, vol.3, issue.4, pp.333-389, 2009.
DOI : 10.1561/1500000019
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.5282

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y
URL : http://arxiv.org/abs/1409.0575

G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, 1986.

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image Classification with the Fisher Vector: Theory and Practice, International Journal of Computer Vision, vol.105, issue.3, pp.222-245, 2013.
DOI : 10.1007/s11263-013-0636-x

F. Schroff, D. Kalenichenko, and J. Philbin, FaceNet: A unified embedding for face recognition and clustering, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298682
URL : http://arxiv.org/abs/1503.03832

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., OverFeat: Integrated recognition, localization and detection using convolutional networks, International Conference on Learning Representations, p.16, 2014.

A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, Generalized Multiview Analysis: A discriminative latent space, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.2160-2167, 2012.
DOI : 10.1109/CVPR.2012.6247923

H. T. Shen, B. C. Ooi, and K. Tan, Giving meanings to WWW images, Proceedings of the eighth ACM international conference on Multimedia, MULTIMEDIA '00, pp.39-47, 2000.
DOI : 10.1145/354384.376098
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.66.1456

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, pp.1470-1477, 2003.
DOI : 10.1109/ICCV.2003.1238663
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.323.9793

A. F. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns and TRECVid, Proceedings of the 8th ACM international workshop on Multimedia information retrieval, MIR '06, pp.321-330, 2006.
DOI : 10.1145/1178677.1178722
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.329.3415

A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.22, issue.12, pp.1349-1380, 2000.

C. G. Snoek, M. Worring, and A. W. Smeulders, Early versus late fusion in semantic video analysis, Proceedings of the 13th annual ACM international conference on Multimedia, MULTIMEDIA '05, pp.399-402, 2005.
DOI : 10.1145/1101149.1101236
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.78.5928

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics, pp.207-218, 2014.

R. K. Srihari, Z. Zhang, and A. Rao, Intelligent indexing and semantic retrieval of multimodal documents, Information Retrieval, vol.2, issue.2/3, pp.245-275, 2000.
DOI : 10.1023/A:1009962928226

N. Srivastava and R. R. Salakhutdinov, Multimodal learning with deep boltzmann machines, Advances in neural information processing systems, pp.2222-2230, 2012.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1-9, 2015.
DOI : 10.1109/CVPR.2015.7298594
URL : http://arxiv.org/abs/1409.4842

Y. Tamaazousti, H. Le Borgne, A. Popescu, E. Gadeski, A. Ginsca et al., Vision-language integration using constrained local semantic features, Computer Vision and Image Understanding, 2017.
DOI : 10.1016/j.cviu.2017.05.017

Y. Tamaazousti, H. Le Borgne, and A. Popescu, Constrained Local Enhancement of Semantic Features by Content-Based Sparsity, Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, 2016.
DOI : 10.1145/2733373.2806244

T. Q. Tran, H. Le Borgne, and M. Crucianu, Combining Generic and Specific Information for Cross-modal Retrieval, Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ICMR '15, pp.551-554, 2015.
DOI : 10.1145/2502081.2502087

T. Q. Tran, H. Le Borgne, and M. Crucianu, Aggregating Image and Text Quantized Correlated Components, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.225

T. Q. Tran, H. Le Borgne, and M. Crucianu, Cross-modal Classification by Completing Unimodal Representations, Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion, iV&L-MM '16, pp.17-25, 2016.
DOI : 10.1109/CVPR.2009.5206816

R. Udupa and M. Khapra, Improving the multilingual user experience of Wikipedia using cross-language name search, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.492-500, 2010.

N. Vasconcelos, Minimum Probability of Error Image Retrieval, IEEE Transactions on Signal Processing, vol.52, issue.8, pp.2322-2336, 2004.
DOI : 10.1109/TSP.2004.831125
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.1173

M. Villegas, R. Paredes, and B. Thomee, Overview of the ImageCLEF 2013 scalable concept image annotation subtask, 2013.

V. Vukotić, C. Raymond, and G. Gravier, Bidirectional Joint Representation Learning with Symmetrical Deep Neural Networks for Multimodal and Crossmodal Applications, Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR '16, pp.343-346, 2016.
DOI : 10.1007/s10994-010-5198-3

G. Wang, D. Hoiem, and D. Forsyth, Building text features for object image classification, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.1367-1374, 2009.
DOI : 10.1109/CVPR.2009.5206816

J. Wang, J. Yang, K. Yu, F. Lv, T. Huang et al., Locality-constrained Linear Coding for image classification, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3360-3367, 2010.
DOI : 10.1109/CVPR.2010.5540018
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.175.2312

K. Wang, R. He, L. Wang, W. Wang, and T. Tan, Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.10, 2015.
DOI : 10.1109/TPAMI.2015.2505311

K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, A comprehensive survey on cross-modal retrieval, 2016.

L. Wang, Y. Li, and S. Lazebnik, Learning Deep Structure-Preserving Image-Text Embeddings, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.541
URL : http://arxiv.org/abs/1511.06078

Y. Wang, F. Wu, J. Song, X. Li, and Y. Zhuang, Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval, Proceedings of the ACM International Conference on Multimedia, MM '14, pp.307-316, 2014.
DOI : 10.1109/TMM.2013.2291214

Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong et al., CNN: Single-label to multi-label. arXiv preprint, 2014.

J. Weston, S. Bengio, and N. Usunier, Wsabie: Scaling up to large vocabulary image annotation, IJCAI, pp.2764-2770, 2011.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol.323, pp.533-536, 1986.

S. Xie and Z. Tu, Holistically-nested edge detection, The IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1007/s11263-017-1004-z
URL : http://arxiv.org/abs/1504.06375

F. Yan and K. Mikolajczyk, Deep correlation for matching images and text, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298966

J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid matching using sparse coding for image classification, Computer Vision and Pattern Recognition CVPR 2009. IEEE Conference on, pp.1794-1801, 2009.

T. Yao, T. Mei, and C. Ngo, Learning Query and Image Similarities with Ranking Canonical Correlation Analysis, 2015 IEEE International Conference on Computer Vision (ICCV), pp.28-36, 2015.
DOI : 10.1109/ICCV.2015.12

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, vol.2, pp.67-78, 2014.

J. Zbontar and Y. LeCun, Computing the stereo matching cost with a convolutional neural network, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298767

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, European Conference on Computer Vision (ECCV), pp.818-833, 2014.
DOI : 10.1007/978-3-319-10590-1_53
URL : http://arxiv.org/abs/1311.2901

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning deep features for scene recognition using places database, Advances in Neural Information Processing Systems 27, pp.487-495, 2014.
DOI : 10.1109/tpami.2017.2723009

X. S. Zhou and T. S. Huang, CBIR: from low-level features to high-level semantics, 2000.
DOI : 10.1117/12.382975
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.6641

A. Znaidia, Handling imperfections for multimodal image annotation, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01012009

A. Znaidia, A. Shabou, H. Le Borgne, C. Hudelot, and N. Paragios, Bag-of-multimedia-words for image classification, ICPR, pp.1509-1512, 2012.