Tobii AB, Tobii Studio User's Manual Version 3.4.5, p.83, 2016.

K. Aizawa and M. Ogawa, FoodLog: Multimedia Tool for Healthcare Applications, IEEE MultiMedia, vol.22, pp.4-8, 2015.

K. Aizawa, Y. Maruyama, H. Li, and C. Morikawa, Food Balance Estimation by Using Personal Dietary Tendencies in a Multimedia Food Log, IEEE Transactions on Multimedia, vol.40, p.22, 2013.

S. Amano, K. Aizawa, and M. Ogawa, Frequency Statistics of Words Used in Japanese Food Records of FoodLog, ACM UbiComp, p.23, 2014.

J. Amores, Multiple Instance Classification: Review, Taxonomy and Comparative Study, Artif. Intell, vol.201, p.35, 2013.

S. Andrews, I. Tsochantaridis, and T. Hofmann, Support Vector Machines for Multiple-Instance Learning, Advances in Neural Information Processing Systems (NIPS), p.34, 2002.

R. Arandjelovic, P. Gronát, A. Torii, T. Pajdla, and J. Sivic, NetVLAD: CNN Architecture for Weakly Supervised Place Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5297-5307, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01242052

S. Avila, N. Thome, M. Cord, E. Valle, and A. Araujo, Pooling in Image Representation: the Visual Codeword Point of View, Computer Vision and Image Understanding, vol.54, p.46, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01172709

H. Azizpour, M. Arefiyan, S. Naderi Parizi, and S. Carlsson, Spotlight the Negatives: A Generalized Discriminative Latent Model, British Machine Vision Conference, pp.1-11, 2015.

B. Babenko, Multiple Instance Learning: Algorithms and Applications, p.35, 2009.

H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, Speeded-Up Robust Features (SURF), Computer Vision and Image Understanding, vol.110, issue.3 (Similarity Matching in Computer Vision and Multimedia), p.15, 2008.

O. Beijbom, N. Joshi, D. Morris, S. Saponas, and S. Khullar, Menu-Match: Restaurant-Specific Food Logging from Images, 2015 IEEE Winter Conference on Applications of Computer Vision, p.22, 2015.
DOI : 10.1109/wacv.2015.117

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, MUTAN: Multimodal Tucker Fusion for Visual Question Answering, p.96, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

V. Bettadapura, E. Thomaz, A. Parnami, G. D. Abowd, and I. Essa, Leveraging Context to Support Automated Food Recognition in Restaurants, Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision. WACV '15, p.22, 2015.

. Bilen, . Hakan, P. Vinay, L. J. Namboodiri, and . Van-gool, Active labeling application applied to food-related object recognition, Bolaños, Marc, M Garolera, and P Radeva, vol.106, p.23, 2013.

A. Borji and L. Itti, Defending Yarbus: Eye movements reveal observers' task, In: Journal of Vision, vol.14, p.30, 2014.
DOI : 10.1167/14.3.29

URL : https://jov.arvojournals.org/data/journals/jov/932817/i1534-7362-14-3-29.pdf

L. Bossard, Food-101: Mining Discriminative Components with Random Forests, vol.23, pp.39-41, 2014.
DOI : 10.1007/978-3-319-10599-4_29

A. Bulling, J. A. Ward, H. Gellersen, and G. Troster, Eye Movement Analysis for Activity Recognition Using Electrooculography, IEEE Trans. Pattern Anal. Mach. Intell, vol.33, p.31, 2011.

R. Cadène, N. Thome, and M. Cord, Master's Thesis : Deep Learning for Visual Recognition, p.47, 2016.

M. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, Multiple Instance Learning: A Survey of Problem Characteristics and Applications, p.35, 2016.

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets, British Machine Vision Conference (BMVC), 2014.

J. Chen, L. Pang, and C. Ngo, Cross-Modal Recipe Retrieval: How to Cook this Dish?, In: MultiMedia Modeling: 23rd International Conference, MMM 2017, p.22, 2017.

M. Chen et al., PFID: Pittsburgh fast-food image dataset, ICIP, vol.21, p.20, 2009.

S. Christodoulidis, M. Anthimopoulos, and S. G. Mougiakakou, Food Recognition for Dietary Assessment Using Deep Convolutional Neural Networks, New Trends in Image Analysis and Processing-ICIAP 2015 Workshops, p.22, 2015.

Cisco, White paper: Cisco VNI Forecast and Methodology, p.1, 2016.

M. Cord and P. H. Gosselin, Image retrieval using long-term semantic learning, IEEE, p.54, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00520307

G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, Visual Categorization with Bags of Keypoints, Workshop on Statistical Learning in Computer Vision, ECCV, p.15, 2004.

N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol.1, p.15, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

D. Damen, T. Leelasawassuk, and W. Mayol-cuevas, You-Do, I-Learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance, Computer Vision and Image Understanding, vol.149, p.31, 2016.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A Large-Scale Hierarchical Image Database, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.248-255, 2009.

T. Deselaers and V. Ferrari, A Conditional Random Field for Multiple-Instance Learning, Proceedings of the 27th International Conference on Machine Learning (ICML-10). Ed. by Johannes Fürnkranz and Thorsten Joachims. Omnipress, p.35, 2010.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence, vol.89, p.33, 1997.

T. Durand, N. Thome, and M. Cord, MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking, International Conference on Computer Vision, pp.2713-2721, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01343784

T. Durand, N. Thome, and M. Cord, WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition, pp.4743-4752, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01343785

T. Durand, N. Thome, M. Cord, and D. Picard, Incremental learning of latent structural SVM for weakly supervised image classification, IEEE International Conference on Image Processing, p.35, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01077058

T. Durand, T. Mordan, N. Thome, and M. Cord, WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01515640

U. Engelke, H. Liu, J. Wang, P. Le Callet, I. Heynderickx et al., Comparative Study of Fixation Density Maps, IEEE Transactions on Image Processing, vol.22, p.31, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00757423

M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes Challenge: A Retrospective, International Journal of Computer Vision, vol.111, pp.98-136, 2015.

R. Fan, LIBLINEAR: A library for large linear classification, JMLR, p.45, 2008.

G. M. Farinella, M. Moltisanti, and S. Battiato, Classifying food images represented as Bag of Textons, 2014 IEEE International Conference on Image Processing (ICIP), p.22, 2014.

G. M. Farinella, A Benchmark Dataset to Study Representation of Food Images, In: ECCV workshop, vol.21, p.20, 2014.

B. Fasel, F. Monay, and D. Gatica-Perez, Latent Semantic Analysis of Facial Action Codes for Automatic Facial Expression Recognition, Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval. MIR '04, p.15, 2004.

A. Fathi, Y. Li, and J. M. Rehg, Learning to Recognize Daily Actions Using Gaze, European Conference on Computer Vision, pp.314-327, 2012.

L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, What do we perceive in a glance of a real-world scene?, Journal of Vision, vol.7, p.65, 2007.

P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Trans. Pattern Anal. Mach. Intell, vol.32, pp.1627-1645, 2010.

J. Foulds and E. Frank, A review of multi-instance learning assumptions, In: The Knowledge Engineering Review, vol.25, p.35, 2010.

J. Fournier, M. Cord, and S. , Backpropagation algorithm for relevance feedback in image retrieval, Proceedings. 2001 International Conference on, vol.1, p.54, 2001.

K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, In: Biological Cybernetics, vol.36, p.4, 1980.

A. S. d'Avila Garcez and G. Zaverucha, Multi-instance learning using recurrent neural networks, The 2012 International Joint Conference on Neural Networks (IJCNN), p.35, 2012.

G. Ge, K. Yun, D. Samaras, and G. J. Zelinsky, Action classification in still images using human eye movements, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.16-23, 2015.

S. O. Gilani, R. Subramanian, Y. Yan, D. Melcher, N. Sebe, and S. Winkler, PET: An eye-tracking dataset for animal-centric Pascal object classes, IEEE International Conference on Multimedia and Expo, p.32, 2015.

G. Gkioxari, R. Girshick, and J. Malik, Actions and Attributes from Wholes and Parts, International Conference on Computer Vision (ICCV), p.86, 2015.

Y. Gong, L. Wang, R. Guo, and S. Lazebnik, Multiscale Orderless Pooling of Deep Convolutional Activation Features, European Conference on Computer Vision (ECCV), pp.392-407, 2014.

A. Gordo, A. Gaidon, and F. Perronnin, Deep Fishing: Gradient Features from Deep Nets, British Machine Vision Conference, vol.87, p.86, 2015.

D. Gorisse, M. Cord, and F. Precioso, SALSAS: Sub-linear active learning strategy with approximate k-NN search, Pattern Recognition, vol.44, issue.10, p.54, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00773102

P. H. Gosselin and M. Cord, Active learning methods for interactive image retrieval, Image Processing, p.54, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00520292

P. H. Gosselin and M. Cord, RETIN AL: An active learning strategy for image category retrieval, Image Processing, vol.4, p.54, 2004.
URL : https://hal.archives-ouvertes.fr/hal-00520315

T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola, Multi-Instance Kernels, Proc. 19th International Conf. on Machine Learning, p.33, 2002.

S. S. Hacisalihzade, L. W. Stark, and J. S. Allen, Visual Perception and Sequences of Eye Movement Fixations: A Stochastic Modeling Approach, IEEE Transactions on Systems, Man and Cybernetics, vol.22, p.31, 1992.

A. Haji-Abolhassani and J. J. Clark, An inverse Yarbus process: Predicting observers' task from eye movement patterns, Vision Research, vol.103, p.30, 2014.

H. Hassannejad, G. Matrella, P. Ciampolini, I. De Munari, M. Mordonini, and S. Cagnoni, Food Image Recognition Using Very Deep Convolutional Networks, Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. MADiMa '16, vol.40, p.23, 2016.

H. He, F. Kong, and J. Tan, DietCam: Multiview Food Recognition Using a Multikernel SVM, IEEE Journal of Biomedical and Health Informatics, vol.20, pp.20-22, 2016.

K. He, X. Zhang, S. Ren, and J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015 IEEE International Conference on Computer Vision, ICCV 2015, pp.1026-1034, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Trans. Pattern Anal. Mach. Intell, vol.37, issue.6, pp.1904-1916, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.18, pp.770-778, 2016.

Y. He, C. Xu, N. Khanna, C. J. Boushey, and E. J. Delp, Analysis of food images: Features and classification, 2014 IEEE International Conference on Image Processing (ICIP), vol.21, p.20, 2014.

L. Herranz, S. Jiang, and R. Xu, Modeling Restaurant Context for Food Recognition, IEEE Transactions on Multimedia, vol.19, issue.2, p.22, 2017.

F. Herrera, S. Ventura, R. Bello, C. Cornelis, A. Zafra, D. Sánchez-Tarragó, and S. Vluymans, Multiple Instance Learning: Foundations and Algorithms, p.35, 2016.

M. Hoai, Regularized max pooling for image categorization, British Machine Vision Conference (BMVC), p.86, 2014.
DOI : 10.5244/c.28.32

URL : http://www.bmva.org/bmvc/2014/files/abstract072.pdf

H. Hoashi, T. Joutou, and K. Yanai, Image Recognition of 85 Food Categories by Feature Fusion, 2010 IEEE International Symposium on Multimedia, p.22, 2010.

D. Hubel and T. N. Wiesel, Receptive Fields, Binocular Interaction, and Functional Architecture in the Cat's Visual Cortex, In: Journal of Physiology, vol.160, pp.106-154, 1962.

E. B. Huey, The psychology and pedagogy of reading, p.24, 1908.

Z. Hussain, A. Klami, J. Kujala, A. P. Leung, K. Pasupa et al., PinView: Implicit Feedback in Content-Based Image Retrieval, p.31, 2014.

J. Hessel, N. Savva, and M. J. Wilber, Image Representations and New Domains in Neural Image Captioning, EMNLP Vision + Learning workshop, p.22, 2015.

R. J. Jacob and K. S. Karn, Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises, p.30, 2003.

R. J. Jacob, Eye Tracking in Advanced Interface Design, In: Virtual environments and advanced interface design, p.30, 1995.

J. Chen and C. Ngo, Deep-based Ingredient Recognition for Cooking Recipe Retrieval, ACM Multimedia, vol.40, pp.20-23, 2016.

T. Joachims, T. Finley, and C. Yu, Cutting-plane training of structural SVMs, Machine Learning, vol.77, issue.1, p.61, 2009.

M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, Blocks That Shout: Distinctive Parts for Scene Classification, IEEE Conference on Computer Vision and Pattern Recognition, p.34, 2013.

M. A. Just and P. A. Carpenter, A theory of reading: From eye fixations to comprehension, Psychological Review, vol.87, p.24, 1980.

H. Kagaya, K. Aizawa, and M. Ogawa, Food Detection and Recognition Using Convolutional Neural Network, Proceedings of the 22nd ACM International Conference on Multimedia. MM '14, p.22, 2014.
DOI : 10.1145/2647868.2654970

S. Karthikeyan, V. Jagadeesh, R. Shenoy, M. Eckstein, and B. S. Manjunath, From Where and How to What We See, International Conference on Computer Vision, pp.625-632, 2013.
DOI : 10.1109/iccv.2013.83

S. Karthikeyan, T. Ngo, M. P. Eckstein, and B. S. Manjunath, Eye tracking assisted extraction of attentionally important objects from videos, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.31, 2015.

A. B. Kashlak, E. Devane, H. Dietert, and H. Jackson, Markov models for ocular fixation locations in the presence and absence of colour, Journal of the Royal Statistical Society: Series C (Applied Statistics), p.31, 2017.

Y. Kawano and K. Yanai, Automatic Expansion of a Food Image Dataset Leveraging Existing Categories with Domain Adaptation, Proc. of ECCV Workshop on TASK-CV, p.41, 2014.

Y. Kawano and K. Yanai, Food image recognition with deep convolutional features, ACM UbiComp, vol.22, 2014.

Y. Kawano and K. Yanai, FoodCam-256: A Large-scale Realtime Mobile Food Recognition System Employing High-Dimensional Features and Compression of Classifier Weights, Proceedings of the 22nd ACM International Conference on Multimedia. MM '14, pp.761-762, 2014.

Y. Kawano and K. Yanai, FoodCam: A real-time food recognition system on a smartphone, Multimedia Tools and Applications, vol.74, issue.14, p.23, 2015.

K. Yanai and Y. Kawano, Food Image Recognition Using Deep Convolutional Network with Pre-training and Fine-tuning, IEEE International Conference on Multimedia and Expo, workshop CEA, 2015.

K. Kesorn and S. Poslad, An Enhanced Bag-of-Visual Word Vector Space Model to Represent Visual Content in Athletics Images, IEEE Transactions on Multimedia, issue.1, p.16, 2012.

N. Khanna, An Overview of the Technology Assisted Dietary Assessment Project at Purdue University, Proc. IEEE Int. Symp. Multimedia, p.24, 2010.

K. Kitamura, C. Silva, T. Yamasaki, and K. Aizawa, Image processing based approach to food balance analysis for personal food logging, 2010 IEEE International Conference on Multimedia and Expo, vol.40, p.22, 2010.

A. Klami, C. Saunders, T. E. de Campos, and S. Kaski, Can relevance of images be inferred from eye movements?, Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, p.26, 2008.

K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar et al., Eye Tracking for Everyone, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.31, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems (NIPS), vol.18, pp.1097-1105, 2012.

S. S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. V. Babu, Saliency Unified: A Deep Architecture for simultaneous Eye Fixation Prediction and Salient Object Segmentation, IEEE Conference on Computer Vision and Pattern Recognition, p.31, 2016.

M. P. Kumar, B. Packer, and D. Koller, Self-Paced Learning for Latent Variable Models, Advances in Neural Information Processing Systems (NIPS), p.35, 2010.

Q. Le and T. Mikolov, Distributed Representations of Sentences and Documents, ICML, p.51, 2014.

P. Le Callet and E. Niebur, Visual Attention and Applications in Multimedia Technologies, Proceedings of the IEEE, vol.101, issue.9, p.30, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00951728

O. Le Meur and P. Le Callet, What we see is most likely to be what matters: Visual attention and applications, 16th IEEE International Conference on Image Processing (ICIP), p.30, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00441011

O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, A coherent computational approach to model bottom-up visual attention, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.28, p.30, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00669578

O. Le Meur, P. Le Callet, and D. Barba, Predicting visual fixations on video based on low-level visual features, Vision Research, vol.47, issue.19, p.96, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00287424

E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, Labeled Faces in the Wild: A Survey, Advances in Face Detection and Facial Image Analysis, pp.189-248, 2016.

Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard et al., Backpropagation applied to handwritten zip code recognition, Neural computation 1.4, vol.18, pp.541-551, 1989.

L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei, Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification, Advances in Neural Information Processing Systems, p.66, 2010.

W. Li and N. Vasconcelos, Multiple instance learning for soft bags via top instances, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4277-4285, 2015.

X. Li and A. Godil, Investigating the Bag-of-words Method for 3D Shape Retrieval, In: EURASIP J. Adv. Signal Process, vol.5, p.16, 2010.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, Computer Vision-ECCV 2014: 13th European Conference, pp.740-755, 2014.

C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane et al., DeepFood: Deep Learning-Based Food Image Recognition for Computer-Aided Dietary Assessment, Inclusive Smart Cities and Digital Health-14th International Conference on Smart Homes and Health Telematics, ICOST 2016, pp.37-48, 2016.

S. Lopez, A. Revel, D. Lingrand, and F. Precioso, One gaze is worth ten thousand (key-)words, IEEE International Conference on Image Processing (ICIP), vol.32, p.31, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01323204

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, In: Int. J. Comput. Vision, vol.60, issue.2, pp.91-110, 2004.

L. Herranz, R. Xu, and S. Jiang, A Probabilistic Model for Food Image Recognition in Restaurants, IEEE International Conference on Multimedia and Expo, pp.20-22, 2015.

L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, Regularization of Neural Networks using DropConnect, ICML, p.19, 2013.

W. Ma and B. S. Manjunath, NeTra: A Toolbox for Navigating Large Image Databases, In: Multimedia Syst, p.3, 1999.

P. Majaranta and A. Bulling, Eye Tracking and Eye-Based Human-Computer Interaction, p.30, 2014.

O. Maron and T. Lozano-Pérez, A framework for multiple-instance learning, Advances in Neural Information Processing Systems (NIPS), p.33, 1998.

O. Maron and T. Lozano-Pérez, Multiple-Instance Learning for Natural Scene Classification, ICML, p.33, 1998.

S. Mathe, A. Pirinen, and C. Sminchisescu, Reinforcement Learning for Visual Object Detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.56, p.35, 2016.

S. Mathe and C. Sminchisescu, Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths, Advances in Neural Information Processing Systems, vol.63, pp.1923-1931, 2013.

S. Mathe and C. Sminchisescu, Multiple Instance Reinforcement Learning for Efficient Weakly-Supervised Detection in Images, vol.56, p.35, 2014.

S. Mathe and C. Sminchisescu, Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition, IEEE Trans. Pattern Anal. Mach. Intell, vol.37, issue.7, p.31, 2015.

Y. Matsuda and K. Yanai, Multiple-food recognition considering co-occurrence employing manifold ranking, ICPR, p.22, 2012.

H. Matsunaga, K. Doman, T. Hirayama, I. Ide, D. Deguchi et al., Tastes and Textures Estimation of Foods Based on the Analysis of Its Ingredients List and Image, New Trends in Image Analysis and Processing-ICIAP 2015 Workshops: ICIAP 2015 International Workshops, p.22, 2015.

M. Blaschko, P. Kumar, and B. Taskar, Tutorial: Visual Learning with Weak Supervision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

M. Posner, Orienting of Attention, The Quarterly Journal of Experimental Psychology, vol.32, p.24, 1980.

T. Mikolov, Distributed representations of words and phrases and their compositionality, NIPS, vol.50, p.40, 2013.

W. Min, S. Jiang, J. Sang, H. Wang, X. Liu et al., Being a Super Cook: Joint Food Attributes and Multi-Modal Content Modeling for Recipe Retrieval and Exploration, IEEE Transactions on Multimedia X.XX, pp.20-22, 2016.

A. K. Mishra, Y. Aloimonos, and L. Cheong, Active segmentation with fixation, IEEE International Conference on Computer Vision, p.31, 2009.

T. Miyazaki, G. C. Silva, and K. Aizawa, Image-based Calorie Content Estimation for Dietary Assessment, 2011 IEEE International Symposium on Multimedia, p.22, 2011.

T. Mordan, N. Thome, G. Henaff, and M. Cord, Deformable Part-based Fully Convolutional Network for Object Detection, Proceedings of the British Machine Vision Conference (BMVC), p.96, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01637070

A. Myers, N. Johnston, V. Rathod, A. Korattikara, A. Gorban et al., Im2Calories: Towards an automated mobile vision food diary, Proceedings of the IEEE International Conference on Computer Vision, vol.11, pp.20-23, 2016.

A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, Does where you Gaze on an Image Affect your Perception of Quality? Applying Visual Attention to Image Quality Metric, 2007 IEEE International Conference on Image Processing, vol.2, p.30, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00342599

A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, Considering Temporal Variations of Spatial Visual Distortions in Video Quality Assessment, IEEE Journal of Selected Topics in Signal Processing, issue.2, p.30, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00345895

J. Noronha, E. Hysen, H. Zhang, and K. Z. Gajos, Platemate: Crowdsourcing Nutritional Analysis from Food Photographs, Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. UIST '11, p.23, 2011.

L. Oliveira, V. Costa, G. Neves, T. Oliveira, E. Jorge et al., A mobile, lightweight, poll-based food identification system, Pattern Recognition, vol.47, issue.5, pp.1941-1952, 2014.

A. Olsen, The Tobii I-VT Fixation Filter, p.80, 2012.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition, vol.87, pp.1717-1724, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00911179

J. Pan, E. Sayrol, X. Giró-i-Nieto, K. McGuinness, and N. E. O'Connor, Shallow and Deep Convolutional Networks for Saliency Prediction, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.598-606, 2016.

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, IEEE International Conference on Computer Vision, p.34, 2011.

D. P. Papadopoulos, A. D. F. Clarke, F. Keller, and V. Ferrari, Training Object Class Detectors from Eye Tracking Data, European Conference on Computer Vision (ECCV), vol.32, p.63, 2014.

G. T. Papadopoulos, K. C. Apostolakis, and P. Daras, Gaze-Based Relevance Feedback for Realizing Region-Based Image Retrieval, IEEE Transactions on Multimedia, vol.16, p.31, 2014.

D. Picard, M. Cord, and A. Revel, Image retrieval over networks: Active learning using ant algorithm, IEEE Transactions on Multimedia, vol.10, issue.7, p.54, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00656363

P. Pouladzadeh, S. Shirmohammadi, and R. Al-maghrabi, Measuring Calorie and Nutrition From Food Image, IEEE Transactions on Instrumentation and Measurement, vol.63, p.22, 2014.

P. Pouladzadeh, A. Yassine, and S. Shirmohammadi, FooDD: Food Detection Dataset for Calorie Measurement Using Food Images, New Trends in Image Analysis and Processing-ICIAP 2015 Workshops: ICIAP 2015 International Workshops, pp.20-22, 2015.

P. Kohli, L. Ladický, and P. H. S. Torr, Robust Higher Order Potentials for Enforcing Label Consistency, Int. J. Comput. Vision, vol.82, p.31, 2009.

R. Subramanian, V. Yanulevskaya, and N. Sebe, Can computers learn from humans to see better?: inferring scene semantics from viewers' eye movements, International Conference on Multimedia, p.30, 2011.

S. Ramanathan, H. Katti, N. Sebe, M. S. Kankanhalli, and T. Chua, An Eye Fixation Database for Saliency Detection in Images, European Conference on Computer Vision, p.31, 2010.

S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele et al., Generative Adversarial Text-to-Image Synthesis, Proceedings of The 33rd International Conference on Machine Learning, p.97, 2016.

W. Ren, K. Huang, D. Tao, and T. Tan, Weakly Supervised Large Scale Object Localization with Multiple Instance Learning and Bag Splitting, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, pp.405-416, 2016.

M. Rohrbach and S. Amin, A database for fine grained activity detection of cooking activities, CVPR, p.21, 2012.

F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para), 1957.

O. Russakovsky, Y. Lin, K. Yu, and L. Fei-fei, Object-Centric Spatial Pooling for Image Classification, Computer Vision-ECCV 2012: 12th European Conference on Computer Vision, pp.1-15, 2012.

A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, What's the Point: Semantic Segmentation with Point Supervision, 2016.

G. Salton and M. J. Mcgill, Introduction to Modern Information Retrieval, p.3, 1986.

A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli et al., Learning Cross-modal Embeddings for Cooking Recipes and Food Images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p.22, 2017.

G. Saon, T. Sercu, S. J. Rennie, and H. J. Kuo, The IBM 2016 English conversational telephone speech recognition system, pp.7-11, 2016.

H. Sattar, A. Bulling, and M. Fritz, Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling, vol.97, p.31, 2016.

H. Sattar, S. Muller, M. Fritz, and A. Bulling, Prediction of Search Targets From Fixations in Open-World Settings, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

F. Schroff, A. Criminisi, and A. Zisserman, Harvesting Image Databases from the Web, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, p.41, 2011.

N. Shapovalova, M. Raptis, L. Sigal, and G. Mori, Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization, Advances in Neural Information Processing Systems, pp.2409-2417, 2013.

I. Shcherbatyi, A. Bulling, and M. Fritz, GazeDPM: Early Integration of Gaze Information in Deformable Part Models, vol.35, p.31, 2015.

W. Shen, X. Bai, Z. Hu, and Z. Zhang, Multiple instance subspace learning via partial random projection tree for local reflection symmetry in natural images, Pattern Recognition, vol.52, p.34, 2016.

A. Shrivastava, V. M. Patel, J. K. Pillai, and R. Chellappa, Generalized Dictionaries for Multiple Instance Learning, In: Int. J. Comput. Vision, vol.114, issue.2-3, p.34, 2015.

D. Silver, A. Huang et al., Mastering the game of Go with deep neural networks and tree search, Nature, vol.529, pp.484-489, 2016.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015.

J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, International Conference on Computer Vision (ICCV), 2003.

A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, Content-Based Image Retrieval at the End of the Early Years, IEEE Trans. Pattern Anal. Mach. Intell, vol.22, issue.12, pp.1349-1380, 2000.

Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, Contextualizing object detection and classification, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.87, p.86, 2011.
DOI : 10.1109/cvpr.2011.5995330

URL : https://pure.qub.ac.uk/portal/files/148946383/Contextualizing_Object_Detection_and_Classification_TPAMI2015.pdf

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, In: J. Mach. Learn. Res, vol.15, p.19, 2014.

J. Steil and A. Bulling, Discovery of Everyday Human Activities from Long-term Visual Behaviour Using Topic Models, Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. UbiComp '15, p.31, 2015.

S. Stein and S. J. Mckenna, User-adaptive Models for Recognizing Food Preparation Activities, ACM MM workshop CEA, p.21, 2013.
DOI : 10.1145/2506023.2506031

H. Su, T. Lin, C. Li, M. Shan, and J. Chang, Automatic Recipe Cuisine Classification by Ingredients, Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. UbiComp '14 Adjunct, p.22, 2014.
DOI : 10.1145/2638728.2641335

H. Su, J. Deng, and L. Fei-fei, Crowdsourcing Annotations for Visual Object Detection, AAAI Workshop, p.31, 2012.

J. Sun and J. Ponce, Learning Discriminative Part Detectors for Image Classification and Cosegmentation, International Conference on Computer Vision (ICCV), p.34, 2013.
DOI : 10.1109/iccv.2013.422

URL : https://hal.archives-ouvertes.fr/hal-00932380

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going Deeper with Convolutions, Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/cvpr.2015.7298594

URL : http://arxiv.org/pdf/1409.4842

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, p.40, 2016.

R. Toldo, U. Castellani, and A. Fusiello, A Bag of Words Approach for 3D Object Categorization, Proceedings of the 4th International Conference on Computer Vision/Computer Graphics Collaboration Techniques. MIRAGE '09, p.16, 2009.

V. Vapnik and R. Izmailov, Learning Using Privileged Information: Similarity Control and Knowledge Transfer, J. Mach. Learn. Res., vol.16, p.63, 2015.
DOI : 10.1007/978-3-319-17091-6_1

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention Is All You Need, arXiv, p.96, 2017.

E. Vig, M. Dorr, and D. D. Cox, Saliency-based selection of sparse descriptors for action recognition, IEEE International Conference on Image Processing (ICIP), p.26, 2012.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and Tell: A Neural Image Caption Generator, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.97, 2015.
DOI : 10.1109/cvpr.2015.7298935

URL : http://arxiv.org/pdf/1411.4555

W. Susanto, M. Rohrbach, and B. Schiele, 3D Object Detection with Multiple Kinects, Computer Vision-ECCV 2012. Workshops and Demonstrations, p.21, 2012.
DOI : 10.1007/978-3-642-33868-7_10

T. Walber, A. Scherp, and S. Staab, Can You See It? Two Novel Eye-Tracking-Based Measures for Assigning Tags to Image Regions, Advances in Multimedia Modeling, International Conference, p.31, 2013.
DOI : 10.1007/978-3-642-35725-1_4

H. Wang and M. Pomplun, The attraction of visual attention to texts in real-world scenes, Journal of Vision, vol.12, p.80, 2012.

J. Wang, M. Perreira Da Silva, P. Le Callet, and V. Ricordel, Computational Model of Stereoscopic 3D Visual Saliency, IEEE Transactions on Image Processing, vol.22, p.31, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00788847

J. Wang, Y. Li, Y. Zhang, C. Wang, H. Xie et al., Bag-of-Features Based Medical Image Retrieval via Multiple Assignment and Visual Words Weighting, IEEE Trans. Med. Imaging, vol.30, p.15, 2011.

J. Wang and J. Zucker, Solving the Multiple-Instance Problem: A Lazy Learning Approach, Proceedings of the Seventeenth International Conference on Machine Learning. ICML '00, p.33, 2000.

J. Wang, D. M. Chandler, and P. Le Callet, Quantifying the relationship between visual salience and visual importance, vol.7527, p.30, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00494592

X. Wang, N. Thome, and M. Cord, Gaze latent support vector machine for image classification, IEEE International Conference on Image Processing (ICIP), vol.12, pp.236-240, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01342580

X. Wang, N. Thome, and M. Cord, Gaze Latent Support Vector Machine for Image Classification Improved by Weakly Supervised Region Selection, Pattern Recognition, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01557368

X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, Recipe recognition with large multimodal food dataset, IEEE International Conference on Multimedia & Expo Workshops, pp.1-6, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01196959

X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu, Max-Margin Multiple-Instance Dictionary Learning, International Conference on Machine Learning, p.34, 2013.

X. Wang, Z. Zhu, C. Yao, and X. Bai, Relaxed Multiple-Instance SVM with Application to Object Discovery, International Conference on Computer Vision (ICCV), p.34, 2015.

X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, Revisiting Multiple Instance Neural Networks, p.33, 2016.

Y. Wang and G. Mori, Human Action Recognition by Semilatent Topic Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, issue.10, p.15, 2009.

W. Wu and J. Yang, Fast food recognition from videos of eating for calorie estimation, 2009 IEEE International Conference on Multimedia and Expo, p.22, 2009.

H. Xie, L. Yu, and Q. Li, A Hybrid Semantic Item Model for Recipe Search by Example, 2010 IEEE International Symposium on Multimedia, p.22, 2010.

J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg et al., Gaze-enabled egocentric video summarization via constrained submodular maximization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.31, 2015.

R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song et al., Geolocalized Modeling for Dish Recognition, IEEE Transactions on Multimedia, vol.17, issue.8, p.22, 2015.

S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, Food recognition using statistics of pairwise local features, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol.40, p.22, 2010.

Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen, Review Networks for Caption Generation, Advances in Neural Information Processing Systems (NIPS), p.96, 2016.

A. L. Yarbus, Eye Movements and Vision, Plenum, vol.25, p.24, 1967.

P. Ye and D. Doermann, No-Reference Image Quality Assessment Using Visual Codebooks, IEEE Transactions on Image Processing, vol.21, issue.7, p.16, 2012.

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How transferable are features in deep neural networks?, Advances in Neural Information Processing Systems (NIPS), pp.3320-3328, 2014.

C.-N. J. Yu and T. Joachims, Learning structural SVMs with latent variables, International Conference on Machine Learning (ICML), p.96, 2009.

A. L. Yuille and A. Rangarajan, The Concave-Convex Procedure (CCCP), Advances in Neural Information Processing Systems, vol.14, p.62, 2001.

K. Yun, Y. Peng, D. Samaras, G. J. Zelinsky, and T. L. Berg, Studying Relationships between Human Gaze, Description, and Computer Vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.739-746, 2013.

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, European Conference on Computer Vision, vol.87, p.86, 2014.

G. J. Zelinsky, Y. Peng, and D. Samaras, Eye can read your mind: Decoding gaze fixations to reveal categorical search targets, Journal of Vision, vol.13, p.31, 2013.

M. Zhang and Z. Zhou, Improve Multi-Instance Neural Networks through Feature Selection, Neural Processing Letters, vol.19, p.33, 2004.

Q. Zhang and S. A. Goldman, EM-DD: An Improved Multiple-Instance Learning Technique, Advances in Neural Information Processing Systems, p.33, 2001.

W. Zhang, A. Borji, Z. Wang, P. Le Callet, and H. Liu, The Application of Visual Saliency Models in Objective Image Quality Assessment: A Statistical Evaluation, IEEE Transactions on Neural Networks and Learning Systems, vol.27, issue.6, p.30, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01675144

F. Zhou and Y. Lin, Fine-Grained Image Classification by Exploring Bipartite-Graph Labels, 2016 IEEE Conference on Computer Vision and Pattern Recognition, p.22, 2016.

J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation, Transactions of the Association for Computational Linguistics (TACL), vol.4, pp.371-383, 2016.

Z. Zhou, Multi-Instance Learning: A Survey, Tech. rep., National Laboratory for Novel Software Technology, p.35, 2004.

Z. Zhou, Y. Sun, and Y. Li, Multi-instance Learning by Treating Instances As non-I.I.D. Samples, Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09, p.35, 2009.

Z. Zhou and M. Zhang, Ensembles of Multi-instance Learners, Machine Learning: ECML 2003: 14th European Conference on Machine Learning, p.33, 2003.