J. Aggarwal and M. S. Ryoo, Human activity analysis, ACM Computing Surveys, vol.43, issue.3, pp.16-43, 2011.
DOI : 10.1145/1922649.1922653

Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, Good Practice in Large-Scale Learning for Image Classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.36, issue.3, pp.59-82, 2013.
DOI : 10.1109/TPAMI.2013.146

URL : https://hal.archives-ouvertes.fr/hal-00690014

P. Atrey, M. Hossain, A. Saddik, and M. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimedia Systems, vol.24, issue.11, pp.1-35, 2010.
DOI : 10.1007/s00530-010-0182-0

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with sparsity-inducing penalties. arXiv preprint, p.72, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00613125

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Structured Sparsity through Convex Optimization, Statistical Science, vol.27, issue.4, pp.450-468, 2012.
DOI : 10.1214/12-STS394

URL : https://hal.archives-ouvertes.fr/hal-00621245

F. R. Bach, G. R. Lanckriet, and M. I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, Twenty-first international conference on Machine learning, ICML '04, pp.6-59, 2004.
DOI : 10.1145/1015330.1015424

L. Ballan, M. Bertini, A. D. Bimbo, and G. Serra, Video event classification using bag of words and string kernels. Image Analysis and Processing – ICIAP, pp.170-178, 2009.

L. Ballan, M. Bertini, A. Del-bimbo, L. Seidenari, and G. Serra, Event Detection and Recognition for Semantic Annotation of Video, Multimedia Tools and Applications, pp.1-24, 2010.

N. Ballas, B. Delezoide, and F. Prêteux, Trajectories based descriptor for dynamic events annotation, Proceedings of the 2011 joint ACM workshop on Modeling and representing events, J-MRE '11, pp.13-18, 2011.
DOI : 10.1145/2072508.2072512

N. Ballas, B. Delezoide, and F. Prêteux, A new point process model for trajectory-based events annotation, Image Processing: Machine Vision Applications V, pp.83000-138, 2012.
DOI : 10.1117/12.912088

A. L. Bao, S. Yu, Z. Lan, A. Overwijk, Q. Jin et al., Informedia@ trecvid 2011 multimedia event detection, semantic indexing, TREC Video Retrieval Evaluation Workshop, vol.1, pp.107-123, 2011.

M. Bar, Visual objects in context, Nature Reviews Neuroscience, vol.8, issue.8, pp.617-629, 2004.
DOI : 10.1016/0001-6918(66)90003-5

M. Barnachon, S. Bouakaz, and B. Boufama, Interprétation temps de mouvement réel, RFIA, p.44, 2012.

H. Bay, T. Tuytelaars, and L. Van-gool, Surf: Speeded up robust features, Computer Vision – ECCV, pp.404-417, 2006.

S. Belongie, J. Malik, and J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, issue.4, pp.509-522, 2002.
DOI : 10.1109/34.993558

J. Besag, Spatial interaction and the statistical analysis of lattice systems, Journal of the Royal Statistical Society. Series B (Methodological), pp.192-236, 1974.

V. Bettadapura, G. Schindler, T. Plötz, and I. Essa, Augmenting bag-of-words: Data-driven discovery of temporal and structural information for activity recognition

I. Biederman, R. Mezzanotte, and J. Rabinowitz, Scene perception: Detecting and judging objects undergoing relational violations, Cognitive Psychology, vol.14, issue.2, pp.143-177, 1982.
DOI : 10.1016/0010-0285(82)90007-X

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.405.408

M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, pp.1395-1402, 2005.
DOI : 10.1109/iccv.2005.28

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.8218

A. F. Bobick and J. W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.23, issue.3, pp.44-46, 2001.
DOI : 10.1109/34.910878

A. Borji and L. Itti, State-of-the-art in visual attention modeling. Transaction on PAMI, 0198.

Y. Boureau, N. Le-roux, F. Bach, J. Ponce, and Y. Lecun, Ask the locals: Multi-way local pooling for image recognition, 2011 International Conference on Computer Vision, pp.2651-2658, 2011.
DOI : 10.1109/ICCV.2011.6126555

URL : https://hal.archives-ouvertes.fr/hal-00646816

T. Brox and J. Malik, Object Segmentation by Long Term Analysis of Point Trajectories, Computer Vision – ECCV 2010, pp.282-295, 2010.
DOI : 10.1007/978-3-642-15555-0_21

S. S. Bucak, R. Jin, and A. K. Jain, Multiple kernel learning for visual object recognition: A review, IEEE Transactions on Pattern Analysis and Machine Intelligence, p.59, 2013.

C. J. Burges, A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, pp.121-167, 1998.

L. Cao, Y. Mu, A. Natsev, S. Chang, G. Hua et al., Scene Aligned Pooling for Complex Video Recognition, pp.79-130, 2012.
DOI : 10.1007/978-3-642-33709-3_49

J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, Semantic Segmentation with Second-Order Pooling, Computer Vision – ECCV 2012, pp.430-443
DOI : 10.1007/978-3-642-33786-4_32

A. Chan-hon-tong, N. Ballas, C. Achard, B. Delezoide, L. Lucat et al., Skeleton point trajectories for human daily activity recognition, VISAPP, 2013. 44, p.56

European Commission, Horizon 2020: The EU Framework Programme for Research and Innovation, p.33, 2011.

K. Crammer and Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, The Journal of Machine Learning Research, vol.2, pp.265-292, 2002.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.886-893, 2005.
DOI : 10.1109/CVPR.2005.177

URL : https://hal.archives-ouvertes.fr/inria-00548512

N. Dalal, B. Triggs, and C. Schmid, Human Detection Using Oriented Histograms of Flow and Appearance, Computer Vision – ECCV, vol.38, issue.1, pp.428-441, 2006.
DOI : 10.1023/A:1008162616689

URL : https://hal.archives-ouvertes.fr/inria-00548587

B. Delezoide, Multimedia movie segmentation using low-level and semantic features, p.62

B. Delezoide, G. Pitel, and H. L. Borgne, Object/background scene classification in photographs using linguistic statistics from the web, p.60, 2008.

J. Deng, K. Li, M. Do, H. Su, and L. Fei-fei, Construction and Analysis of a Large Scale Image Ontology, p.186, 2009.

X. Descombes and J. Zerubia, Marked point process in image analysis, IEEE Signal Processing Magazine, vol.19, issue.5, pp.77-84, 2002.
DOI : 10.1109/MSP.2002.1028354

P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp.65-72, 2006.
DOI : 10.1109/VSPETS.2005.1570899

M. Everingham, L. Van-gool, C. K. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.73, issue.2, p.48, 2010.
DOI : 10.1007/s11263-009-0275-4

M. D. Fairchild, Color appearance models, p.169, 2006.
DOI : 10.1002/9781118653128

C. Farabet, C. Couprie, L. Najman, and Y. Lecun, Learning hierarchical features for scene labeling. Transactions on Pattern Analysis and Machine Intelligence, 0201.
URL : https://hal.archives-ouvertes.fr/hal-00742077

G. Farnebäck, Two-Frame Motion Estimation Based on Polynomial Expansion, Image Analysis, vol.51, issue.168, p.170, 2003.
DOI : 10.1007/3-540-45103-X_50

J. Farquhar, S. Szedmak, H. Meng, and J. Shawe-taylor, Improving "bag-of-keypoints" image categorization: generative models and pdf-kernels, p.52, 2005.

V. Ferrari, M. Marin-jimenez, and A. Zisserman, Pose search: Retrieving people using their pose, 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2009.
DOI : 10.1109/CVPR.2009.5206495

W. T. Freeman, Where computer vision needs help from computer science, ACM-SIAM Symposium on Discrete Algorithms. SIAM, p.34, 2011.
DOI : 10.1137/1.9781611973082.64

T. Gevers and A. Smeulders, Color-based object recognition, Image Analysis and Processing, pp.319-326, 1997.
DOI : 10.1016/S0031-3203(98)00036-3

C. Geyer and J. Møller, Simulation procedures and likelihood inference for spatial point processes, Scandinavian Journal of Statistics, vol.138, pp.359-373, 1994.

A. Gilbert, J. Illingworth, and R. Bowden, Fast realistic multi-action recognition using mined dense spatio-temporal features, 2009 IEEE 12th International Conference on Computer Vision, p.78, 2010.
DOI : 10.1109/ICCV.2009.5459335

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.158.3113

A. Gilbert, J. Illingworth, and R. Bowden, Action recognition using mined hierarchical compound features. Transaction on PAMI, pp.129-196, 2011.
DOI : 10.1109/tpami.2010.144

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.301.1835

M. Gönen and E. Alpaydın, Multiple kernel learning algorithms. The journal of machine learning, p.71, 2011.

Google, YouTube online statistic, 2013. URL http, p.74

L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes. Transactions on Pattern Analysis and Machine Intelligence, pp.45-63, 2007.

J. Gorski, F. Pfeuffer, and K. Klamroth, Biconvex sets and optimization with biconvex functions: a survey and extensions, Mathematical Methods of Operations Research, vol.21, issue.1, p.109, 2007.
DOI : 10.1007/s00186-007-0161-1

K. Guo, P. Ishwar, and J. Konrad, Action Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels, Advanced Video and Signal Based Surveillance, p.112, 2010.
DOI : 10.1007/978-3-642-17711-8_30

A. Gupta, A. Kembhavi, and L. S. Davis, Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, issue.10, pp.44-55, 2009.
DOI : 10.1109/TPAMI.2009.83

T. Harada, Y. Ushiku, Y. Yamashita, and Y. Kuniyoshi, Discriminative spatial pyramid, CVPR 2011, pp.160-165, 2011.
DOI : 10.1109/CVPR.2011.5995691

C. Harris and M. Stephens, A Combined Corner and Edge Detector, Proceedings of the Alvey Vision Conference 1988, pp.50-99, 1988.
DOI : 10.5244/C.2.23

A. Haubold and M. Naphade, Classification of video events using 4-dimensional time-compressed motion features, Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR '07, pp.178-185, 2007.
DOI : 10.1145/1282280.1282311

A. Hauptmann, R. Yan, and W. Lin, How many high-level concepts will fill the semantic gap in news video retrieval?, Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR '07, pp.44-201, 2007.
DOI : 10.1145/1282280.1282369

A. Hauptmann, M. Chen, M. Christel, W. Lin, and J. Yang, A Multi-Pronged Approach to Improving Semantic Extraction of News Video, Journal of Signal Processing Systems, vol.2, issue.2, pp.373-385, 2010.
DOI : 10.1007/s11265-009-0382-z

M. Heikkilä, M. Pietikäinen, and C. Schmid, Description of interest regions with center-symmetric local binary patterns. Computer Vision, Graphics and Image Processing, pp.58-69, 2006.

C. Huang, H. Shih, and C. Chao, Semantic analysis of soccer video using dynamic Bayesian network. Multimedia, IEEE Transactions on, vol.8, issue.4, pp.749-760, 2006.

Y. Huang, K. Huang, Y. Yu, and T. Tan, Salient coding for image classification, CVPR 2011, pp.1753-1760, 2011.
DOI : 10.1109/CVPR.2011.5995682

N. Ikizler-cinbis and S. Sclaroff, Object, Scene and Actions: Combining Multiple Features for Human Action Recognition, Computer Vision – ECCV, pp.494-507, 2010.
DOI : 10.1007/978-3-642-15549-9_36

N. Inoue, Y. Kamishima, T. Wada, K. Shinoda, and S. Sato, Tokyotech+ canon at trecvid 2011, Proceedings of NIST TRECVID Workshop, p.62, 2011.

L. Itti and C. Koch, Computational modelling of visual attention, Nature Reviews Neuroscience, vol.2, issue.3, pp.194-203, 2001.
DOI : 10.1038/35058500

L. Itti, C. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, issue.11, p.168, 1998.
DOI : 10.1109/34.730558

A. Jain, A. Gupta, M. Rodriguez, and L. S. Davis, Representing Videos Using Mid-level Discriminative Patches, 2013 IEEE Conference on Computer Vision and Pattern Recognition, p.77, 2013.
DOI : 10.1109/CVPR.2013.332

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.307.3329

M. Jain, H. Jégou, and P. Bouthemy, Better Exploiting Motion for Better Action Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.40-51, 2013.
DOI : 10.1109/CVPR.2013.330

URL : https://hal.archives-ouvertes.fr/hal-00813014

W. James, The principles of psychology, p.158, 1980.

H. Jegou, M. Douze, C. Schmid, and P. Perez, Aggregating local descriptors into a compact image representation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3304-3311, 2010.
DOI : 10.1109/CVPR.2010.5540039

URL : https://hal.archives-ouvertes.fr/inria-00548637

H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez et al., Aggregating Local Image Descriptors into Compact Codes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.9, 0197.
DOI : 10.1109/TPAMI.2011.235

Y. Jia, C. Huang, and T. Darrell, Beyond spatial pyramids: Receptive field learning for pooled image features, CVPR. IEEE, p.165, 2012.

F. Jiang, J. Yuan, S. Tsaftaris, and A. Katsaggelos, Video anomaly detection in spatiotemporal context, 2010 IEEE International Conference on Image Processing, p.78, 2010.
DOI : 10.1109/ICIP.2010.5650993

W. Jiang, Advanced Techniques for Semantic Concept Detection in General Videos, p.60, 2010.

Y. Jiang, J. Wang, S. Chang, and C. Ngo, Domain adaptive semantic diffusion for large scale context-based video annotation, pp.1420-1427, 2010.

Y. Jiang, X. Zeng, G. Ye, D. Ellis, S. Chang et al., Columbia-ucf trecvid2010 multimedia event detection: Combining multiple modalities, contextual concepts, and temporal matching, TRECVID, p.62, 2010.

Y. Jiang, G. Ye, S. Chang, D. Ellis, and A. C. Loui, Consumer video understanding, Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR '11, pp.29-62, 2011.
DOI : 10.1145/1991996.1992025

Y. Jiang, S. Bhattacharya, S. Chang, and M. Shah, High-level event recognition in unconstrained videos, International Journal of Multimedia Information Retrieval, vol.73, issue.2, pp.1-29, 2012.
DOI : 10.1007/s13735-012-0024-2

Y. Jiang, Q. Dai, X. Xue, W. Liu, and C. Ngo, Trajectory-Based Modeling of Human Actions with Motion Reference Points, pp.51-193, 2012.
DOI : 10.1007/978-3-642-33715-4_31

G. Johansson, Visual perception of biological motion and a model for its analysis, Perception & Psychophysics, vol.4, issue.2, p.54, 1973.
DOI : 10.3758/BF03212378

S. Karaman, L. Seidenari, A. D. Bagdanov, and A. D. Bimbo, L1-regularized logistic regression stacking and transductive crf smoothing for action recognition in video

Y. Karklin and M. Lewicki, Is early vision optimized for extracting higher-order dependencies?, Advances in Neural Information Processing Systems (NIPS), pp.99-101, 2006.

Y. Ke, R. Sukthankar, and M. Hebert, Efficient visual event detection using volumetric features, p.50, 2005.

A. Kläser, Learning human actions in video, p.34, 2010.

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, pp.995-1004, 2008.
DOI : 10.5244/C.22.99

O. Kliper-gross, Y. Gurovich, T. Hassner, and L. Wolf, Motion Interchange Patterns for Action Recognition in Unconstrained Videos, p.69, 2012.
DOI : 10.1007/978-3-642-33783-3_19

J. Kludas, E. Bruno, and S. Marchand-maillet, Information Fusion in Multimedia Information Retrieval, Adaptive Multimedial Retrieval: Retrieval, User, and Semantics, pp.147-159, 2008.
DOI : 10.1007/978-3-540-79860-6_12

P. Koniusz, F. Yan, and K. Mikolajczyk, Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection, Computer Vision and Image Understanding, vol.117, issue.5, p.163, 2012.
DOI : 10.1016/j.cviu.2012.10.010

A. Kovashka and K. Grauman, Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, CVPR. IEEE, pp.66-130, 2010.

J. Krapac, J. Verbeek, and F. Jurie, Modeling spatial layout with fisher vectors for image categorization, 2011 International Conference on Computer Vision, pp.1487-1494, 2011.
DOI : 10.1109/ICCV.2011.6126406

URL : https://hal.archives-ouvertes.fr/inria-00612277

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (NIPS), 0201.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, pp.40-188, 2011.
DOI : 10.1109/ICCV.2011.6126543

S. Kumar and M. Hebert, Discriminative random fields: A discriminative framework for contextual interaction in classification, p.61, 2008.

J. Lafferty, A. Mccallum, and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE, pp.282-289, 2001.

Z. Lan, Y. Yan, B. N. , and H. A. , Resource Constrained Multimedia Event Detection, In ACM Multimedia Modeling. IEEE, vol.186, p.194, 2014.
DOI : 10.1007/978-3-319-04114-8_33

Z. Lan, L. Bao, S. Yu, W. Liu, and A. G. Hauptmann, Multimedia classification and event detection using double fusion, Multimedia Tools and Applications, pp.1-15, 2013.
DOI : 10.1007/s11042-013-1391-2

I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, vol.17, issue.8, pp.107-123, 2005.
DOI : 10.1007/s11263-005-1838-7

I. Laptev and P. Pérez, Retrieving actions in movies, 2007 IEEE 11th International Conference on Computer Vision, pp.1-8, 2007.
DOI : 10.1109/ICCV.2007.4409105

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.130-134, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

G. Lavee, E. Rivlin, and M. Rudzsky, Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol.39, issue.5, pp.489-504, 2009.

S. Lazebnik and M. Raginsky, Supervised learning of quantizer codebooks by information loss minimization. Transactions on Pattern Analysis and Machine Intelligence, p.52, 2009.

S. Lazebnik, C. Schmid, and J. Ponce, A sparse texture representation using local affine regions, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.27, issue.8, pp.1265-1278, 2005.
DOI : 10.1109/TPAMI.2005.151

URL : https://hal.archives-ouvertes.fr/inria-00548530

S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR'06), pp.130-134, 2006.
DOI : 10.1109/CVPR.2006.68

URL : https://hal.archives-ouvertes.fr/inria-00548585

Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, CVPR 2011, p.201, 2011.
DOI : 10.1109/CVPR.2011.5995496

Y. Lecun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, A tutorial on energy-based learning, Predicting Structured Data, vol.1, issue.81, p.82, 2006.

Y. Lee, Y. Lin, and G. Wahba, Multicategory Support Vector Machines, Journal of the American Statistical Association, vol.99, issue.465, pp.9967-81, 2004.
DOI : 10.1198/016214504000000098

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.1879

J. Lezama, K. Alahari, J. Sivic, and I. Laptev, Track to the future: Spatio-temporal video segmentation with long-range motion cues, CVPR 2011, p.205, 2011.
DOI : 10.1109/CVPR.2011.6044588

URL : https://hal.archives-ouvertes.fr/hal-00817961

L. Li, H. Su, L. Fei-fei, and E. P. Xing, Object bank: A high-level image representation for scene classification & semantic feature sparsification, Advances in neural information processing systems, pp.44-56, 2010.

T. Lindeberg, Feature detection with automatic scale selection, International Journal of Computer Vision, vol.30, issue.2, pp.79-116, 1998.
DOI : 10.1023/A:1008045108935

J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos "in the wild", pp.66-67

L. Liu, L. Wang, and X. Liu, In defense of soft-assignment coding, p.144, 2011.

D. Lowe, Object recognition from local scale-invariant features, Proceedings of the Seventh IEEE International Conference on Computer Vision, pp.49-75, 1999.
DOI : 10.1109/ICCV.1999.790410

D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60, issue.2, pp.91-110, 2004.
DOI : 10.1023/B:VISI.0000029664.99615.94

Z. Ma, Y. Yang, F. Nie, and N. Sebe, Thinking of Images as What They Are: Compound Matrix Regression for Image Classification, International Joint Conferences on Artificial Intelligence (IJCAI), 2013. 99, p.110

J. Macqueen, Some methods for classification and analysis of multivariate observations, Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp.14-96, 1967.

T. Malisiewicz, A. Gupta, and A. Efros, Ensemble of exemplar-SVMs for object detection and beyond, 2011 International Conference on Computer Vision, 0200.
DOI : 10.1109/ICCV.2011.6126229

B. S. Manjunath and W. Ma, Texture features for browsing and retrieval of image data. Transactions on Pattern Analysis and Machine Intelligence, p.45, 1996.

J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs et al., Big data: The next frontier for innovation, competition, and productivity, p.33, 2011.

M. Marszalek, I. Laptev, and C. Schmid, Actions in context. In Computer Vision and Pattern Recognition, CVPR 2009. IEEE Conference on, pp.2929-2936, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00548645

M. Martin, Le langage cinématographique, Cerf, vol.75, p.131, 1985.

J. Matas, O. Chum, M. Urban, and T. Pajdla, Robust wide-baseline stereo from maximally stable extremal regions, Image and Vision Computing, vol.22, issue.10, pp.761-767, 2004.
DOI : 10.1016/j.imavis.2004.02.006

P. Matikainen, M. Hebert, and R. Sukthankar, Trajectons: Action recognition through the motion analysis of tracked features, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, p.95, 2009.
DOI : 10.1109/ICCVW.2009.5457659

M. Mazloom, E. Gavves, K. Van-de-sande, and C. Snoek, Searching informative concept banks for video event detection, Proceedings of the 3rd ACM conference on International conference on multimedia retrieval, ICMR '13, pp.255-262
DOI : 10.1145/2461466.2461507

M. Merler, B. Huang, L. Xie, G. Hua, and A. Natsev, Semantic model vectors for complex video event recognition. Multimedia, IEEE Transactions on, vol.14, issue.56, pp.88-101

R. Messing, C. Pal, and H. Kautz, Activity recognition using the velocity histories of tracked keypoints, 2009 IEEE 12th International Conference on Computer Vision, pp.104-111, 2009.
DOI : 10.1109/ICCV.2009.5459154

V. Mezaris, A. Dimou, and I. Kompatsiaris, Local Invariant Feature Tracks for High-Level Video Feature Extraction, Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services, pp.44-51, 2010.
DOI : 10.1007/978-1-4614-3831-1_10

Microsoft, Microsoft Kinect, 2013. URL http

K. Mikolajczyk and C. Schmid, Scale & affine invariant interest point detectors. IJCV, pp.163-171, 2004.
URL : https://hal.archives-ouvertes.fr/inria-00548554

F. Moosmann, D. Larlus, and F. Jurie, Learning saliency maps for object categorization, p.163, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00203726

O. R. Murthy and R. Goecke, Combined ordered and improved trajectories for large scale human action recognition

H. Naphade and T. Huang, A probabilistic framework for semantic video indexing, filtering, and retrieval. Multimedia, IEEE Transactions on, vol.60, p.78, 2002.

P. Natarajan, P. Natarajan, V. Manohar, S. Wu, S. Tsakalidis et al., Bbn viser trecvid 2011 multimedia event detection system, NIST TRECVID Workshop, p.62, 2011.

C. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman et al., QBIC project: querying images by content, using color, texture, and shape, Storage and Retrieval for Image and Video Databases, p.45, 1993.
DOI : 10.1117/12.143648

E. Nowak, F. Jurie, and B. Triggs, Sampling Strategies for Bag-of-Features Image Classification, p.49, 2006.
DOI : 10.1007/11744085_38

URL : https://hal.archives-ouvertes.fr/hal-00203752

T. Ojala, M. Pietikäinen, and T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, issue.7, pp.971-987, 2002.
DOI : 10.1109/TPAMI.2002.1017623

A. Oliva and A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision, vol.42, issue.3, pp.145-175, 2001.
DOI : 10.1023/A:1011139631724

D. Parikh and D. Batra, CRFs for Image Classification, p.61, 2003.

D. Parikh and T. Chen, Determining Patch Saliency Using Low-Level Context, p.163, 2008.
DOI : 10.1007/978-3-540-88688-4_33

A. Patron-perez, M. Marszalek, I. Reid, and A. Zisserman, Structured Learning of Human Interactions in TV Shows, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.12, pp.14-40, 2012.
DOI : 10.1109/TPAMI.2012.24

F. Perronnin and C. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 0197.
DOI : 10.1109/CVPR.2007.383266

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher kernel for large-scale image classification, Computer Vision – ECCV 2010, pp.143-156, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548630

H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, Bilinear classifiers for visual recognition, Advances in Neural Information Processing Systems (NIPS), pp.1482-1490, 2009.

A. Popescu and N. Ballas, Cea list's participation at mediaeval 2012 placing task, p.77

R. Poppe, Vision-based human motion analysis: An overview. Computer vision and image understanding, pp.4-18, 2007.

R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing, vol.28, issue.6, p.43, 2010.
DOI : 10.1016/j.imavis.2009.11.014

G. Qi, X. Hua, Y. Rui, J. Tang, T. Mei et al., Correlative multilabel video annotation, Proceedings of the 15th international conference on Multimedia, pp.17-26, 2007.

E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, Segmenting Salient Objects from Images and Videos, pp.168-169, 2010.
DOI : 10.1007/978-3-642-15555-0_27

K. Raja, I. Laptev, P. Pérez, and L. Oisel, Joint pose estimation and action recognition in image graphs, 2011 18th IEEE International Conference on Image Processing, p.55, 2011.
DOI : 10.1109/ICIP.2011.6116197

URL : https://hal.archives-ouvertes.fr/hal-01063329

D. Ramanan, Learning to parse images of articulated bodies, Advances in neural information processing systems, p.54, 2006.

M. Raptis, D. Kirovski, and H. Hoppe, Real-time classification of dance gestures from skeleton animation, Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '11, pp.44-56, 2011.
DOI : 10.1145/2019406.2019426

K. Reddy and M. Shah, Recognizing 50 human action categories of web videos. MVA, 2012, pp.67-201

E. Renshaw and A. Särkkä, Gibbs point processes for studying the development of spatial-temporal stochastic processes, Computational Statistics & Data Analysis, vol.36, issue.1, pp.85-105, 2001.
DOI : 10.1016/S0167-9473(00)00028-1

M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, What helps where – and why? Semantic relatedness for knowledge transfer, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.910-917, 2010.
DOI : 10.1109/CVPR.2010.5540121

M. Ryoo, C. Chen, J. Aggarwal, and A. Roy-chowdhury, An overview of contest on semantic description of human activities (sdha) 2010. Recognizing Patterns in Signals, Speech, Images and Videos, pp.270-285, 2010.

S. Sadanand and J. J. Corso, Action bank: A high-level representation of activity in video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.69-193, 2012.
DOI : 10.1109/CVPR.2012.6247806

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., pp.65-66
DOI : 10.1109/ICPR.2004.1334462

A. Shabou and H. Le-borgne, Locality-constrained and spatially regularized coding for scene categorization, 2012 IEEE Conference on Computer Vision and Pattern Recognition, p.52, 2012.
DOI : 10.1109/CVPR.2012.6248107

F. Shahbaz-khan, J. Van-de-weijer, and M. Vanrell, Top-down color attention for object recognition, 2009 IEEE 12th International Conference on Computer Vision, p.163, 2009.
DOI : 10.1109/ICCV.2009.5459362

O. Shamir and T. Zhang, Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes, Journal of Machine Learning Research, p.143, 2013.

G. Sharma, F. Jurie, and C. Schmid, Discriminative spatial saliency for image classification, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.162-163
DOI : 10.1109/CVPR.2012.6248093

URL : https://hal.archives-ouvertes.fr/hal-00714311

F. Shi, E. Petriu, and R. Laganiere, Sampling Strategies for Real-Time Action Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.66-193
DOI : 10.1109/CVPR.2013.335

S. Singh, A. Gupta, and A. A. Efros, Unsupervised Discovery of Mid-Level Discriminative Patches, p.77, 2012.
DOI : 10.1007/978-3-642-33709-3_6

R. Sivalingam, D. Boley, V. Morellas, and N. Papanikolopoulos, Positive definite dictionary learning for region covariances, 2011 International Conference on Computer Vision, pp.99-101, 2011.
DOI : 10.1109/ICCV.2011.6126346

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, pp.95-96, 2003.
DOI : 10.1109/ICCV.2003.1238663

A. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns and TRECVid, Proceedings of the 8th ACM international workshop on Multimedia information retrieval, MIR '06, pp.321-330, 2006.
DOI : 10.1145/1178677.1178722

A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, Content-based image retrieval at the end of the early years. Pattern Analysis and Machine Intelligence

J. Smith, M. Naphade, and A. Natsev, Multimedia semantic indexing using model vectors, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), 2003.
DOI : 10.1109/ICME.2003.1221649

S. T. Smith, Covariance, subspace, and intrinsic Cramér-Rao bounds, IEEE Transactions on Signal Processing, vol.53, issue.5, pp.1610-1630, 2005.
DOI : 10.1109/TSP.2005.845428

C. G. Snoek and M. Worring, Concept-Based Video Retrieval, Foundations and Trends® in Information Retrieval, vol.2, issue.4, pp.215-322, 2008.
DOI : 10.1561/1500000014

C. G. Snoek, M. Worring, and A. W. Smeulders, Early versus late fusion in semantic video analysis, Proceedings of the 13th annual ACM international conference on Multimedia, MULTIMEDIA '05, pp.399-402, 2005.
DOI : 10.1145/1101149.1101236

B. Solmaz, S. M. Assari, and M. Shah, Classifying web videos using a global video descriptor. MVA, 2012, p.69

K. Soomro, A. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, pp.68-185

J. Sun, X. Wu, S. Yan, L. Cheong, T. Chua et al., Hierarchical spatiotemporal context modeling for action recognition, Computer Vision and Pattern Recognition (CVPR), pp.62-77, 2009.

J. Sung, C. Ponce, B. Selman, and A. Saxena, Human activity detection from RGBD images, Plan, Activity, and Intent Recognition, p.64, 2011.

C. Sutton and A. McCallum, An Introduction to Conditional Random Fields for Relational Learning. Introduction to statistical relational learning, pp.93-61, 2007.

J. B. Tenenbaum and W. T. Freeman, Separating Style and Content with Bilinear Models, Neural Computation, vol.13, issue.6, pp.1247-1283, 2000.
DOI : 10.1016/0167-6393(88)90018-0

M. Tenorth, J. Bandouch, and M. Beetz, The TUM Kitchen Data Set of everyday manipulation activities for motion tracking and action recognition, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp.1089-1096, 2009.
DOI : 10.1109/ICCVW.2009.5457583

L. Torresani, M. Szummer, and A. Fitzgibbon, Efficient Object Category Recognition Using Classemes, Computer Vision - ECCV 2010, pp.44-57, 2010.
DOI : 10.1007/978-3-642-15549-9_56

A. M. Treisman and G. Gelade, A feature-integration theory of attention, Cognitive psychology, vol.157, p.159, 1980.

T. Tuytelaars, Dense interest points, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.2281-2288, 2010.
DOI : 10.1109/CVPR.2010.5539911

O. Tuzel, F. Porikli, and P. Meer, Region covariance: A fast descriptor for detection and classification, Computer Vision - ECCV, vol.99, issue.100188, pp.101-187, 2006.

UCF, THUMOS: The first international workshop on action recognition with a large number of classes.

L. Valet, G. Mauris, and P. Bolon, A statistical overview of recent literature in information fusion, Information Fusion Proceedings of the Third International Conference on, p.62, 2000.
URL : https://hal.archives-ouvertes.fr/hal-00514175

K. Van-de-sande, T. Gevers, and C. Snoek, Evaluating color descriptors for object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol.32, issue.48, pp.1582-1596, 2010.

J. Van-gemert, C. Veenman, A. Smeulders, and J. Geusebroek, Visual word ambiguity. Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol.32, issue.7, pp.1271-1283, 2010.

V. N. Vapnik and A. Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability & Its Applications, p.107, 1971.

F. Wang, Y. Ma, H. Zhang, and J. Li, A generic framework for semantic sports video analysis using dynamic Bayesian networks, 2005.

F. Wang, Y. Jiang, and C. Ngo, Video event detection using motion relativity and visual relatedness, Proceeding of the 16th ACM international conference on Multimedia, MM '08, pp.239-248, 2008.
DOI : 10.1145/1459359.1459392

H. Wang and C. Schmid, LEAR-INRIA submission for the THUMOS workshop, p.14

H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid, Evaluation of local spatio-temporal features for action recognition, Proceedings of the British Machine Vision Conference 2009, pp.75-78, 2009.
DOI : 10.5244/C.23.124

URL : https://hal.archives-ouvertes.fr/inria-00439769

H. Wang, A. Kläser, C. Schmid, and C. Liu, Action recognition by dense trajectories, CVPR 2011, pp.98-115, 2011.
DOI : 10.1109/CVPR.2011.5995407

URL : https://hal.archives-ouvertes.fr/inria-00583818

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense Trajectories and Motion Boundary Descriptors for Action Recognition, International Journal of Computer Vision, vol.73, issue.2, pp.1-20
DOI : 10.1007/s11263-012-0594-8

URL : https://hal.archives-ouvertes.fr/hal-00725627

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, p.51, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

J. Wang, J. Yang, K. Yu, F. Lv, T. Huang et al., Locality-constrained Linear Coding for image classification, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3360-3367, 2010.
DOI : 10.1109/CVPR.2010.5540018

L. Wang, Y. Li, J. Jia, J. Sun, D. Wipf et al., Learning sparse covariance patterns for natural scenes, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp.2767-2774

Z. Wang, B. Fan, and F. Wu, Local intensity order pattern for feature description, ICCV. IEEE, p.163, 2011.

M. Weng and Y. Chuang, Multi-cue fusion for semantic video indexing, Proceeding of the 16th ACM international conference on Multimedia, MM '08, pp.71-80, 2008.
DOI : 10.1145/1459359.1459370

J. Weston and C. Watkins, Support vector machines for multi-class pattern recognition, ESANN, pp.61-72, 1999.

G. Willems, T. Tuytelaars, and L. Van-gool, An efficient dense and scale-invariant spatio-temporal interest point detector, Computer Vision - ECCV, vol.50, pp.650-663, 2008.

L. Wolf, H. Jhuang, and T. Hazan, Modeling Appearances with Low-Rank SVM, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-6, 2007.
DOI : 10.1109/CVPR.2007.383099

S. Wu, O. Oreifej, and M. Shah, Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories, 2011 International Conference on Computer Vision, pp.1419-1426, 2011.
DOI : 10.1109/ICCV.2011.6126397

Y. Xiang, X. Zhou, Z. Liu, T. Chua, and C. Ngo, Semantic context modeling with maximal margin Conditional Random Fields for automatic image annotation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3368-3375, 2010.
DOI : 10.1109/CVPR.2010.5540015

R. Yan, M. Chen, and A. Hauptmann, Mining Relationship Between Video Concepts using Probabilistic Graphical Models, 2006 IEEE International Conference on Multimedia and Expo, pp.301-304, 2006.
DOI : 10.1109/ICME.2006.262458

J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid matching using sparse coding for image classification, CVPR. IEEE, pp.88-96, 2009.

W. Yang, Y. Wang, and G. Mori, Recognizing human actions from still images with latent poses, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.2030-2037, 2010.
DOI : 10.1109/CVPR.2010.5539879

Y. Yang, Y. Yang, Z. Huang, H. T. Shen, and F. Nie, Tag localization with spatial correlations and joint group sparsity, CVPR 2011, pp.881-888, 2011.
DOI : 10.1109/CVPR.2011.5995499

A. Yao, J. Gall, G. Fanelli, and L. Van-gool, Does Human Action Recognition Benefit from Pose Estimation?, Proceedings of the British Machine Vision Conference 2011, pp.44-56, 2011.
DOI : 10.5244/C.25.67

B. Yao and L. Fei-fei, Modeling mutual context of object and human pose in human-object interaction activities, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.17-24, 2010.
DOI : 10.1109/CVPR.2010.5540235

A. L. Yarbus, B. Haigh, and L. A. Riggs, Eye movements and vision, p.159, 1967.
DOI : 10.1007/978-1-4899-5379-7

H. Yu, M. Li, H. Zhang, and J. Feng, Color texture moments for content-based image retrieval, International Conference on Image Processing, p.45, 2002.

K. Yu, T. Zhang, and Y. Gong, Nonlinear learning using local coordinate coding, NIPS, vol.96, p.97, 2009.

K. Yu, Y. Lin, and J. Lafferty, Learning image representations from the pixel level via hierarchical sparse coding, CVPR 2011, pp.99-101
DOI : 10.1109/CVPR.2011.5995732